huawei-noah / vega

AutoML tools chain
http://www.noahlab.com.hk/opensource/vega/

Specific database #242

Open vanessasidrim opened 2 years ago

vanessasidrim commented 2 years ago

Is it possible to run SP-NAS with a custom dataset (other than MS COCO, Pascal VOC, ...)?

zhangjiajin commented 2 years ago

@vanessasidrim

Custom datasets cannot be used directly.

There are two options:

  1. Implement your dataset class and register it into Vega.
  2. Convert your dataset to COCO format.

vanessasidrim commented 2 years ago

I converted my dataset to COCO format and managed to run SP-NAS, but the training and validation results (mAP and AP) are all zero. Is it necessary to change the metrics code as well? Does the implementation use the MS COCO API to generate these metrics?

zhangjiajin commented 2 years ago

@vanessasidrim

You just need to change the data format.

Please attach the run logs to help resolve this issue. They are located in <task id>/logs/

vanessasidrim commented 2 years ago

attached the logs: fine_tune_worker_0.log parallel_worker_1.log parallel_worker_2.log parallel_worker_3.log parallel_worker_4.log parallel_worker_5.log parallel_worker_6.log parallel_worker_7.log parallel_worker_8.log parallel_worker_9.log parallel_worker_10.log parallel_worker_11.log parallel_worker_12.log parallel_worker_13.log parallel_worker_14.log parallel_worker_15.log parallel_worker_16.log parallel_worker_17.log parallel_worker_18.log parallel_worker_19.log parallel_worker_20.log parallel_worker_21.log pipeline.log reignition_worker_8.log reignition_worker_15.log serial_worker_1.log serial_worker_2.log serial_worker_3.log serial_worker_4.log serial_worker_5.log serial_worker_6.log serial_worker_7.log serial_worker_8.log serial_worker_9.log serial_worker_10.log serial_worker_11.log serial_worker_12.log serial_worker_13.log serial_worker_14.log serial_worker_15.log serial_worker_16.log serial_worker_17.log serial_worker_18.log serial_worker_19.log serial_worker_20.log

zhangjiajin commented 2 years ago

@vanessasidrim

It is possible that the number of classes does not match the pre-trained model. Adjust the number of classes, run fine-tuning only, and check whether the precision increases.


general:
        task: 
            local_base_path: /VEGA/vega/examples/nas/sp_nas/tasks

pipeline: [fine_tune]                 # <-- Only finetune. Check whether the precision increases.

fine_tune:
    pipe_step:
        type: TrainPipeStep

    model:
        pretrained_model_file: /VEGA/vega/examples/nas/sp_nas/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
        model_desc:
            type: FasterRCNN
            convert_pretrained: True
            num_classes: <classes of your dataset>                 # <- set number of classes
            backbone:
                type: SerialBackbone

    trainer:
        type: Trainer
        epochs: 25                               # <-- fine tune 25 epochs
        # with_train: False                    # <-- disable this parameter
        optimizer:
            type: SGD
            params:
                lr: 0.02
                momentum: 0.9
                weight_decay: !!float 1e-4
        lr_scheduler:
            type: WarmupScheduler
            by_epoch: False
            params:
                warmup_type: linear
                warmup_iters: 1000
                warmup_ratio: 0.001
                after_scheduler_config:
                    type: MultiStepLR
                    by_epoch: True
                    params:
                        milestones: [ 10, 20 ]
                        gamma: 0.1
        loss:
            type: SumLoss
        metric:
            type: coco
            params:
                anno_path: /VEGA/isolador_coco/annotations/instances_val2017.json

    dataset:
        type: CocoDataset
        common:
            data_root: /VEGA/isolador_coco/
            batch_size: 4
            img_prefix: "2017"
            ann_prefix: instances

vanessasidrim commented 2 years ago

I ran with this configuration and got the following error:

Unexpected key(s) in state_dict for conversion: roi_heads.box_predictor.cls_score.weight torch.Size([91, 1024]) --> roi_heads.box_predictor.cls_score.weight torch.Size([1, 1024])

zhangjiajin commented 2 years ago

@vanessasidrim

Please update the config to specify the head name:

    model:
        pretrained_model_file: /VEGA/vega/examples/nas/sp_nas/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
        head: roi_heads                              # <-- specify the head name
        model_desc:
            type: FasterRCNN
            convert_pretrained: True
            num_classes: <classes of your dataset>
            backbone:
                type: SerialBackbone

Also update the file vega/networks/faster_rcnn.py.
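
Judging by the "Not Swap Keys" log line posted later in this thread, the head setting appears to make the loader skip pretrained weights whose names start with the given prefix, so the new head keeps its own size. A minimal sketch of that idea follows. It is not Vega's actual code: the function name split_pretrained_keys is hypothetical, and shape tuples stand in for real tensors.

```python
def split_pretrained_keys(pretrained, head="roi_heads"):
    """Split a pretrained state dict into weights to load and keys to skip.

    Keys under the head prefix are skipped so that the freshly initialised
    head, sized for the new number of classes, is left untouched.
    """
    prefix = head + "."
    to_load = {k: v for k, v in pretrained.items() if not k.startswith(prefix)}
    skipped = [k for k in pretrained if k.startswith(prefix)]
    return to_load, skipped


# Illustrative stand-in for a checkpoint: shapes instead of real tensors.
pretrained = {
    "backbone.body.conv1.weight": (64, 3, 7, 7),
    "roi_heads.box_head.fc6.weight": (1024, 12544),
    "roi_heads.box_predictor.cls_score.weight": (91, 1024),
}
to_load, skipped = split_pretrained_keys(pretrained)
print(skipped)
```

With the head weights left out, the 91-class cls_score tensor is never copied into a layer of a different size, which is the mismatch the error message reports.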

vanessasidrim commented 2 years ago

The same error occurred after the changes:

Unexpected key(s) in state_dict for conversion: roi_heads.box_predictor.cls_score.weight torch.Size([91, 1024]) --> roi_heads.box_predictor.cls_score.weight torch.Size([1, 1024])

zhangjiajin commented 2 years ago

My logs:

Before the code change:

2022-06-09 02:22:12.359 INFO ------------------------------------------------
2022-06-09 02:22:12.359 INFO   Step: fine_tune
2022-06-09 02:22:12.359 INFO ------------------------------------------------
2022-06-09 02:22:12.366 INFO init TrainPipeStep...
2022-06-09 02:22:12.366 INFO TrainPipeStep started...
2022-06-09 02:22:12.798 INFO Model was created.
2022-06-09 02:22:12.799 INFO load model weights from file, weights file=/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
2022-06-09 02:22:12.969 ERROR Failed to run worker, id: 0, message: Unexpected key(s) in state_dict for convert: roi_heads.box_predictor.cls_score.weight torch.Size([91, 1024]) --> roi_heads.box_predictor.cls_score.weight torch.Size([1, 1024])
2022-06-09 02:22:15.13 INFO ------------------------------------------------

After the code change:

2022-06-09 02:28:34.877 INFO ------------------------------------------------
2022-06-09 02:28:34.879 INFO   Step: fine_tune
2022-06-09 02:28:34.879 INFO ------------------------------------------------
2022-06-09 02:28:34.885 INFO init TrainPipeStep...
2022-06-09 02:28:34.885 INFO TrainPipeStep started...
2022-06-09 02:28:35.360 INFO Model was created.
2022-06-09 02:28:35.361 INFO load model weights from file, weights file=/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
2022-06-09 02:28:35.533 INFO Not Swap Keys: ['roi_heads.box_head.fc6.weight', 'roi_heads.box_head.fc6.bias', 'roi_heads.box_head.fc7.weight', 'roi_heads.box_head.fc7.bias', 'roi_heads.box_predictor.cls_score.weight', 'roi_heads.box_predictor.cls_score.bias', 'roi_heads.box_predictor.bbox_pred.weight', 'roi_heads.box_predictor.bbox_pred.bias']
loading annotations into memory...
Done (t=0.67s)
creating index...
index created!
loading annotations into memory...
Done (t=0.67s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
2022-06-09 02:28:44.695 INFO flops: 177.56315298500002 , params:41347.156
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed. 
      ^-- This issue is caused by a mismatch between the number of categories in the dataset and the number of categories in the configuration file. This is an expected log.
2022-06-09 02:28:45.618 ERROR Failed to run worker, id: 0, message: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
2022-06-09 02:28:47.628 INFO ------------------------------------------------

The issue has been resolved.

Check whether the configuration file contains:

head: roi_heads

Also check that the file vega/networks/faster_rcnn.py has been replaced correctly:

  1. Find Vega's location.
~/repo/automl$ pip3 show noah-vega
Name: noah-vega
Version: 1.8.0
Summary: AutoML Toolkit
Home-page: https://github.com/huawei-noah/vega
Author: Huawei Noah's Ark Lab
Author-email: 
License: Apache License 2.0
Location: /home/user/.local/lib/python3.7/site-packages      <--- here
Requires: click, distributed, numpy, opencv-python, pandas, pillow, psutil, PyYAML, pyzmq, scikit-learn, scipy, tensorboardX, thop
Required-by: 
  2. Replace the following file:
/home/user/.local/lib/python3.7/site-packages/vega/networks/faster_rcnn.py
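
The same location can also be looked up from Python, which avoids editing a copy that the interpreter does not actually import. This is a small sketch using only the standard library; the helper name installed_file is hypothetical, and the import name vega is taken from the path shown above.

```python
import importlib.util
import os


def installed_file(package, relative_path):
    """Return the path of a file inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # package not installed, or it is a single-file module
    package_dir = list(spec.submodule_search_locations)[0]
    return os.path.join(package_dir, *relative_path.split("/"))


# For Vega the call would be installed_file("vega", "networks/faster_rcnn.py").
# The standard-library "json" package is used here so the example runs anywhere.
print(installed_file("json", "decoder.py"))
```

If the printed path differs from the file that was replaced, the change was made to a different installation than the one in use.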

vanessasidrim commented 2 years ago

I managed to run it, but the results are always: current valid perfs [mAP: -1.000, AP_small: -1.000, AP_medium: -1.000, AP_large: -1.000], best valid perfs [mAP: -1.000, AP_small: -1.000, AP_medium: -1.000, AP_large: -1.000]

zhangjiajin commented 2 years ago

@vanessasidrim

Is this output from the NAS phase or the fine-tune phase?

vanessasidrim commented 2 years ago

This is the output of the fine-tune phase.

zhangjiajin commented 2 years ago

@vanessasidrim

That is because none of the predicted results hit the ground truth, so the accuracy is reported as -1. The dataset may be labeled incorrectly.
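
A quick way to look for labeling problems is to run a few consistency checks over the annotation file before training. The sketch below is only an assumption about which checks are useful (integer image ids, known image and category references, boxes in [x, y, width, height] form that stay inside the image). The function name check_coco is hypothetical and the sample data is made up.

```python
def check_coco(coco):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = []
    images = {img["id"]: img for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    for img_id in images:
        if not isinstance(img_id, int):
            problems.append(f"image id {img_id!r} is not an integer")
    for ann in coco.get("annotations", []):
        tag = f"annotation {ann.get('id')!r}"
        img = images.get(ann.get("image_id"))
        if img is None:
            problems.append(f"{tag}: image_id not found in images")
            continue
        if ann.get("category_id") not in category_ids:
            problems.append(f"{tag}: unknown category_id")
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        if w <= 0 or h <= 0:
            problems.append(f"{tag}: non-positive box size")
        elif x < 0 or y < 0 or x + w > img["width"] or y + h > img["height"]:
            problems.append(f"{tag}: box extends outside the image")
    return problems


sample = {
    "images": [{"id": 0, "file_name": "a.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "cell"}],
    "annotations": [
        {"id": 1, "image_id": 0, "category_id": 1, "bbox": [259, 176, 232, 200]},
        # Corner-style values left over from VOC: read as width/height they overflow.
        {"id": 2, "image_id": 0, "category_id": 1, "bbox": [600, 400, 640, 480]},
    ],
}
print(check_coco(sample))
```

The second sample annotation shows a typical conversion slip: VOC stores corner coordinates, so values copied over unchanged look like an oversized box in COCO form.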

vanessasidrim commented 2 years ago

Could you tell me if the segmentation values impact the calculation of these metrics?

My dataset was in VOC format, so I converted it to COCO format. The segmentation information did not exist in my data but the key is mandatory. Since I am only interested in detection, I inserted random values for this key in the .json file.

zhangjiajin commented 2 years ago

@vanessasidrim

The segmentation values do not impact the calculation of these metrics.
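
Since the segmentation values are not used for these metrics, a converter can leave that key empty instead of filling it with random numbers. Here is a minimal sketch of building one COCO annotation from a VOC corner-style box. The function name voc_box_to_coco is hypothetical, and the corner values were chosen so that the result matches the example annotation posted later in this thread.

```python
def voc_box_to_coco(xmin, ymin, xmax, ymax, image_id, category_id, ann_id):
    """Build one COCO detection annotation from a VOC corner-style box."""
    width = xmax - xmin
    height = ymax - ymin
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [xmin, ymin, width, height],  # COCO uses [x, y, w, h]
        "area": width * height,
        "iscrowd": 0,
        "segmentation": [],  # left empty: only boxes are needed for detection
    }


ann = voc_box_to_coco(259, 176, 491, 376, image_id=0, category_id=3, ann_id=1)
print(ann["bbox"], ann["area"])
```

The main point is the box layout: VOC stores two corners, while COCO stores the top-left corner plus width and height.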

We found that the number of classes also needs to be set in the dataset configuration, as shown below:

    dataset:
        type: CocoDataset
        common:
            data_root: /VEGA/isolador_coco/
            num_classes: 1          # <--- here
            batch_size: 4
            img_prefix: "2017"
            ann_prefix: instances

We are also trying to convert the VOC format to the COCO format to see if there are other issues.

zhangjiajin commented 2 years ago

@vanessasidrim

I used the voc2coco tool (https://github.com/yukkyo/voc2coco) to convert BCCD_Dataset (https://github.com/Shenggan/BCCD_Dataset) to COCO format.

Then I changed the image IDs from strings to integers, in both the id field of each image and the image_id field of each annotation.

    "images": [
        {
            "file_name": "BloodImage_00000.jpg",
            "height": 480,
            "width": 640,
            "id": 0
        }
    ],
    "annotations": [
        {
            "area": 46400,
            "iscrowd": 0,
            "bbox": [
                259,
                176,
                232,
                200
            ],
            "category_id": 3,
            "ignore": 0,
            "segmentation": [],
            "image_id": 0,
            "id": 1
        },
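
The string-to-integer change can be scripted rather than done by hand. The sketch below assumes the annotation file has already been loaded into a dict (for example with json.load) and assigns integer ids in file order. The function name integerize_image_ids and the sample data are hypothetical.

```python
def integerize_image_ids(coco):
    """Return a copy of a COCO-style dict whose image ids are integers."""
    mapping = {}
    images = []
    for new_id, img in enumerate(coco["images"]):
        mapping[img["id"]] = new_id  # remember old id -> new integer id
        images.append({**img, "id": new_id})
    annotations = [
        {**ann, "image_id": mapping[ann["image_id"]]}
        for ann in coco["annotations"]
    ]
    return {**coco, "images": images, "annotations": annotations}


raw = {
    "images": [{"id": "BloodImage_00000", "file_name": "BloodImage_00000.jpg"}],
    "annotations": [{"id": 1, "image_id": "BloodImage_00000", "category_id": 3}],
}
fixed = integerize_image_ids(raw)
print(fixed["images"][0]["id"], fixed["annotations"][0]["image_id"])
```

Both fields have to be rewritten with the same mapping, otherwise the annotations no longer point at their images.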

Use the following configuration to perform fine-tuning:

pipeline: [fine_tune]

fine_tune:
    pipe_step:
        type: TrainPipeStep

    model:
        pretrained_model_file: /cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
        head: roi_heads
        model_desc:
            type: FasterRCNN
            convert_pretrained: True
            num_classes: 4
            backbone:
                type: SerialBackbone

    trainer:
        type: Trainer
        epochs: 25
        # with_train: False
        optimizer:
            type: SGD
            params:
                lr: 0.02
                momentum: 0.9
                weight_decay: !!float 1e-4
        lr_scheduler:
            type: WarmupScheduler
            by_epoch: False
            params:
                warmup_type: linear
                warmup_iters: 1000
                warmup_ratio: 0.001
                after_scheduler_config:
                    type: MultiStepLR
                    by_epoch: True
                    params:
                        milestones: [ 10, 20 ]
                        gamma: 0.1
        loss:
            type: SumLoss
        metric:
            type: coco
            params:
                anno_path: /datasets/voc_coco/bccd_coco/annotations/instances_val2017.json

    dataset:
        type: CocoDataset
        common:
            data_root: /datasets/voc_coco/bccd_coco/
            batch_size: 4
            img_prefix: "2017"
            ann_prefix: instances
            num_classes: 3
            test_size: 1

Note that num_classes is 4 in model_desc but 3 in dataset. This is because the category IDs in the dataset's annotation file are 1, 2, and 3 and do not start from 0, so the model needs output indices 0 through 3.
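
The rule in the note above can be derived from the annotation file itself. This is a small sketch that assumes the file's categories list has been loaded into a dict. The function name suggest_num_classes is hypothetical, and whether the same rule holds for non-contiguous category IDs is not confirmed in this thread.

```python
def suggest_num_classes(coco):
    """Suggest the two num_classes values from a COCO-style annotation dict."""
    category_ids = sorted(cat["id"] for cat in coco["categories"])
    return {
        # Number of labeled categories, for the dataset section.
        "dataset_num_classes": len(category_ids),
        # One output index per id from 0 up to the largest id, for model_desc.
        "model_num_classes": category_ids[-1] + 1,
    }


coco = {"categories": [{"id": 1}, {"id": 2}, {"id": 3}]}
print(suggest_num_classes(coco))
```

For category IDs 1, 2, and 3 this gives 3 for the dataset and 4 for the model, matching the configuration above.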

zhangjiajin commented 2 years ago

@vanessasidrim

In the 14th epoch, the gradient explodes, and all of the metrics are -1.

2022-06-23 02:26:53.528 INFO worker id [0], epoch [13/25], current valid perfs [mAP: 36.574, AP50: 77.358, AP_small: 4.000, AP_medium: 24.747, AP_large: 49.130], best valid perfs [mAP: 52.552, AP50: 82.835, AP_small: 9.299, AP_medium: 37.015, AP_large: 62.955]
2022-06-23 02:26:55.360 INFO worker id [0], epoch [14/25], train step [ 0/51], loss [   0.733,    0.733], lr [   0.0132468],  time pre batch [0.992s] , total mean time per batch [0.992s]
2022-06-23 02:27:05.961 INFO worker id [0], epoch [14/25], train step [10/51], loss [   0.776,    0.732], lr [   0.0134466],  time pre batch [0.986s] , total mean time per batch [1.006s]
2022-06-23 02:27:16.724 INFO worker id [0], epoch [14/25], train step [20/51], loss [274910.344, 13100.794], lr [   0.0136464],  time pre batch [0.998s] , total mean time per batch [1.006s]
2022-06-23 02:27:27.158 INFO worker id [0], epoch [14/25], train step [30/51], loss [     nan,      nan], lr [   0.0138462],  time pre batch [0.970s] , total mean time per batch [1.006s]
2022-06-23 02:27:38.9 INFO worker id [0], epoch [14/25], train step [40/51], loss [     nan,      nan], lr [   0.0140460],  time pre batch [1.000s] , total mean time per batch [1.006s]
2022-06-23 02:27:48.928 INFO worker id [0], epoch [14/25], train step [50/51], loss [     nan,      nan], lr [   0.0142458],  time pre batch [1.006s] , total mean time per batch [1.006s]
2022-06-23 02:27:58.31 INFO worker id [0], epoch [14/25], current valid perfs [mAP: -1.000, AP_small: -1.000, AP_medium: -1.000, AP_large: -1.000], best valid perfs [mAP: 52.552, AP50: 82.835, AP_small: 9.299, AP_medium: 37.015, AP_large: 62.955]
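
Divergence like this can be found automatically by scanning the training log. The sketch below parses lines in the format shown above and reports the first step whose loss is non-finite or above a threshold. The function name first_divergence and the threshold of 1e3 are arbitrary assumptions, not Vega settings.

```python
import math
import re

# Matches e.g. "epoch [14/25], train step [20/51], loss [274910.344, 13100.794]"
STEP_RE = re.compile(
    r"epoch \[(\d+)/\d+\], train step \[\s*(\d+)/\d+\], "
    r"loss \[\s*([^,\s]+),\s*([^\]\s]+)\]"
)


def first_divergence(lines, limit=1e3):
    """Return (epoch, step, loss) for the first non-finite or huge loss."""
    for line in lines:
        match = STEP_RE.search(line)
        if not match:
            continue
        epoch, step = int(match.group(1)), int(match.group(2))
        loss = float(match.group(3))  # float() also parses the logged "nan"
        if not math.isfinite(loss) or loss > limit:
            return epoch, step, loss
    return None


log = [
    "INFO worker id [0], epoch [14/25], train step [10/51], loss [   0.776,    0.732], lr [   0.0134466]",
    "INFO worker id [0], epoch [14/25], train step [20/51], loss [274910.344, 13100.794], lr [   0.0136464]",
    "INFO worker id [0], epoch [14/25], train step [30/51], loss [     nan,      nan], lr [   0.0138462]",
]
print(first_divergence(log))
```

On the excerpt above this points at epoch 14, step 20, which is where the loss jumps just before turning into nan.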

Reduce the learning rate to half of the original value (from 0.02 to 0.01).

        optimizer:
            type: SGD
            params:
                lr: 0.01
                momentum: 0.9
                weight_decay: !!float 1e-4

Training succeeded:

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.537
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.797
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.615
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.147
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.368
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.653
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.375
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.577
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.621
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.157
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.485
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.725
2022-06-23 03:07:05.383 INFO worker id [0], epoch [25/25], current valid perfs [mAP: 53.677, AP50: 79.720, AP_small: 14.728, AP_medium: 36.768, AP_large: 65.289], best valid perfs [mAP: 54.138, AP50: 80.392, AP_small: 13.985, AP_medium: 37.537, AP_large: 65.649]
2022-06-23 03:07:06.133 INFO flops: 177.578512985 , params:41362.531
2022-06-23 03:07:06.133 INFO Finished the unified trainer successfully.
2022-06-23 03:07:08.335 INFO ------------------------------------------------
2022-06-23 03:07:08.335 INFO   Pipeline end.
2022-06-23 03:07:08.335 INFO 
2022-06-23 03:07:08.335 INFO   task id: 0623.023919.824
2022-06-23 03:07:08.335 INFO   output folder: /data/tasks/0623.023919.824/output
2022-06-23 03:07:08.335 INFO 
2022-06-23 03:07:08.336 INFO   running time:
2022-06-23 03:07:08.336 INFO          fine_tune:  0:27:44  [2022-06-23 02:39:21.599546 - 2022-06-23 03:07:06.334006]
2022-06-23 03:07:08.336 INFO 
2022-06-23 03:07:08.343 INFO   result:
2022-06-23 03:07:08.343 INFO     0:  {'flops': 177.578512985, 'params': 41362.531, 'mAP': 54.137894672785315, 'AP50': 80.39203084795886, 'AP_small': 13.985148514851486, 'AP_medium': 37.537002484435, 'AP_large': 65.64869730651068}
2022-06-23 03:07:08.344 INFO ------------------------------------------------