vanessasidrim opened this issue 2 years ago
@vanessasidrim
Custom datasets cannot be used directly.
There are two options:
I converted my dataset to COCO format and managed to run SP-NAS, but the training and validation results (mAP and AP) are all zero. Do I also need to make changes to the metrics code? Does the implementation use the MS COCO API to generate these metrics?
@vanessasidrim
You just need to change the data format.
Please attach the run logs from <task id>/logs/ to help resolve this issue.
attached the logs: fine_tune_worker_0.log parallel_worker_1.log parallel_worker_2.log parallel_worker_3.log parallel_worker_4.log parallel_worker_5.log parallel_worker_6.log parallel_worker_7.log parallel_worker_8.log parallel_worker_9.log parallel_worker_10.log parallel_worker_11.log parallel_worker_12.log parallel_worker_13.log parallel_worker_14.log parallel_worker_15.log parallel_worker_16.log parallel_worker_17.log parallel_worker_18.log parallel_worker_19.log parallel_worker_20.log parallel_worker_21.log pipeline.log reignition_worker_8.log reignition_worker_15.log serial_worker_1.log serial_worker_2.log serial_worker_3.log serial_worker_4.log serial_worker_5.log serial_worker_6.log serial_worker_7.log serial_worker_8.log serial_worker_9.log serial_worker_10.log serial_worker_11.log serial_worker_12.log serial_worker_13.log serial_worker_14.log serial_worker_15.log serial_worker_16.log serial_worker_17.log serial_worker_18.log serial_worker_19.log serial_worker_20.log
@vanessasidrim
It is possible that the number of classes does not match the pre-trained model. Adjust the number of classes for fine-tuning and check whether the precision increases.
general:
task:
local_base_path: /VEGA/vega/examples/nas/sp_nas/tasks
pipeline: [fine_tune] # <-- Only finetune. Check whether the precision increases.
fine_tune:
pipe_step:
type: TrainPipeStep
model:
pretrained_model_file: /VEGA/vega/examples/nas/sp_nas/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
model_desc:
type: FasterRCNN
convert_pretrained: True
num_classes: <classes of your dataset> # <- set number of classes
backbone:
type: SerialBackbone
trainer:
type: Trainer
epochs: 25 # <-- fine tune 25 epochs
# with_train: False # <-- disable this parameter
optimizer:
type: SGD
params:
lr: 0.02
momentum: 0.9
weight_decay: !!float 1e-4
lr_scheduler:
type: WarmupScheduler
by_epoch: False
params:
warmup_type: linear
warmup_iters: 1000
warmup_ratio: 0.001
after_scheduler_config:
type: MultiStepLR
by_epoch: True
params:
milestones: [ 10, 20 ]
gamma: 0.1
loss:
type: SumLoss
metric:
type: coco
params:
anno_path: /VEGA/isolador_coco/annotations/instances_val2017.json
dataset:
type: CocoDataset
common:
data_root: /VEGA/isolador_coco/
batch_size: 4
img_prefix: "2017"
ann_prefix: instances
I ran with this configuration and got the following error: Unexpected key(s) in state_dict for conversion: roi_heads.box_predictor.cls_score.weight torch.Size([91, 1024]) --> roi_heads.box_predictor.cls_score.weight torch.Size([1, 1024])
@vanessasidrim
Please update the config to specify the head name:
model:
pretrained_model_file: /VEGA/vega/examples/nas/sp_nas/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
head: roi_heads # <-- specify the head name
model_desc:
type: FasterRCNN
convert_pretrained: True
num_classes: <classes of your dataset>
backbone:
type: SerialBackbone
And update the file: vega/networks/faster_rcnn.py
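For reference, the general idea behind the head option is that the pretrained checkpoint was trained with 91 COCO classes, so the head weights no longer match a model built with a different num_classes; those keys have to be skipped and only the remaining weights loaded. Below is a minimal Python sketch of that logic for a torchvision-style FasterRCNN, purely as an illustration: it is not the actual vega/networks/faster_rcnn.py patch, and the function name is made up.

import torch

def load_pretrained_skip_head(model, weights_file, head_prefix="roi_heads"):
    # The checkpoint was trained for 91 COCO classes; every parameter under the
    # head prefix is skipped so the freshly initialized head is kept as-is.
    state_dict = torch.load(weights_file, map_location="cpu")
    kept = {k: v for k, v in state_dict.items() if not k.startswith(head_prefix)}
    skipped = [k for k in state_dict if k.startswith(head_prefix)]
    # strict=False tolerates the missing head keys.
    model.load_state_dict(kept, strict=False)
    return skipped

The skipped keys correspond roughly to the "Not Swap Keys" list that appears in the successful log further down.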
The same error occurred after the changes:
Unexpected key(s) in state_dict for conversion: roi_heads.box_predictor.cls_score.weight torch.Size([91, 1024]) --> roi_heads.box_predictor.cls_score.weight torch.Size([1, 1024])
My logs:
Before the code change:
2022-06-09 02:22:12.359 INFO ------------------------------------------------
2022-06-09 02:22:12.359 INFO Step: fine_tune
2022-06-09 02:22:12.359 INFO ------------------------------------------------
2022-06-09 02:22:12.366 INFO init TrainPipeStep...
2022-06-09 02:22:12.366 INFO TrainPipeStep started...
2022-06-09 02:22:12.798 INFO Model was created.
2022-06-09 02:22:12.799 INFO load model weights from file, weights file=/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
2022-06-09 02:22:12.969 ERROR Failed to run worker, id: 0, message: Unexpected key(s) in state_dict for convert: roi_heads.box_predictor.cls_score.weight torch.Size([91, 1024]) --> roi_heads.box_predictor.cls_score.weight torch.Size([1, 1024])
2022-06-09 02:22:15.13 INFO ------------------------------------------------
After the code change:
2022-06-09 02:28:34.877 INFO ------------------------------------------------
2022-06-09 02:28:34.879 INFO Step: fine_tune
2022-06-09 02:28:34.879 INFO ------------------------------------------------
2022-06-09 02:28:34.885 INFO init TrainPipeStep...
2022-06-09 02:28:34.885 INFO TrainPipeStep started...
2022-06-09 02:28:35.360 INFO Model was created.
2022-06-09 02:28:35.361 INFO load model weights from file, weights file=/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
2022-06-09 02:28:35.533 INFO Not Swap Keys: ['roi_heads.box_head.fc6.weight', 'roi_heads.box_head.fc6.bias', 'roi_heads.box_head.fc7.weight', 'roi_heads.box_head.fc7.bias', 'roi_heads.box_predictor.cls_score.weight', 'roi_heads.box_predictor.cls_score.bias', 'roi_heads.box_predictor.bbox_pred.weight', 'roi_heads.box_predictor.bbox_pred.bias']
loading annotations into memory...
Done (t=0.67s)
creating index...
index created!
loading annotations into memory...
Done (t=0.67s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
2022-06-09 02:28:44.695 INFO flops: 177.56315298500002 , params:41347.156
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
^-- This assertion is caused by a mismatch between the number of categories in the dataset and the number of classes in the configuration file. This log output is expected in that case.
2022-06-09 02:28:45.618 ERROR Failed to run worker, id: 0, message: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
2022-06-09 02:28:47.628 INFO ------------------------------------------------
The issue has been resolved.
Check whether the configuration file contains:
head: roi_heads
and whether the file vega/networks/faster_rcnn.py has been replaced correctly.
~/repo/automl$ pip3 show noah-vega
Name: noah-vega
Version: 1.8.0
Summary: AutoML Toolkit
Home-page: https://github.com/huawei-noah/vega
Author: Huawei Noah's Ark Lab
Author-email:
License: Apache License 2.0
Location: /home/user/.local/lib/python3.7/site-packages <--- here
Requires: click, distributed, numpy, opencv-python, pandas, pillow, psutil, PyYAML, pyzmq, scikit-learn, scipy, tensorboardX, thop
Required-by:
/home/user/.local/lib/python3.7/site-packages/vega/networks/faster_rcnn.py
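To confirm that the replaced file is the one Python actually imports, a quick check (standard Python, nothing vega-specific):

import vega.networks.faster_rcnn as faster_rcnn

# Should print the path under the site-packages location shown above.
print(faster_rcnn.__file__)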
I managed to run it, but at every epoch the results are: current valid perfs [mAP: -1.000, AP_small: -1.000, AP_medium: -1.000, AP_large: -1.000], best valid perfs [mAP: -1.000, AP_small: -1.000, AP_medium: -1.000, AP_large: -1.000]
@vanessasidrim
Is this output from the NAS phase or the fine-tune phase?
This is the output of the fine-tune phase.
@vanessasidrim
That's because none of the predictions hit the ground truth, so the accuracy is reported as -1. The dataset may be labeled incorrectly.
Could you tell me whether the segmentation values affect the calculation of these metrics?
My dataset was in VOC format, so I converted it to COCO format. The segmentation information did not exist in the original annotations but appeared to be mandatory, so since I am only interested in detection I inserted random values for this key in the .json file.
@vanessasidrim
The segmentation values do not impact the calculation of these metrics.
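For context on why segmentation does not matter here: the AP/AR table printed at the end of this thread is the standard pycocotools output, and detection evaluation there uses iouType="bbox", which never reads the segmentation field. A small sketch of that evaluation, with placeholder file names, assuming detections have been exported in the COCO result format:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: ground-truth annotations and detections in COCO result format.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections_val2017.json")

# iouType="bbox" evaluates boxes only; "segmentation" entries are ignored.
coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints the Average Precision / Average Recall table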
We also found a setting that needs to be adjusted: the number of classes in the dataset, as shown in the following:
dataset:
type: CocoDataset
common:
data_root: /VEGA/isolador_coco/
num_classes: 1 # <--- here
batch_size: 4
img_prefix: "2017"
ann_prefix: instances
We are also trying to convert the VOC format to the COCO format to see if there are other issues.
@vanessasidrim
I used the tool voc2coco to change the format of BCCD_Dataset to COCO:
https://github.com/yukkyo/voc2coco
https://github.com/Shenggan/BCCD_Dataset
Then changed the image IDs from strings to integers, i.e. the id field in images and the image_id field in annotations (see the conversion sketch after the JSON examples below).
"images": [
{
"file_name": "BloodImage_00000.jpg",
"height": 480,
"width": 640,
"id": 0
}
]
"annotations": [
{
"area": 46400,
"iscrowd": 0,
"bbox": [
259,
176,
232,
200
],
"category_id": 3,
"ignore": 0,
"segmentation": [],
"image_id": 0,
"id": 1
},
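A minimal sketch of that ID conversion (file names here are placeholders, and it assumes every image keeps a unique id):

import json

# Placeholder input: the annotation file produced by voc2coco.
with open("annotations_voc2coco.json") as f:
    coco = json.load(f)

# Map every original string image id to a new integer id.
id_map = {img["id"]: i for i, img in enumerate(coco["images"])}
for img in coco["images"]:
    img["id"] = id_map[img["id"]]

# Re-point annotations at the new image ids and give them integer ids too.
for new_id, ann in enumerate(coco["annotations"], start=1):
    ann["image_id"] = id_map[ann["image_id"]]
    ann["id"] = new_id

# Placeholder output name.
with open("annotations_int_ids.json", "w") as f:
    json.dump(coco, f)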
Run fine-tuning with the following configuration:
pipeline: [fine_tune]
fine_tune:
pipe_step:
type: TrainPipeStep
model:
pretrained_model_file: /cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
head: roi_heads
model_desc:
type: FasterRCNN
convert_pretrained: True
num_classes: 4
backbone:
type: SerialBackbone
trainer:
type: Trainer
epochs: 25
# with_train: False
optimizer:
type: SGD
params:
lr: 0.02
momentum: 0.9
weight_decay: !!float 1e-4
lr_scheduler:
type: WarmupScheduler
by_epoch: False
params:
warmup_type: linear
warmup_iters: 1000
warmup_ratio: 0.001
after_scheduler_config:
type: MultiStepLR
by_epoch: True
params:
milestones: [ 10, 20 ]
gamma: 0.1
loss:
type: SumLoss
metric:
type: coco
params:
anno_path: /datasets/voc_coco/bccd_coco/annotations/instances_val2017.json
dataset:
type: CocoDataset
common:
data_root: /datasets/voc_coco/bccd_coco/
batch_size: 4
img_prefix: "2017"
ann_prefix: instances
num_classes: 3
test_size: 1
Note that num_classes in model_desc is 4 while num_classes in dataset is 3, because the category IDs in the dataset annotations are 1, 2, and 3 and do not start from 0.
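One way to double-check both values is to read the category IDs straight from the converted annotation file (placeholder path below): the dataset-side num_classes is the number of categories, and the model-side num_classes is that number plus one, since index 0 is reserved for the background class in FasterRCNN.

import json

# Placeholder path; point this at the converted training annotation file.
with open("annotations/instances_train2017.json") as f:
    coco = json.load(f)

cat_ids = sorted(c["id"] for c in coco["categories"])
print("category ids:", cat_ids)                 # e.g. [1, 2, 3] for BCCD
print("dataset num_classes:", len(cat_ids))     # -> 3
print("model num_classes:", len(cat_ids) + 1)   # -> 4, index 0 is the background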
@vanessasidrim
In the 14th epoch, the gradient explodes, and all of the metrics are -1.
2022-06-23 02:26:53.528 INFO worker id [0], epoch [13/25], current valid perfs [mAP: 36.574, AP50: 77.358, AP_small: 4.000, AP_medium: 24.747, AP_large: 49.130], best valid perfs [mAP: 52.552, AP50: 82.835, AP_small: 9.299, AP_medium: 37.015, AP_large: 62.955]
2022-06-23 02:26:55.360 INFO worker id [0], epoch [14/25], train step [ 0/51], loss [ 0.733, 0.733], lr [ 0.0132468], time pre batch [0.992s] , total mean time per batch [0.992s]
2022-06-23 02:27:05.961 INFO worker id [0], epoch [14/25], train step [10/51], loss [ 0.776, 0.732], lr [ 0.0134466], time pre batch [0.986s] , total mean time per batch [1.006s]
2022-06-23 02:27:16.724 INFO worker id [0], epoch [14/25], train step [20/51], loss [274910.344, 13100.794], lr [ 0.0136464], time pre batch [0.998s] , total mean time per batch [1.006s]
2022-06-23 02:27:27.158 INFO worker id [0], epoch [14/25], train step [30/51], loss [ nan, nan], lr [ 0.0138462], time pre batch [0.970s] , total mean time per batch [1.006s]
2022-06-23 02:27:38.9 INFO worker id [0], epoch [14/25], train step [40/51], loss [ nan, nan], lr [ 0.0140460], time pre batch [1.000s] , total mean time per batch [1.006s]
2022-06-23 02:27:48.928 INFO worker id [0], epoch [14/25], train step [50/51], loss [ nan, nan], lr [ 0.0142458], time pre batch [1.006s] , total mean time per batch [1.006s]
2022-06-23 02:27:58.31 INFO worker id [0], epoch [14/25], current valid perfs [mAP: -1.000, AP_small: -1.000, AP_medium: -1.000, AP_large: -1.000], best valid perfs [mAP: 52.552, AP50: 82.835, AP_small: 9.299, AP_medium: 37.015, AP_large: 62.955]
Adjust the learning rate to half of the original value:
optimizer:
type: SGD
params:
lr: 0.01
momentum: 0.9
weight_decay: !!float 1e-4
Training then succeeds:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.537
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.797
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.615
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.147
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.368
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.653
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.375
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.577
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.621
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.157
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.485
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.725
2022-06-23 03:07:05.383 INFO worker id [0], epoch [25/25], current valid perfs [mAP: 53.677, AP50: 79.720, AP_small: 14.728, AP_medium: 36.768, AP_large: 65.289], best valid perfs [mAP: 54.138, AP50: 80.392, AP_small: 13.985, AP_medium: 37.537, AP_large: 65.649]
2022-06-23 03:07:06.133 INFO flops: 177.578512985 , params:41362.531
2022-06-23 03:07:06.133 INFO Finished the unified trainer successfully.
2022-06-23 03:07:08.335 INFO ------------------------------------------------
2022-06-23 03:07:08.335 INFO Pipeline end.
2022-06-23 03:07:08.335 INFO
2022-06-23 03:07:08.335 INFO task id: 0623.023919.824
2022-06-23 03:07:08.335 INFO output folder: /data/tasks/0623.023919.824/output
2022-06-23 03:07:08.335 INFO
2022-06-23 03:07:08.336 INFO running time:
2022-06-23 03:07:08.336 INFO fine_tune: 0:27:44 [2022-06-23 02:39:21.599546 - 2022-06-23 03:07:06.334006]
2022-06-23 03:07:08.336 INFO
2022-06-23 03:07:08.343 INFO result:
2022-06-23 03:07:08.343 INFO 0: {'flops': 177.578512985, 'params': 41362.531, 'mAP': 54.137894672785315, 'AP50': 80.39203084795886, 'AP_small': 13.985148514851486, 'AP_medium': 37.537002484435, 'AP_large': 65.64869730651068}
2022-06-23 03:07:08.344 INFO ------------------------------------------------
Is it possible to run SP-NAS with one's own dataset (other than MS COCO, Pascal VOC, ...)?