open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Training an FSAF model on distributed GPUs fails with 'utf-8' codec can't decode byte 0x80 #8911

Open alaa-shubbak opened 2 years ago

alaa-shubbak commented 2 years ago

Prerequisite

💬 Describe the reimplementation questions

I am not sure whether this is a bug or an error in my implementation.

I run this command for distributed training of an FSAF model: `bash tools/dist_train.sh configs/machine/fsaf_r101_fpn_1x_coco.py 4 --work-dir train_results/fsaf_ciou_resnet101`

I run it from a bash script after activating my Python environment.

The error message is as below (screenshot: recording error1):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The model is able to run on one GPU outside the bash script, but slowly, so I would like to use distributed training on 4 GPUs to make it faster.

Environment

My environment after running `python mmdet/utils/collect_env.py` is attached (screenshot: collect env).

I created the environment using a plain Python virtual environment, as conda does not work with the server system I have. I installed PyTorch using pip from the official site.

Expected results

There are no results so far, as the model is not running.

Additional information

I made some modifications to my model which I am sure are correct, as I understand each step. I used my own dataset, which has the same format as the COCO dataset, and the JSON files are valid. I have no idea why this error occurred.

RangiLyu commented 2 years ago

Please check your config file to avoid invalid characters.
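For example, a quick sketch like the one below can locate the offending byte. The `configs/` directory is only an assumption; point it at whichever files your run actually reads.

```python
# Rough sketch: report any .py config that is not valid UTF-8.
# The 'configs' directory is an assumption; adjust it to your setup.
from pathlib import Path

def check_utf8(path):
    data = Path(path).read_bytes()
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        print(f'{path}: invalid byte {data[err.start]:#04x} at offset {err.start}')

for cfg in Path('configs').rglob('*.py'):
    check_utf8(cfg)
```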

alaa-shubbak commented 2 years ago

Thanks for replying. I am trying to run the FSAF model with a ResNet-101 backbone and an Empirical Attention block: https://github.com/open-mmlab/mmdetection/tree/master/configs/empirical_attention

From this I got the error.

By the way, it works correctly in Colab but does not work on my cluster system. Could changing the interval value from 50 to 100, as below, cause that?

```python
log_config = dict(
    interval=100,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(type='TensorboardLoggerHook')
    ])
```
alaa-shubbak commented 2 years ago

I checked my config; it is as below:

```python
model = dict(
    type='FSAF',
    backbone=dict(
        type='ResNet', depth=101, num_stages=4, out_indices=(0, 1, 2, 3),
        frozen_stages=1, norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True, style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet101'),
        plugins=[
            dict(
                cfg=dict(
                    type='GeneralizedAttention', spatial_range=-1, num_heads=8,
                    attention_type='0010', kv_stride=2),
                stages=(False, False, True, True),
                position='after_conv2')
        ]),
    neck=dict(
        type='FPN', in_channels=[256, 512, 1024, 2048], out_channels=256,
        start_level=1, add_extra_convs='on_input', num_outs=5),
    bbox_head=dict(
        type='FSAFHead', num_classes=80, in_channels=256, stacked_convs=4,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator', octave_base_scale=1, scales_per_octave=1,
            ratios=[1.0], strides=[8, 16, 32, 64, 128]),
        bbox_coder=dict(type='TBLRBBoxCoder', normalizer=4.0),
        loss_cls=dict(
            type='FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25,
            loss_weight=1.0, reduction='none'),
        loss_bbox=dict(type='GIoULoss', eps=1e-06, loss_weight=1.0, reduction='none'),
        reg_decoded_bbox=True),
    train_cfg=dict(
        assigner=dict(
            type='CenterRegionAssigner', pos_scale=0.2, neg_scale=0.2,
            min_pos_iof=0.01),
        allowed_border=-1, pos_weight=-1, debug=False),
    test_cfg=dict(
        nms_pre=1000, min_bbox_size=0, score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.5), max_per_img=100))
dataset_type = 'CocoDataset'
data_root = '../../../scratch/shubbak/large_files/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='CocoDataset',
        ann_file='../../../scratch/shubbak/large_files/coco/annotations/instances_train2017.json',
        img_prefix='../../../scratch/shubbak/large_files/coco/images/train2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='Normalize', mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='CocoDataset',
        ann_file='../../../scratch/shubbak/large_files/coco/annotations/instances_val2017.json',
        img_prefix='../../../scratch/shubbak/large_files/coco/images/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
                         std=[58.395, 57.12, 57.375], to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        ann_file='../../../scratch/shubbak/large_files/coco/annotations/instances_val2017.json',
        img_prefix='../../../scratch/shubbak/large_files/coco/images/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
                         std=[58.395, 57.12, 57.375], to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=10, norm_type=2))
lr_config = dict(
    policy='step', warmup='linear', warmup_iters=500, warmup_ratio=0.001,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=60)
checkpoint_config = dict(interval=1)
log_config = dict(interval=1000, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_scale_lr = dict(enable=False, base_batch_size=16)
work_dir = '../../../scratch/shubbak/large_files/train_results/coco/fsaf_giou_resnet101_att'
auto_resume = False
gpu_ids = [0]
```

I ran my config on 1 GPU and it runs correctly, but when I switch to multiple GPUs (4 GPUs), this error appears.

My environment is as below:

```
TorchVision: 0.10.0+cu111
OpenCV: 4.6.0
MMCV: 1.6.2
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.25.2+9d3e162
```

I made the open-mmlab environment using a Python virtual environment rather than by creating a conda environment, but it works correctly with 1 GPU. Do you think this could be the reason?

RangiLyu commented 2 years ago

So it seems that your config file codec is correct. How do you launch your distributed training processes? Using torch.distributed.launch or a Slurm command?
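Another quick check is to load the composed config the same way tools/train.py does. This is only a sketch, assuming the mmcv 1.x `Config` API and the config path from your command above; a bad byte in any file it pulls in should raise the same UnicodeDecodeError outside the distributed launcher.

```python
# Sketch: load the config (and all of its _base_ files) with mmcv 1.x.
# A bad byte in any of those files should surface here as UnicodeDecodeError.
from mmcv import Config

cfg = Config.fromfile('configs/machine/fsaf_r101_fpn_1x_coco.py')  # path from this issue
print(cfg.pretty_text[:300])  # print the start of the resolved config
```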

alaa-shubbak commented 2 years ago

I used torch.distributed.launch, which is exactly what the bash script ./tools/dist_train.sh calls.

But I am working on my university's high-performance computing system, not on my own PC with multiple GPUs.

alaa-shubbak commented 2 years ago

Can I change base_batch_size in default_runtime.py from base_batch_size=16 to base_batch_size=32, then set the batch size samples_per_gpu=8, keep workers_per_gpu=2, and use 4 GPUs?

Would that be possible, and would it increase the training speed of my model on 4 GPUs?

alaa-shubbak commented 2 years ago

> So it seems that your config file codec is correct. How do you launch your distributed training processes? Using torch.distributed.launch or a Slurm command?

I am also trying with Slurm; I got this error after running the command directly in the terminal (screenshot: slurm running).

I don't know what the exact issue or problem was.

I am also doing the training by sending a .sh script to the university's high-performance computing system.

In general, I use this type of script (screenshot: slurm script).

I am not sure if I wrote it correctly, as it still needs time to be sent to the system and run.

I will be grateful for any suggestion or help.

RangiLyu commented 2 years ago

> Would that be possible, and would it increase the training speed of my model on 4 GPUs?

It depends on what the bottleneck of your training process is. Sometimes increasing only the batch size does not help because the bottleneck is CPU data processing or data I/O. Usually you need to increase workers_per_gpu as well if you train with a larger batch.
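As a rough sketch only (the values are illustrative, not a recommendation), the relevant part of the config for a 4-GPU run could look like this; with `samples_per_gpu=8` the effective batch size is 4 × 8 = 32, and `base_batch_size` only has an effect if `auto_scale_lr` is enabled:

```python
# Illustrative values only; tune samples_per_gpu / workers_per_gpu to your hardware.
data = dict(
    samples_per_gpu=8,   # per-GPU batch size: 4 GPUs x 8 = 32 images per iteration
    workers_per_gpu=4,   # more dataloader workers so data loading keeps up
    # train / val / test settings stay as in the config above
)
# Only used when automatic LR scaling is turned on:
auto_scale_lr = dict(enable=True, base_batch_size=32)
```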

RangiLyu commented 2 years ago

> I am not sure if I wrote it correctly, as it still needs time to be sent to the system and run.

The error message shows that Slurm's srun command is incorrect. I don't know what command your school's cluster requires; you may need to follow its documentation or ask your school's IT department.

alaa-shubbak commented 2 years ago

I am trying again to run my model with only 2 GPUs on a huge dataset (the whole COCO dataset).

My environment is as below (screenshot: environment 2gpus).

And I got the same error (screenshot: error today):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I think my bottleneck is the huge dataset in addition to using the attention mechanism. I used the distributed training script (./tools/dist_train.sh) and set the batch size samples_per_gpu=8 and workers_per_gpu=4.

Please also note that when I run the model on one GPU it works, but with multiple GPUs (for example, 2 GPUs here) it does not.