alaa-shubbak opened this issue 2 years ago
Please check your config file to avoid invalid characters.
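For example, a quick scan like this flags the first byte that is not valid UTF-8 (a minimal sketch; the path is a placeholder):

```python
path = 'configs/machine/fsaf_r101_fpn_1x_coco.py'  # placeholder path

with open(path, 'rb') as f:
    raw = f.read()
try:
    raw.decode('utf-8')
    print('file decodes cleanly as UTF-8')
except UnicodeDecodeError as err:
    print(f'invalid byte 0x{raw[err.start]:02x} at offset {err.start}')
```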
Thanks for replying. I am trying to run the FSAF model with a ResNet-101 backbone and the Empirical Attention block (https://github.com/open-mmlab/mmdetection/tree/master/configs/empirical_attention), and this is where I got the error.
By the way, it works correctly in Colab but not on my cluster system. Could changing the interval value from 50 to 100, as below, cause that?
```python
log_config = dict(
    interval=100,
    hooks=[
        dict(type='TextLoggerHook'),
    ])
```
I checked my config; it is as below:
```python
model = dict(
    type='FSAF',
    backbone=dict(
        type='ResNet',
        depth=101,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet101'),
        plugins=[
            dict(
                cfg=dict(
                    type='GeneralizedAttention',
                    spatial_range=-1,
                    num_heads=8,
                    attention_type='0010',
                    kv_stride=2),
                stages=(False, False, True, True),
                position='after_conv2')
        ]),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs='on_input',
        num_outs=5),
    bbox_head=dict(
        type='FSAFHead',
        num_classes=80,
        in_channels=256,
        stacked_convs=4,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            octave_base_scale=1,
            scales_per_octave=1,
            ratios=[1.0],
            strides=[8, 16, 32, 64, 128]),
        bbox_coder=dict(type='TBLRBBoxCoder', normalizer=4.0),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0,
            reduction='none'),
        loss_bbox=dict(type='GIoULoss', eps=1e-06, loss_weight=1.0, reduction='none'),
        reg_decoded_bbox=True),
    train_cfg=dict(
        assigner=dict(
            type='CenterRegionAssigner',
            pos_scale=0.2,
            neg_scale=0.2,
            min_pos_iof=0.01),
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.5),
        max_per_img=100))
dataset_type = 'CocoDataset'
data_root = '../../../scratch/shubbak/large_files/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='CocoDataset',
        ann_file='../../../scratch/shubbak/large_files/coco/annotations/instances_train2017.json',
        img_prefix='../../../scratch/shubbak/large_files/coco/images/train2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(type='Normalize', mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='CocoDataset',
        ann_file='../../../scratch/shubbak/large_files/coco/annotations/instances_val2017.json',
        img_prefix='../../../scratch/shubbak/large_files/coco/images/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
                         std=[58.395, 57.12, 57.375], to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='CocoDataset',
        ann_file='../../../scratch/shubbak/large_files/coco/annotations/instances_val2017.json',
        img_prefix='../../../scratch/shubbak/large_files/coco/images/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
                         std=[58.395, 57.12, 57.375], to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=10, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=60)
checkpoint_config = dict(interval=1)
log_config = dict(interval=1000, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_scale_lr = dict(enable=False, base_batch_size=16)
work_dir = '../../../scratch/shubbak/large_files/train_results/coco/fsaf_giou_resnet101_att'
auto_resume = False
gpu_ids = [0]
```
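To double-check that this file decodes cleanly on its own, it can be loaded outside the trainer (a sketch assuming mmcv 1.x, as used by mmdetection 2.x; the path is a placeholder for wherever this config lives):

```python
from mmcv import Config

cfg_path = 'configs/machine/fsaf_r101_fpn_1x_coco.py'  # placeholder path
cfg = Config.fromfile(cfg_path)  # raises UnicodeDecodeError if the file has invalid bytes
print(cfg.model.type)            # 'FSAF' when the file parses correctly
```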
I ran my config on 1 GPU and it runs correctly, but when I switch to multiple GPUs (4 GPUs), this error appears.
My environment is as below:
I made the open_mmlab environment using a Python virtual environment rather than by creating a conda environment, but it works correctly with 1 GPU. Do you think this could be the reason?
So it seems that your config file's encoding is correct. How do you launch your distributed training processes: with torch.distributed.launch or with a slurm command?
I used torch.distributed.launch; the exact call is through this script:
bash ./tools/dist_train.sh
But I am working on my university's high-performance computing system, not on my own PC with multiple GPUs.
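(For context: in the mmdetection 2.x tree, tools/dist_train.sh is a thin wrapper that essentially runs python -m torch.distributed.launch --nproc_per_node=$GPUS tools/train.py $CONFIG --launcher pytorch, so the usual invocation is bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} plus any extra arguments.)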
Can I instead change base_batch_size in default_runtime.py from base_batch_size=16 to base_batch_size=32, then set the batch size samples_per_gpu=8, keep workers_per_gpu=2, and use 4 GPUs?
Would that be possible, and would it speed up training my model on 4 GPUs?
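In config terms, what I am proposing would look like the sketch below. If I understand the option correctly, auto_scale_lr.base_batch_size records the batch size the default learning rate was tuned for, so perhaps it should stay at 16 while only samples_per_gpu changes:

```python
# Sketch of the proposed change: 4 GPUs x samples_per_gpu=8 = total batch 32.
data = dict(samples_per_gpu=8, workers_per_gpu=2)

# When enabled, mmdetection rescales the learning rate by
# (num_gpus * samples_per_gpu) / base_batch_size, so base_batch_size is
# normally left at the value the default lr assumes (16 here).
auto_scale_lr = dict(enable=True, base_batch_size=16)
```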
> So it seems that your config file's encoding is correct. How do you launch your distributed training processes: with torch.distributed.launch or with a slurm command?
I am also trying with slurm; I got the following error after running the command directly in the terminal.
I don't know what the exact issue or problem was.
I am also running the training by submitting a .sh script to the university's high-performance computing system.
In general, I use this type of script:
I am not sure whether I wrote it correctly, as it still takes time for it to be queued on the system and run.
I would be grateful for any suggestion or help.
> Would that be possible, and would it speed up training my model on 4 GPUs?
It depends on what the bottleneck of your training process is. Sometimes increasing the batch size alone does not help, because the bottleneck is CPU data processing or data IO. Usually you need to increase workers_per_gpu as well when you train with a larger batch.
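A rough way to check is to time the dataloader on its own, without the model: if iteration gets much faster with more workers, CPU-side data processing is the limit. A generic PyTorch sketch, with a dummy dataset standing in for COCO:

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class FakeCocoImages(Dataset):
    """Stand-in dataset that mimics a CPU-heavy decode/augment step."""

    def __len__(self):
        return 256

    def __getitem__(self, idx):
        img = torch.rand(3, 800, 1333)  # pretend-decoded image
        return img * 2.0 - 1.0          # pretend augmentation work

if __name__ == '__main__':
    for workers in (2, 4, 8):
        loader = DataLoader(FakeCocoImages(), batch_size=8, num_workers=workers)
        start = time.time()
        for _ in loader:
            pass
        print(f'{workers} workers: {time.time() - start:.2f}s')
```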
> I am not sure whether I wrote it correctly, as it still takes time for it to be queued on the system and run.
The error message shows that the slurm srun command is incorrect. I don't know the required command for your school's cluster; you may need to follow its documentation or ask your school's IT.
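If it helps, mmdetection also ships tools/slurm_train.sh, typically invoked as GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}; it builds the srun command for you, though the partition name still has to come from your cluster's documentation.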
I am trying again to run my model with only 2 GPUs, on a huge dataset (the whole COCO dataset).
My environment is as below:
and I got the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I think my bottleneck is the huge dataset, in addition to using the attention mechanism. I used the distributed-training script (./tools/dist_train.sh) and set the batch size samples_per_gpu=8 and workers_per_gpu=4.
Please also note that when I run the model on one GPU it works, but with multiple GPUs (for example, 2 GPUs here) it does not.
Prerequisite
💬 Describe the reimplementation questions
I am not sure whether this is a bug or an error in my implementation.
I run this command for distributed training of an FSAF model:
bash tools/dist_train.sh configs/machine/fsaf_r101_fpn_1x_coco.py 4 --work-dir train_results/fsaf_ciou_resnet101
I run it from a bash script that activates the Python environment.
The error message is as below:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
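Possibly relevant: byte 0x80 at position 0 is the opcode that starts every pickle stream from protocol 2 onward, so this message usually means a binary/pickled file is being opened in text mode somewhere. The exact message is easy to reproduce in isolation (a self-contained sketch, not mmdetection code):

```python
import pickle
import tempfile

# 0x80 is the PROTO opcode that begins pickle streams of protocol >= 2.
blob = pickle.dumps({'demo': 1}, protocol=2)
assert blob[0] == 0x80

with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as f:
    f.write(blob)
    path = f.name

try:
    open(path, encoding='utf-8').read()   # text mode, like a config read
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
```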
The model is able to run on one GPU outside the bash script, but slowly, so I would like to use distributed training on 4 GPUs to make it faster.
Environment
My environment, after running python mmdet/utils/collect_env.py, is:
I created the environment using a Python virtual environment, as conda does not work on the server system I have. I installed PyTorch with pip from the official site.
Expected results
There are no results so far, as the model is not running.
Additional information
I made some modifications to my model, which I am sure are correct, as I understand each step. I used my own dataset, which has a similar format to the COCO dataset, and its JSON files are valid. I have no idea why this error occurred.