ifzhang / FairMOT

[IJCV-2021] FairMOT: On the Fairness of Detection and Re-Identification in Multi-Object Tracking
MIT License

raise AssertionError("Invalid device id") AssertionError: Invalid device id #98

Open prabhat-123 opened 4 years ago

prabhat-123 commented 4 years ago

When I run the following command, the error below appears. How can I get rid of it? I am using Google Colab to run the project. The problem may be the number of GPUs: I only have 1 GPU available, while the project seems to be set up for training on multiple GPUs. Can you please help me run this correctly?

    Using tensorboardX
    Fix size testing.
    training chunk_sizes: [4, 4]
    The output will be saved to /content/FairMOT/src/lib/../../exp/mot/all_dla34
    Setting up data...
    dataset summary
    OrderedDict([('mot15', 501.0)])
    total # identities: 502
    start index
    OrderedDict([('mot15', 0)])
    heads {'hm': 1, 'wh': 2, 'id': 512, 'reg': 2}
    Namespace(K=128, arch='dla_34', batch_size=8, cat_spec_wh=False, chunk_sizes=[4, 4], conf_thres=0.6, data_cfg='../src/lib/cfg/data.json', data_dir='/content/FairMOT', dataset='jde', debug_dir='/content/FairMOT/src/lib/../../exp/mot/all_dla34/debug', dense_wh=False, det_thres=0.3, down_ratio=4, exp_dir='/content/FairMOT/src/lib/../../exp/mot', exp_id='all_dla34', fix_res=True, gpus=[0, 1], gpus_str='0,1', head_conv=256, heads={'hm': 1, 'wh': 2, 'id': 512, 'reg': 2}, hide_data_time=False, hm_weight=1, id_loss='ce', id_weight=1, img_size=(1088, 608), input_h=1088, input_res=1088, input_video='../videos/MOT16-03.mp4', input_w=608, keep_res=False, load_model='../models/ctdet_coco_dla_2x.pth', lr=0.0001, lr_step=[20, 27], master_batch_size=4, mean=None, metric='loss', min_box_area=200, mse_loss=False, nID=502, nms_thres=0.4, norm_wh=False, not_cuda_benchmark=False, not_prefetch_test=False, not_reg_offset=False, num_classes=1, num_epochs=30, num_iters=-1, num_stacks=1, num_workers=8, off_weight=1, output_format='video', output_h=272, output_res=272, output_root='../results', output_w=152, pad=31, print_iter=0, reg_loss='l1', reg_offset=True, reid_dim=512, resume=False, root_dir='/content/FairMOT/src/lib/../..', save_all=False, save_dir='/content/FairMOT/src/lib/../../exp/mot/all_dla34', seed=317, std=None, task='mot', test=False, test_mot15=False, test_mot16=False, test_mot17=False, test_mot20=False, track_buffer=30, trainval=False, val_intervals=5, val_mot15=False, val_mot16=False, val_mot17=False, val_mot20=False, vis_thresh=0.5, wh_weight=0.1)
    Creating model...
    loaded ../models/ctdet_coco_dla_2x.pth, epoch 230
    Skip loading parameter hm.2.weight, required shape torch.Size([1, 256, 1, 1]), loaded shape torch.Size([80, 256, 1, 1]). If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
    Skip loading parameter hm.2.bias, required shape torch.Size([1]), loaded shape torch.Size([80]). If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
    No param id.0.weight. If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
    No param id.0.bias. If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
    No param id.2.weight. If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
    No param id.2.bias. If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
    Starting training...
    Traceback (most recent call last):
      File "train.py", line 97, in <module>
        main(opt)
      File "train.py", line 64, in main
        trainer.set_device(opt.gpus, opt.chunk_sizes, opt.device)
      File "/content/FairMOT/src/lib/trains/base_trainer.py", line 36, in set_device
        chunk_sizes=chunk_sizes).to(device)
      File "/content/FairMOT/src/lib/models/data_parallel.py", line 127, in DataParallel
        return torch.nn.DataParallel(module, device_ids, output_device, dim)
      File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 133, in __init__
        _check_balance(self.device_ids)
      File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 19, in _check_balance
        dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
      File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 19, in <listcomp>
        dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
      File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 318, in get_device_properties
        raise AssertionError("Invalid device id")
    AssertionError: Invalid device id
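
For context, the crash comes from opt.gpus defaulting to [0, 1]: torch.nn.DataParallel checks every listed device id, and a standard Colab runtime only exposes one GPU (id 0). A minimal sketch that reproduces the same assertion on a single-GPU machine with an older torch 1.x build like the one in the log:

    import torch

    print(torch.cuda.device_count())             # 1 on a standard Colab runtime
    print(torch.cuda.get_device_properties(0))   # fine: device 0 exists
    # This is what DataParallel's _check_balance ends up doing with gpus=[0, 1]:
    torch.cuda.get_device_properties(1)          # AssertionError: Invalid device id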

ifzhang commented 4 years ago

You can set 0 here instead of 0,1: https://github.com/ifzhang/FairMOT/blob/1851158a1bc025da7e6cb839ddef9d14e33b404a/src/train.py#L95
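
Concretely, that means editing the __main__ block at the bottom of src/train.py roughly like this (a sketch based on the linked commit; the exact line numbering may differ):

    if __name__ == '__main__':
        os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # was '0, 1'
        opt = opts().parse()
        main(opt)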

prabhat-123 commented 4 years ago

I have tried it, but it still throws the same error. It's been 2 days and I am unable to find a fix.

ifzhang commented 4 years ago

Have you tried to edit this? https://github.com/ifzhang/FairMOT/blob/1851158a1bc025da7e6cb839ddef9d14e33b404a/src/lib/opts.py#L28
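
That link points at the --gpus option in src/lib/opts.py. On a single-GPU machine its default would be reduced to one id, roughly like this (a sketch; the exact argument definition in opts.py may differ slightly):

    # src/lib/opts.py -- sketch of the suggested edit
    self.parser.add_argument('--gpus', default='0',  # was '0, 1'
                             help='-1 for CPU, use comma for multiple gpus')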

prabhat-123 commented 4 years ago

No, I haven't. I will try it and reply in a couple of minutes.

prabhat-123 commented 4 years ago

No fix; I am still getting the same error.

prabhat-123 commented 4 years ago

I am also having an issue while building the deformable convolution backbone. After I run !sh make.sh, I check my CUDA setup with !python testcuda.py and it shows the error below as well. I ignored that one since it does not seem to break anything, but training still throws the "Invalid device id" error. I have already tried the steps you mentioned, but they did not fix my problem.

    torch.Size([2, 64, 128, 128])
    torch.Size([20, 32, 7, 7])
    torch.Size([20, 32, 7, 7])
    torch.Size([20, 32, 7, 7])
    0.971507, 1.943014
    0.971507, 1.943014
    Zero offset passed
    /usr/local/lib/python3.7/site-packages/torch/autograd/gradcheck.py:242: UserWarning: At least one of the inputs that requires gradient is not of double precision floating point. This check will likely fail if all the inputs are not of double precision floating point.
      'At least one of the inputs that requires gradient '
    check_gradient_dpooling: True
    Traceback (most recent call last):
      File "testcuda.py", line 265, in <module>
        check_gradient_dconv()
      File "testcuda.py", line 97, in check_gradient_dconv
        eps=1e-3, atol=1e-4, rtol=1e-2))
      File "/usr/local/lib/python3.7/site-packages/torch/autograd/gradcheck.py", line 289, in gradcheck
        'numerical:%s\nanalytical:%s\n' % (i, j, n, a))
      File "/usr/local/lib/python3.7/site-packages/torch/autograd/gradcheck.py", line 227, in fail_test
        raise RuntimeError(msg)
    RuntimeError: Jacobian mismatch for output 0 with respect to input 1,
    numerical:tensor([[0., 0., 0., ..., 0., 0., 0.],
            [0., 0., 0., ..., 0., 0., 0.],
            [0., 0., 0., ..., 0., 0., 0.],
            ...,
            [0., 0., 0., ..., 0., 0., 0.],
            [0., 0., 0., ..., 0., 0., 0.],
            [0., 0., 0., ..., 0., 0., 0.]])
    analytical:tensor([[0., 0., 0., ..., 0., 0., 0.],
            [0., 0., 0., ..., 0., 0., 0.],
            [0., 0., 0., ..., 0., 0., 0.],
            ...,
            [0., 0., 0., ..., 0., 0., 0.],
            [0., 0., 0., ..., 0., 0., 0.],
            [0., 0., 0., ..., 0., 0., 0.]])

zhijiejia commented 4 years ago

Maybe try modifying it as shown in the attached picture, once you are sure how many GPUs you have. [image]

zhijiejia commented 4 years ago

When I was training, CUDA memory was not enough; you can modify the batch_size in the same place. This method is rough, but it works. [image]
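
Instead of hard-coding new defaults in opts.py, the same effect can usually be had from the command line, for example (a sketch; pick any batch size that fits your GPU memory):

    python train.py mot --gpus 0 --batch_size 4 --load_model ../models/ctdet_coco_dla_2x.pth --data_cfg ../src/lib/cfg/data.json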

yuehui130 commented 3 years ago

@ifzhang @zhijiejia When I set batch_size to 2, there is still an error. I run this command:

    python train.py mot --load_model ../models/fairmot_dla34.pth --num_epochs 20 --lr_step 15 --data_cfg ../src/lib/cfg/data.json

Then I get this error: [image]

How can I solve this problem?

niclastrelle commented 3 years ago

My fix was to change opt.gpus and os.environ['CUDA_VISIBLE_DEVICES'] after the start of main(), because the environment variable gets overwritten later in train.py by the line os.environ['CUDA_VISIBLE_DEVICES'] = gpus_str. That overwrite makes this fix

if __name__ == '__main__':
    os.environ['CUDA_VISIBLE_DEVICES'] = '0' #original: '0, 1'
    opt = opts().parse()
    main(opt)

ineffective on its own (the value set before main() is overwritten).

I have one GPU with id 0, so inside main() I changed opt.gpus to [0] and os.environ['CUDA_VISIBLE_DEVICES'] to '0'.
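
A sketch of what that looks like (hypothetical placement near the top of main() in src/train.py; the real function contains much more setup code):

    def main(opt):
        # Workaround described above: force a single visible GPU *after* the
        # options have been parsed, so the later
        # os.environ['CUDA_VISIBLE_DEVICES'] = gpus_str assignment can no longer
        # re-introduce the non-existent device id 1.
        opt.gpus = [0]
        opt.gpus_str = '0'
        os.environ['CUDA_VISIBLE_DEVICES'] = '0'
        ...  # rest of the original main() body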

JAYCHOU2020 commented 3 years ago

When I set the gpu to 0 I get this error on the line

    opt.gpus = [i for i in range(len(opt.gpus))] if opt.gpus[0] >= 0 else [-1]

TypeError: 'int' object is not subscriptable

How do I fix it?
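
For what it's worth, that TypeError usually means opt.gpus ended up as a bare int instead of a list or a comma-separated string, so opt.gpus[0] fails. A sketch of a consistent single-GPU setting:

    # Sketch: keep gpus a comma-separated string / a list, never a bare int.
    # Setting opt.gpus = 0 breaks the opt.gpus[0] indexing shown above.
    opt.gpus_str = '0'
    opt.gpus = [0]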

Liozb commented 2 months ago

I encountered a similar problem while I had CUDA_VISIBLE_DEVICES="2" (2 is the GPU id): with that setting PyTorch only sees that one card and renumbers it as device 0, so asking for id 2 fails. I fixed it by running unset CUDA_VISIBLE_DEVICES.
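
For completeness, the same thing can be done from inside Python, as long as it happens before torch initializes CUDA (a sketch of the idea, not a FairMOT-specific fix):

    import os

    # Equivalent of `unset CUDA_VISIBLE_DEVICES`: drop the mask so all GPUs are
    # visible again and device ids start from 0.
    os.environ.pop('CUDA_VISIBLE_DEVICES', None)

    import torch
    print(torch.cuda.device_count())  # should now report every physical GPU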