Open prabhat-123 opened 4 years ago
You can set 0 here instead of 0,1: https://github.com/ifzhang/FairMOT/blob/1851158a1bc025da7e6cb839ddef9d14e33b404a/src/train.py#L95
I have tried it, but it is still throwing the same error. It has been 2 days and I have not been able to find a fix.
Have you tried to edit this? https://github.com/ifzhang/FairMOT/blob/1851158a1bc025da7e6cb839ddef9d14e33b404a/src/lib/opts.py#L28
No, I haven't. I will try it and reply in a couple of minutes.
No fix. I am still getting the same error.
I also hit this issue while building the deformable convolution network backbone. After running !sh make.sh, I checked my CUDA setup with !python testcuda.py, which printed the output below. I ignored that error because it does not seem to break anything, but training still throws the "Invalid device id" error. I have already tried the steps you mentioned, but they did not fix my problem.
torch.Size([2, 64, 128, 128])
torch.Size([20, 32, 7, 7])
torch.Size([20, 32, 7, 7])
torch.Size([20, 32, 7, 7])
0.971507, 1.943014
0.971507, 1.943014
Zero offset passed
/usr/local/lib/python3.7/site-packages/torch/autograd/gradcheck.py:242: UserWarning: At least one of the inputs that requires gradient is not of double precision floating point. This check will likely fail if all the inputs are not of double precision floating point.
'At least one of the inputs that requires gradient '
check_gradient_dpooling: True
Traceback (most recent call last):
File "testcuda.py", line 265, in <module>
Maybe try modifying it as shown in the picture, once you are sure how many GPUs you have.
When I train and CUDA memory is not enough, I modify batch_size in the same place. It is a crude method, but it works.
@ifzhang @zhijiejia When I set batch_size to 2, there is still an error. I ran this command:
python train.py mot --load_model ../models/fairmot_dla34.pth --num_epochs 20 --lr_step 15 --data_cfg ../src/lib/cfg/data.json
Then I get this error.
How can I solve this problem?
My fix was to change opt.gpus and os.environ['CUDA_VISIBLE_DEVICES'] after the start of main(), because the environment variable gets overwritten in train.py by the line:
os.environ['CUDA_VISIBLE_DEVICES'] = gpus_str
This makes the following fix ineffective:
if __name__ == '__main__':
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # original: '0, 1'
    opt = opts().parse()
    main(opt)
I had one GPU with the id 0. I changed opt.gpus to [0] and os.environ['CUDA_VISIBLE_DEVICES'] to '0'.
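To illustrate the idea behind this fix, here is a minimal sketch. It assumes you already know how many GPUs the machine has (for example from torch.cuda.device_count()). The helper name select_visible_gpus is hypothetical and not part of FairMOT. It drops requested ids that do not exist, sets the mask, and returns the renumbered ids that a DataParallel-style wrapper would then use.

```python
import os


def select_visible_gpus(requested, device_count):
    """Keep only GPU ids that exist, mask the rest, and return renumbered ids.

    `requested` is a list of ints such as [0, 1]; `device_count` is the number
    of physical GPUs (an assumption supplied by the caller).
    """
    usable = [gpu for gpu in requested if 0 <= gpu < device_count]
    if not usable:
        raise ValueError(
            f"none of the requested GPUs {requested} exist "
            f"(device_count={device_count})"
        )
    # The mask has to be set before CUDA is initialised to take effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in usable)
    # Once masked, the visible devices are numbered from 0 again.
    return list(range(len(usable)))


# One physical GPU, but the defaults ask for two.
gpus = select_visible_gpus([0, 1], device_count=1)
print(gpus)  # [0]
```

The returned list, not the originally requested one, is what should end up in opt.gpus in this sketch.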
When I set gpu = 0, I get this problem:
opt.gpus = [i for i in range(len(opt.gpus))] if opt.gpus[0] >=0 else [-1]
TypeError: 'int' object is not subscriptable
How can I fix it?
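The TypeError suggests that opt.gpus ended up as a plain int instead of a list, so opt.gpus[0] cannot be indexed. A minimal sketch of one way around it, assuming the value may arrive as an int, a comma-separated string, or a list (normalize_gpus is a hypothetical helper, not a FairMOT function):

```python
def normalize_gpus(value):
    """Return a list of GPU ids from an int, a comma-separated string, or a list."""
    if isinstance(value, int):
        # A bare 0 becomes [0], which can be indexed safely.
        return [value]
    if isinstance(value, str):
        # '0, 1' -> [0, 1]; empty pieces are skipped.
        return [int(part) for part in value.split(",") if part.strip()]
    return [int(item) for item in value]


gpus = normalize_gpus(0)
# The expression from the error message now works, because gpus is a list.
gpus = [i for i in range(len(gpus))] if gpus[0] >= 0 else [-1]
print(gpus)  # [0]
```

It may be simpler to pass the value as a string such as '0' in the first place, which appears to be what the option parser expects.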
I encountered a similar problem when I had CUDA_VISIBLE_DEVICES="2" set (2 being the GPU id). I fixed it by running unset CUDA_VISIBLE_DEVICES.
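In a notebook, where there is no shell to run unset in, roughly the same effect can be had from Python. This is only a sketch: clear_cuda_mask is a hypothetical helper, and as far as I know it has to run before anything initialises CUDA in the process.

```python
import os


def clear_cuda_mask():
    """Remove CUDA_VISIBLE_DEVICES from this process's environment.

    Returns the previous value, or None if the variable was not set.
    """
    return os.environ.pop("CUDA_VISIBLE_DEVICES", None)


os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # simulate the stale setting
previous = clear_cuda_mask()
print(previous)  # 2
```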
When I run the code, the following error appears. How can I get rid of it? I am using Google Colab to run the project. The problem seems to be the number of GPUs: I only have 1 GPU available, but the project seems to be set up to train on multiple GPUs. Can you please help me run this efficiently?
Using tensorboardX
Fix size testing.
training chunk_sizes: [4, 4]
The output will be saved to /content/FairMOT/src/lib/../../exp/mot/all_dla34
Setting up data...
dataset summary
OrderedDict([('mot15', 501.0)])
total # identities: 502
start index
OrderedDict([('mot15', 0)])
heads {'hm': 1, 'wh': 2, 'id': 512, 'reg': 2}
Namespace(K=128, arch='dla_34', batch_size=8, cat_spec_wh=False, chunk_sizes=[4, 4], conf_thres=0.6, data_cfg='../src/lib/cfg/data.json', data_dir='/content/FairMOT', dataset='jde', debug_dir='/content/FairMOT/src/lib/../../exp/mot/all_dla34/debug', dense_wh=False, det_thres=0.3, down_ratio=4, exp_dir='/content/FairMOT/src/lib/../../exp/mot', exp_id='all_dla34', fix_res=True, gpus=[0, 1], gpus_str='0,1', head_conv=256, heads={'hm': 1, 'wh': 2, 'id': 512, 'reg': 2}, hide_data_time=False, hm_weight=1, id_loss='ce', id_weight=1, img_size=(1088, 608), input_h=1088, input_res=1088, input_video='../videos/MOT16-03.mp4', input_w=608, keep_res=False, load_model='../models/ctdet_coco_dla_2x.pth', lr=0.0001, lr_step=[20, 27], master_batch_size=4, mean=None, metric='loss', min_box_area=200, mse_loss=False, nID=502, nms_thres=0.4, norm_wh=False, not_cuda_benchmark=False, not_prefetch_test=False, not_reg_offset=False, num_classes=1, num_epochs=30, num_iters=-1, num_stacks=1, num_workers=8, off_weight=1, output_format='video', output_h=272, output_res=272, output_root='../results', output_w=152, pad=31, print_iter=0, reg_loss='l1', reg_offset=True, reid_dim=512, resume=False, root_dir='/content/FairMOT/src/lib/../..', save_all=False, save_dir='/content/FairMOT/src/lib/../../exp/mot/all_dla34', seed=317, std=None, task='mot', test=False, test_mot15=False, test_mot16=False, test_mot17=False, test_mot20=False, track_buffer=30, trainval=False, val_intervals=5, val_mot15=False, val_mot16=False, val_mot17=False, val_mot20=False, vis_thresh=0.5, wh_weight=0.1)
Creating model...
loaded ../models/ctdet_coco_dla_2x.pth, epoch 230
Skip loading parameter hm.2.weight, required shapetorch.Size([1, 256, 1, 1]), loaded shapetorch.Size([80, 256, 1, 1]). If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
Skip loading parameter hm.2.bias, required shapetorch.Size([1]), loaded shapetorch.Size([80]). If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
No param id.0.weight.If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
No param id.0.bias.If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
No param id.2.weight.If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
No param id.2.bias.If you see this, your model does not fully load the pre-trained weight. Please make sure you have correctly specified --arch xxx or set the correct --num_classes for your own dataset.
Starting training...
Traceback (most recent call last):
File "train.py", line 97, in <module>
main(opt)
File "train.py", line 64, in main
trainer.set_device(opt.gpus, opt.chunk_sizes, opt.device)
File "/content/FairMOT/src/lib/trains/base_trainer.py", line 36, in set_device
chunk_sizes=chunk_sizes).to(device)
File "/content/FairMOT/src/lib/models/data_parallel.py", line 127, in DataParallel
return torch.nn.DataParallel(module, device_ids, output_device, dim)
File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 133, in __init__
_check_balance(self.device_ids)
File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 19, in _check_balance
dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 19, in <listcomp>
dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 318, in get_device_properties
raise AssertionError("Invalid device id")
AssertionError: Invalid device id
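Reading the log together with the traceback: the Namespace shows gpus=[0, 1], while the machine has a single GPU, so the lookup for device id 1 is the likely trigger of the assertion. A rough way to check this before training starts, assuming the device count is known (invalid_device_ids is a hypothetical helper, and the range check only approximates what PyTorch does internally):

```python
def invalid_device_ids(device_ids, device_count):
    """Return the ids in `device_ids` that do not refer to an existing device."""
    return [i for i in device_ids if not 0 <= i < device_count]


# Values taken from the log above: gpus=[0, 1] on a machine with one GPU.
bad = invalid_device_ids([0, 1], device_count=1)
print(bad)  # [1]
```

If the list is non-empty, the configured GPU ids need to be reduced to ones that actually exist on the machine.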