qfgaohao / pytorch-ssd

MobileNetV1, MobileNetV2, VGG based SSD/SSD-lite implementation in Pytorch 1.0 / Pytorch 0.4. Out-of-box support for retraining on Open Images dataset. ONNX and Caffe2 support. Experiment Ideas like CoordConv.
https://medium.com/@smallfishbigsea/understand-ssd-and-implement-your-own-caa3232cd6ad
MIT License
1.39k stars · 529 forks

Training doesn't start - I'm getting an error with the data loader #168

Open rsamvelyan opened 2 years ago

rsamvelyan commented 2 years ago

Hello

I am trying to run it on my Windows machine. My dataset seems to be correct, but when I start training I get an error.

Here is the call with the arguments:

python train_ssd.py --dataset_type open_images --datasets C:/Users/rsamv/Documents/data/open_images_datasets/apples --net mb1-ssd --pretrained_ssd C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth --scheduler cosine --lr 0.01 --t_max 100 --validation_epochs 5 --num_epochs 100 --base_net_lr 0.01 --batch_size 5

Here is the error I get:

(base) PS C:\Users\rsamv> cd C:\Users\rsamv\Documents\pytorch-ssd
(base) PS C:\Users\rsamv\Documents\pytorch-ssd> python train_ssd.py --dataset_type open_images --datasets C:/Users/rsamv/Documents/data/open_images_datasets/apples --net mb1-ssd --pretrained_ssd C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth --scheduler cosine --lr 0.01 --t_max 100 --validation_epochs 5 --num_epochs 100 --base_net_lr 0.01 --batch_size 5
2021-12-07 23:11:37,702 - root - INFO - Use Cuda.
2021-12-07 23:11:37,703 - root - INFO - Namespace(dataset_type='open_images', datasets=['C:/Users/rsamv/Documents/data/open_images_datasets/apples'], validation_dataset=None, balance_data=False, net='mb1-ssd', freeze_base_net=False, freeze_net=False, mb2_width_mult=1.0, lr=0.01, momentum=0.9, weight_decay=0.0005, gamma=0.1, base_net_lr=0.01, extra_layers_lr=None, base_net=None, pretrained_ssd='C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', milestones='80,100', t_max=100.0, batch_size=5, num_epochs=100, num_workers=4, validation_epochs=5, debug_steps=100, use_cuda=True, checkpoint_folder='models/')
2021-12-07 23:11:37,703 - root - INFO - Prepare training datasets.
2021-12-07 23:11:38,263 - root - INFO - Dataset Summary:
Number of Images: 1344
Minimum Number of Images for a Class: -1
Label Distribution:
apple: 5376
2021-12-07 23:11:38,277 - root - INFO - Stored labels into file models/open-images-model-labels.txt.
2021-12-07 23:11:38,278 - root - INFO - Train dataset size: 1344
2021-12-07 23:11:38,279 - root - INFO - Prepare Validation datasets.
2021-12-07 23:11:38,472 - root - INFO - Dataset Summary:
Number of Images: 480
Minimum Number of Images for a Class: -1
Label Distribution:
apple: 1920
2021-12-07 23:11:38,476 - root - INFO - validation dataset size: 480
2021-12-07 23:11:38,477 - root - INFO - Build network.
2021-12-07 23:11:38,537 - root - INFO - Init from pretrained ssd C:/Users/rsamv/Documents/pytorch-ssd/models/mb1-ssd/mobilenet-v1-ssd-mp-0_675.pth
2021-12-07 23:11:38,583 - root - INFO - Took 0.05 seconds to load the model.
2021-12-07 23:11:38,996 - root - INFO - Learning rate: 0.01, Base net learning rate: 0.01, Extra Layers learning rate: 0.01.
2021-12-07 23:11:38,997 - root - INFO - Uses CosineAnnealingLR scheduler.
2021-12-07 23:11:38,997 - root - INFO - Start training from epoch 0.
Traceback (most recent call last):
  File "C:\Users\rsamv\Documents\pytorch-ssd\train_ssd.py", line 325, in <module>
    train(train_loader, net, criterion, optimizer,
  File "C:\Users\rsamv\Documents\pytorch-ssd\train_ssd.py", line 116, in train
    for i, data in enumerate(loader):
  File "C:\Users\rsamv\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "C:\Users\rsamv\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\rsamv\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 918, in __init__
    w.start()
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TrainAugmentation.__init__.<locals>.<lambda>'
(base) PS C:\Users\rsamv\Documents\pytorch-ssd> 2021-12-07 23:11:40,772 - root - INFO - Use Cuda.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\rsamv\anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

It seems the loader variable has a problem. I wonder if it's caused by some incompatibility with Windows, for instance at the path level?
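
For what it's worth, here is a minimal sketch, completely independent of this repo (every name in it is mine, not from train_ssd.py), that reproduces this kind of failure. On Windows, DataLoader workers are started with the spawn method, which has to pickle the dataset together with its transform, and a lambda created inside __init__ can't be pickled:

```python
# Minimal sketch (all names invented, nothing from this repo) of the failure:
# spawn-based DataLoader workers must pickle the dataset, and pickle refuses
# lambdas defined inside a method ("local objects").
import torch
from torch.utils.data import Dataset, DataLoader

class ToyAugmentation:
    def __init__(self):
        # qualname is 'ToyAugmentation.__init__.<locals>.<lambda>' --
        # the same shape as the object named in the AttributeError above.
        self.transform = lambda x: x * 2

class ToyDataset(Dataset):
    def __init__(self):
        self.augment = ToyAugmentation()

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return self.augment.transform(torch.tensor(float(idx)))

if __name__ == "__main__":
    # num_workers > 0 starts worker processes. Under spawn (Windows default)
    # this raises AttributeError: Can't pickle local object '...<lambda>';
    # under fork (Linux default) nothing is pickled, so the same code runs.
    loader = DataLoader(ToyDataset(), batch_size=4, num_workers=2)
    for batch in loader:
        print(batch)
```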

Any ideas?

Thanks a lot!

gururaj-bhat commented 2 years ago

I am also facing a similar issue, but on Ubuntu. Training gets stuck at this point: https://github.com/qfgaohao/pytorch-ssd/blob/master/train_ssd.py#L116

I guess this is because of the PyTorch version; I am using the latest 1.10 release, and perhaps we should strictly use 1.0.0 only.
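
If it helps to compare environments, here is a quick generic check of the versions actually in use (plain Python, nothing specific to this repo):

```python
# Generic diagnostic: print the torch/torchvision versions and CUDA status.
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```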

jyan-R commented 2 years ago

> I am trying to run it on my Windows machine. My dataset seems to be correct, but when I start training I get an error. […] It seems the loader variable has a problem. I wonder if it's caused by some incompatibility with Windows, for instance at the path level?

I have the same issue. Have you found any way to solve it? Thanks!

jyan-R commented 2 years ago

> I guess this is because of the PyTorch version; I am using the latest 1.10 release, and perhaps we should strictly use 1.0.0 only.

Using 1.0.0 raises a new problem: module 'torch.jit' has no attribute 'unused'. That error most likely comes from a newer torchvision calling torch.jit.unused, which old torch releases don't provide, so the torch and torchvision versions need to match each other.

Biswajit-Banerjee commented 1 year ago

> I am trying to run it on my Windows machine. My dataset seems to be correct, but when I start training I get an error. […] It seems the loader variable has a problem. I wonder if it's caused by some incompatibility with Windows, for instance at the path level?

I am also getting the same issue.

From what I could gather, the problem is pickling the lambda function used for data augmentation when the DataLoader starts its worker processes. Disabling multiprocessing data loading worked for me: either pass --num_workers 0 on the command line, or make 0 the default value of num_workers in train_ssd.py. With num_workers=0 the data is loaded in the main process, so the dataset and its lambda are never pickled.
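
For anyone who wants num_workers > 0 back (e.g. for throughput), the longer-term fix is to make the transform picklable instead of a lambda. Here is a sketch of the idea; Standardize and its (img, boxes, labels) signature are illustrative guesses, not code taken from train_ssd.py:

```python
# Sketch: replace a lambda-based transform with a top-level callable class,
# which pickle serializes by name, so spawn-based workers can load it.
# `Standardize` and its signature are illustrative, not from this repository.
import pickle

class Standardize:
    def __init__(self, std):
        self.std = std

    def __call__(self, img, boxes=None, labels=None):
        return img / self.std, boxes, labels

if __name__ == "__main__":
    # Unlike a lambda, this round-trips through pickle -- exactly what
    # multiprocessing's spawn start method requires of dataset transforms.
    restored = pickle.loads(pickle.dumps(Standardize(128.0)))
    print(restored(256.0))  # (2.0, None, None)
```

With the lambdas replaced this way, multi-worker loading should work again even on Windows.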