Hi @wozcoder!
You have a CUDA out-of-memory error, which means your GPU does not have enough memory. You can easily fix this by lowering the batch size with the "-b" parameter. By default it is 64; try 48, 32, or even less until you no longer get an out-of-memory error.
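For example, assuming the same training command shown later in this thread, the batch size could be lowered like this (the "-b" flag is the one mentioned above; the other arguments are copied from the command below):
python object_detection.py -path path/to/GEN1_dataset -backbone vgg-11 -T 5 -tbin 2 -b 32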
Hello @loiccordone, I lowered the batch size to 16 during training. It went through the preprocessing step and started the first epoch of training, but then it gave me the following error.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 429/429 [2:26:34<00:00, 20.5s/File]
Done! File saved as datasets/gen1/gen1_val_100_20.0ms_2tbin.pt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Number of parameters: 12652695
| Name | Type | Params
-----------------------------------------------------------------
0 | backbone | DetectionBackbone | 11.9 M
1 | anchor_generator | GridSizeDefaultBoxGenerator | 0
2 | head | SSDHead | 742 K
-----------------------------------------------------------------
12.7 M Trainable params
0 Non-trainable params
12.7 M Total params
50.611 Total estimated model params size (MB)
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272172048/work/aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Validation sanity check: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:26<00:00, 10.84s/it]
[0] val results:
creating index...
index created!
Loading and preparing results...
DONE (t=0.02s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.29s).
Accumulating evaluation results...
DONE (t=0.02s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
Epoch 0: 22%|█████████████ | 999/4510 [29:49<1:44:44, 1.79s/it, loss=3.39, train_loss_bbox_step=2.750, train_loss_classif_step=0.627, train_loss_step=3.380]Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 328, in reduce_storage
RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1634272172048/work/aten/src/ATen/MapAllocator.cpp":263, please report a bug to PyTorch. unable to open shared memory object </torch_10873_8348> in read-write mode
Epoch 0: 22%|████████████▊ | 1000/4510 [29:51<1:44:42, 1.79s/it, loss=3.42, train_loss_bbox_step=2.810, train_loss_classif_step=0.599, train_loss_step=3.410]Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 328, in reduce_storage
RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1634272172048/work/aten/src/ATen/MapAllocator.cpp":263, please report a bug to PyTorch. unable to open shared memory object </torch_10920_8346> in read-write mode
Epoch 0: 22%|████████████▊ | 1001/4510 [29:53<1:44:40, 1.79s/it, loss=3.43, train_loss_bbox_step=2.870, train_loss_classif_step=0.644, train_loss_step=3.520]Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 328, in reduce_storage
RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1634272172048/work/aten/src/ATen/MapAllocator.cpp":263, please report a bug to PyTorch. unable to open shared memory object </torch_10956_8344> in read-write mode
Epoch 0: 22%|████████████▉ | 1002/4510 [29:55<1:44:38, 1.79s/it, loss=3.43, train_loss_bbox_step=2.410, train_loss_classif_step=0.705, train_loss_step=3.120]Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 328, in reduce_storage
RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1634272172048/work/aten/src/ATen/MapAllocator.cpp":263, please report a bug to PyTorch. unable to open shared memory object </torch_10991_8342> in read-write mode
Epoch 0: 22%|████████████▉ | 1003/4510 [29:56<1:44:36, 1.79s/it, loss=3.41, train_loss_bbox_step=2.410, train_loss_classif_step=0.627, train_loss_step=3.030]Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/resource_sharer.py", line 149, in _serve
send(conn, destination_pid)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/resource_sharer.py", line 50, in send
reduction.send_handle(conn, new_fd, pid)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/reduction.py", line 183, in send_handle
Traceback (most recent call last):
File "object_detection.py", line 135, in <module>
with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/socket.py", line 543, in fromfd
main()
File "object_detection.py", line 127, in main
trainer.fit(module, train_dataloader, val_dataloader)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
self._run(model)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
self._dispatch()
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
self.accelerator.start_training(self)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
return self._run_train()
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
self.fit_loop.run()
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
epoch_output = self.epoch_loop.run(train_dataloader)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 118, in advance
_, (batch, is_last) = next(dataloader_iter)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/profiler/base.py", line 104, in profile_iterable
value = next(iterator)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 672, in prefetch_iterator
for val in it:
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 589, in __next__
return self.request_next_batch(self.loader_iters)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 617, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next_fn)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 96, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 604, in next_fn
batch = next(iterator)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
idx, data = self._get_data()
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
success, data = self._try_get_data()
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
fd = df.detach()
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
return recvfds(s, 1)[0]
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/multiprocessing/reduction.py", line 159, in recvfds
raise EOFError
EOFError
Epoch 0: 22%|██▏ | 1003/4510 [29:58<1:44:43, 1.79s/it, loss=3.41, train_loss_bbox_step=2.410, train_loss_classif_step=0.627, train_loss_step=3.030]
I have a good GPU, as you can see below, but I do not understand why I am getting the error above:
(pytorch_p38) sh-4.2$ nvidia-smi
Thu Jul 14 10:50:11 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 26C P8 16W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
(pytorch_p38) sh-4.2$
Can you please tell me what causes this "EOFError"?
Hi, here is my environment.
Package                    Version
absl-py                    1.1.0
aiohttp                    3.8.1
aiosignal                  1.2.0
async-timeout              4.0.2
attrs                      21.4.0
cachetools                 5.2.0
certifi                    2022.6.15
charset-normalizer         2.1.0
cycler                     0.11.0
fastrlock                  0.8
fonttools                  4.33.3
frozenlist                 1.3.0
fsspec                     2022.5.0
future                     0.18.2
google-auth                2.9.0
google-auth-oauthlib       0.4.6
grpcio                     1.47.0
idna                       3.3
importlib-metadata         4.12.0
kiwisolver                 1.4.3
Markdown                   3.3.7
matplotlib                 3.5.2
multidict                  6.0.2
numpy                      1.23.0
oauthlib                   3.2.0
packaging                  21.3
Pillow                     9.2.0
pip                        21.2.4
protobuf                   3.19.4
pyasn1                     0.4.8
pyasn1-modules             0.2.8
pycocotools                2.0.4
pyDeprecate                0.3.1
pyparsing                  3.0.9
python-dateutil            2.8.2
pytorch-lightning          1.4.4
PyYAML                     6.0
requests                   2.28.1
requests-oauthlib          1.3.1
rsa                        4.8
scipy                      1.8.1
setuptools                 61.2.0
six                        1.16.0
spikingjelly               0.0.0.0.12
tensorboard                2.9.1
tensorboard-data-server    0.6.1
tensorboard-plugin-wit     1.8.1
torch                      1.11.0
torchmetrics               0.5.0
torchvision                0.12.0
tqdm                       4.64.0
typing_extensions          4.3.0
urllib3                    1.26.9
Werkzeug                   2.1.2
wheel                      0.37.1
yarl                       1.7.2
zipp                       3.8.0
Hello, when I run the following command:
python object_detection.py -path path/to/GEN1_dataset -backbone vgg-11 -T 5 -tbin 2
the code builds the dataset, but it gets stuck during the first epoch of training. Above is the error that I am receiving. Could you please help me resolve this issue?
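For reference, below is a minimal Python sketch of a commonly suggested workaround for the "unable to open shared memory object" / EOFError shown above, assuming it comes from the DataLoader worker processes exhausting shared memory or file descriptors. This is an assumption about the cause, not something confirmed for this repository, and the snippet is not part of object_detection.py:

# Assumed workaround sketch, not part of this repository.
# Share tensors between DataLoader workers through the file system
# instead of file descriptors, which avoids hitting the open-file limit.
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

# Alternatively, load batches in the main process to rule out shared
# memory entirely (slower, but useful for diagnosis). 'train_dataset'
# is a placeholder name here, not an identifier from the repository.
# from torch.utils.data import DataLoader
# train_dataloader = DataLoader(train_dataset, batch_size=16, num_workers=0)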