loiccordone / object-detection-with-spiking-neural-networks

Repository code for the IJCNN 2022 paper "Object Detection with Spiking Neural Networks on Automotive Event Data"
MIT License

Question about EOFError #4

Closed xxyll closed 2 years ago

xxyll commented 2 years ago

Hello, today I tried to run python object_detection.py -path ./PropheseeDataset -backbone vgg-11 -T 5 -tbin 2 -b 8 -epochs 2. During the first epoch of training it raises an 'EOFError'. Do you know how to solve it?

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000

Epoch 0:  13%|▏| 226/1679 [03:29<22:20, 1.08it/s, loss=4.55, train_loss_bbox_st
Traceback (most recent call last):
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 348, in reduce_storage
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__
OSError: [Errno 24] Too many open files

(the same OSError traceback repeats at steps 227/1679 and 228/1679)

Epoch 0:  14%|▏| 230/1679 [03:33<22:16, 1.08it/s, loss=4.44, train_loss_bbox_st
Traceback (most recent call last):
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/reduction.py", line 183, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/socket.py", line 543, in fromfd
    nfd = dup(fd)
OSError: [Errno 24] Too many open files

Traceback (most recent call last):
  File "/home/lxy/Experiment/object-detection-with-spiking-neural-networks/object_detection.py", line 135, in <module>
    main()
  File "/home/lxy/Experiment/object-detection-with-spiking-neural-networks/object_detection.py", line 127, in main
    trainer.fit(module, train_dataloader, val_dataloader)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 118, in advance
    _, (batch, is_last) = next(dataloader_iter)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/profiler/base.py", line 104, in profile_iterable
    value = next(iterator)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 672, in prefetch_iterator
    for val in it:
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 589, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 617, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next_fn)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 96, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 604, in next_fn
    batch = next(iterator)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/lxy/anaconda3/envs/SNN-SJ/lib/python3.8/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError
Epoch 0:  14%|█▎ | 230/1679 [03:33<22:22, 1.08it/s, loss=4.44, train_loss_bbox_step=2.800, train_loss_classif_step=0.877, train_loss_step=3.680]

loiccordone commented 2 years ago

Hello,

I have never encountered this error. The root cause seems to be the earlier one, "OSError: [Errno 24] Too many open files", which is raised by the DataLoader worker processes rather than by my code. You could try some of the solutions proposed in this Stack Overflow thread: Python Subprocess: Too Many Open Files
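For reference (this is not part of the repository), two workarounds commonly suggested for this error with multi-worker PyTorch DataLoaders are raising the per-process file-descriptor limit (the equivalent of ulimit -n) or switching PyTorch's tensor sharing strategy. A minimal sketch using only the standard library; the PyTorch call is shown as a comment since it requires torch to be installed:

```python
import resource

# Inspect the per-process limit on open file descriptors (ulimit -n).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit up to the hard limit before creating DataLoaders,
# so worker processes have more descriptors available for shared tensors.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Alternatively, with PyTorch, share tensors through the filesystem
# instead of file descriptors:
#   import torch.multiprocessing
#   torch.multiprocessing.set_sharing_strategy("file_system")
```

Reducing the number of DataLoader workers also lowers the number of descriptors in flight, at the cost of slower data loading.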

Good luck!

xxyll commented 2 years ago

Thanks, I solved this problem! I'm now running the training code; it may take a day. I see there is '-pretrained path/to/pretrained_model' in the 'test' instructions. I'd like to ask in advance: will the 'pretrained_model' be generated after training?

loiccordone commented 2 years ago

Hello, unfortunately no, not if you didn't specify the -save_ckpt parameter. I will change that so that the default behavior is to save checkpoints. Sorry for the inconvenience.
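Until that default changes, checkpoint saving can be requested explicitly by re-running the training command from the start of this thread with the flag added:

```shell
python object_detection.py -path ./PropheseeDataset -backbone vgg-11 \
    -T 5 -tbin 2 -b 8 -epochs 2 -save_ckpt
```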

xxyll commented 2 years ago

Oh I see it. I'm going to retrain and wait for the results~

xxyll commented 2 years ago

Hello, unfortunately no if you didn't specify the -save_ckpt parameter.

I'd like to ask: -save_ckpt sets 'save_top_k=3'. Do I need to adjust this parameter according to the actual number of epochs? After training, a file named 'ckpt-od-gen1-vgg-11' may be generated; is that the path to pass as 'pretrained_model'?

loiccordone commented 2 years ago

No, you don't need to adjust save_top_k; it's the number of checkpoints kept by PyTorch Lightning, independent of how many epochs you train. 3 means that it will save the 3 best models according to the monitored metric. Check out the PyTorch Lightning documentation for more information. Yes, that would be the path of the pretrained model.
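For intuition, save_top_k=3 simply means "keep the 3 best checkpoints seen so far and discard the rest as training progresses". A minimal pure-Python sketch of that bookkeeping (a hypothetical helper for illustration, not Lightning's actual implementation), assuming the monitored metric is a loss to be minimized:

```python
import heapq

class TopKCheckpoints:
    """Mimic the bookkeeping behind Lightning's save_top_k with mode='min'."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # entries are (-loss, path); heap root = worst kept model

    def update(self, loss, path):
        """Offer a new checkpoint; return the path of an evicted one, if any."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-loss, path))
            return None
        if -self._heap[0][0] > loss:  # new model beats the worst of the top k
            _, evicted = heapq.heapreplace(self._heap, (-loss, path))
            return evicted  # Lightning would delete this checkpoint file
        return None  # not good enough to enter the top k

    def best(self):
        """Path of the best (lowest-loss) checkpoint currently kept."""
        return max(self._heap)[1]
```

With k=3 and checkpoints arriving with losses 4.0, 3.0, 5.0, 2.0, the fourth one evicts the loss-5.0 checkpoint, regardless of how many epochs are run in total.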