j96w / DenseFusion

"DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion" code repository
https://sites.google.com/view/densefusion
MIT License

Too many files open #143

Closed t-walker-21 closed 4 years ago

t-walker-21 commented 4 years ago

Hello. I am attempting to train with my own dataset. I encounter this strange issue when the training script reaches somewhere between epochs 495 and 500:

```
Traceback (most recent call last):
  File "./tools/train.py", line 237, in <module>
    main()
  File "./tools/train.py", line 131, in main
    for i, data in enumerate(dataloader, 0):
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 113, in get
    return ForkingPickler.loads(res)
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 160, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
```
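The `RuntimeError: received 0 items of ancdata` above is the usual symptom of the process hitting its open-file-descriptor limit: PyTorch's default `file_descriptor` sharing strategy passes one descriptor per shared tensor between workers. A common workaround (an assumption on my part, not confirmed in this thread) is to raise the soft limit in the shell before launching training:

```shell
# Inspect the current soft limit on open file descriptors
# (1024 by default on many Linux distributions).
ulimit -n

# Raise the soft limit for this shell session before launching training;
# the exact value here is illustrative, not taken from this thread.
ulimit -n 4096
```

This only helps if the hard limit is high enough; otherwise the limit has to be raised system-wide (e.g. in `/etc/security/limits.conf`).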

I also tried setting `num_workers` to 0, and setting the torch multiprocessing sharing strategy to `'file_system'`. That, in turn, gives this error:

```
----------epoch 500 train finish---------<<<<<<<<
2020-03-16 17:00:27,330 : Test time 00h 07m 42s, Testing started
Traceback (most recent call last):
  File "./tools/train.py", line 238, in <module>
  File "./tools/train.py", line 176, in main
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
  File "/home/twalker81/DenseFusion/datasets/ycb/dataset.py", line 99, in __getitem__
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/PIL/Image.py", line 2809, in open
OSError: [Errno 24] Too many open files: './datasets/ycb/YCB_Video_Dataset/data/0000/000001-depth.png'
```

I thought maybe I was doing something wrong in my dataset.py; however, I also get the same error with the default YCB train script. I am on the Pytorch-1.0 branch with torch==1.4.0. What is the problem?
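For what it's worth, the `OSError: [Errno 24] Too many open files` raised inside `PIL/Image.py` suggests image handles that are opened but never closed: `PIL.Image.open` is lazy and keeps the file descriptor alive until the image data is loaded or the file is explicitly closed, so a `Dataset.__getitem__` that touches thousands of files across epochs can exhaust the limit. Below is a minimal stdlib-only sketch of the leak pattern and the fix; the paths and counts are illustrative, not from the repo, and the fd counter is Linux-only:

```python
import os
import tempfile

def open_fd_count():
    """Number of file descriptors this process currently holds (Linux /proc)."""
    return len(os.listdir('/proc/self/fd'))

# A throwaway file to read from.
fd, path = tempfile.mkstemp()
os.close(fd)

baseline = open_fd_count()

# Leaky pattern: each open() keeps a descriptor alive until the object is
# garbage-collected -- analogous to PIL.Image.open() with no load/close.
leaked = [open(path, 'rb') for _ in range(5)]
assert open_fd_count() == baseline + 5

for f in leaked:
    f.close()

# Safe pattern: a context manager releases the descriptor immediately.
# In a Dataset __getitem__ the PIL equivalent would be:
#     with Image.open(img_path) as im:
#         arr = np.array(im)  # copies the pixels, then the file closes
with open(path, 'rb') as f:
    f.read()
assert open_fd_count() == baseline

os.remove(path)
```

Wrapping every `Image.open` in the dataset code with a `with` block (or calling `.close()` after converting to an array) keeps the descriptor count flat no matter how many epochs run.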

j96w commented 4 years ago

Hi, why train for 500 epochs? There is no need to train that long.

t-walker-21 commented 4 years ago

Isn't 500 the default? https://github.com/j96w/DenseFusion/blob/1bf531bacf1c9a73af8a99189b2a52fd6de0d969/tools/train.py#L43

j96w commented 4 years ago

It's a maximum. You will have good results well before that (for YCB, about 100 epochs). Also, please double-check your file loading; I am not sure what the problem is here.

t-walker-21 commented 4 years ago

Okay. I have been trying to debug dataset.py on my own dataset. What confuses me is that if I try to overfit on a small set of 3 data points from LineMOD, training reaches the 12 mm refine_margin, but if I print the target translation and the predicted translation (which I get from the pred_c argmax), they are not close together, especially the Z translation. How can the average distance reach 12 mm when the predicted and target translations are so different?

t-walker-21 commented 4 years ago

Okay, I see how it is computed in eval_X.py.
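For anyone else landing here: the evaluation scripts score a pose with the average distance between the object's model points transformed by the predicted and ground-truth poses (the ADD metric), not by comparing translation vectors directly. A minimal sketch of that computation, assuming the standard ADD formulation (the function and variable names are mine, not from the repo):

```python
import numpy as np

def add_metric(model_points, R_pred, t_pred, R_gt, t_gt):
    """Average Distance (ADD): mean L2 distance between the model's
    points transformed by the predicted and ground-truth poses."""
    pred = model_points @ R_pred.T + t_pred   # (N, 3) points under predicted pose
    gt = model_points @ R_gt.T + t_gt         # (N, 3) points under ground-truth pose
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

# With identical poses the distance is exactly zero.
pts = np.random.default_rng(0).random((500, 3))
R = np.eye(3)
assert add_metric(pts, R, np.zeros(3), R, np.zeros(3)) == 0.0

# A pure 10 mm shift along Z gives an ADD of exactly 0.010 m,
# since every point moves by the same offset.
assert np.isclose(add_metric(pts, R, np.array([0.0, 0.0, 0.01]),
                             R, np.zeros(3)), 0.01)
```

Because rotation and translation errors can partially cancel across the point cloud, the ADD can sit under a margin like 12 mm even when an individual pose component (e.g. the Z translation) looks off in isolation.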