Closed — t-walker-21 closed this issue 4 years ago.
Hi, why train for 500 epochs? There is no need to train that much.
It's a maximum number. You will have good results way before that (for YCB, it's about 100 epochs). Also, please double-check your file loading; I'm not sure about the problem here.
Okay. I have been trying to debug my dataset.py on my own dataset. What confuses me is: if I try to overfit on a small set of 3 data points from LineMOD, the training reaches the 12 mm refine_margin. But if I print the target translation and the predicted translation (which I get from the pred_c argmax), they are not close together, especially the Z translation. How can the avg dist reach 12 mm when the predicted and target translations are so different?
Okay, I see how it is computed in eval_X.py
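For anyone hitting the same confusion, here is a minimal sketch of the two computations, following the conventions of DenseFusion's eval scripts; the tensor shapes and the `quaternion_matrix` helper from lib/transformations.py are assumptions based on the repo layout:

```python
import torch
import numpy as np
from lib.transformations import quaternion_matrix  # helper vendored by DenseFusion

# Assumed shapes, following the eval scripts:
#   pred_r: (1, N, 4) per-point quaternions, pred_t: (1, N, 3) per-point offsets,
#   pred_c: (1, N, 1) confidences, points: (1, N, 3) sampled depth points,
#   model_points / target: (M, 3) model points in object / ground-truth frame.

_, which_max = torch.max(pred_c.view(-1), 0)        # most confident point
my_r = pred_r.view(-1, 4)[which_max].cpu().data.numpy()

# The network predicts an offset FROM each depth point, so the final
# translation is point + offset; comparing raw pred_t against the target
# translation (especially in Z) looks wrong even when the fit is good.
my_t = (points.view(-1, 3) + pred_t.view(-1, 3))[which_max].cpu().data.numpy()

# The avg dist checked against the 12 mm refine_margin compares the model
# points under both poses, not the translations directly:
R = quaternion_matrix(my_r)[:3, :3]
pred = np.dot(model_points, R.T) + my_t
dis = np.mean(np.linalg.norm(pred - target, axis=1))
```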
Hello. I am attempting to train with my own dataset. I encounter this strange issue when the training script reaches between 495 and 500 epochs:
```
Traceback (most recent call last):
  File "./tools/train.py", line 237, in <module>
    main()
  File "./tools/train.py", line 131, in main
    for i, data in enumerate(dataloader, 0):
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 113, in get
    return ForkingPickler.loads(res)
  File "/home/twalker81/virtualEnvironments/dense-fusion/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 160, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
```
I also tried setting num_workers to 0, and setting the torch.multiprocessing sharing strategy to 'file_system'. That, in turn, gives this error:
I thought maybe I was doing something wrong with my dataset.py; however, I also get the same error with the default YCB train script. I am on the Pytorch-1.0 branch with torch==1.4.0. What is the problem?
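For what it's worth, `RuntimeError: received 0 items of ancdata` typically means the DataLoader workers exhausted the process's file-descriptor limit while passing tensors between processes. A commonly cited workaround (a sketch, not something the repo does by default) is to switch PyTorch's sharing strategy before any DataLoader is created, e.g. near the top of tools/train.py:

```python
import torch.multiprocessing

# Pass tensors between worker processes via the filesystem instead of
# file descriptors, so long runs don't hit the per-process fd limit.
torch.multiprocessing.set_sharing_strategy('file_system')
```

Raising the OS limit before launching training (e.g. `ulimit -n 4096` or higher) is the other common fix.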