r-pad / zephyr

Source code for ZePHyR: Zero-shot Pose Hypothesis Rating @ ICRA 2021
https://bokorn.github.io/zephyr/
MIT License

File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler _error_if_any_worker_fails() RuntimeError: DataLoader worker (pid 346793) is killed by signal: Aborted. #11

Closed by monajalal 1 year ago

monajalal commented 1 year ago

I am having a problem with DataLoader. Could you please help me with the fix?

(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python test.py     --model_name pn2     --dataset_root ./data/ycb/matches_data_test/     --dataset_name ycbv     --dataset HSVD_diff_uv_norm     --no_valid_proj --no_valid_depth     --loss_cutoff log     --exp_name final     --resume_path ./ckpts/final_ycbv.ckpt
exp_name: pn2_HSVD_diff_uv_norm_final
args.icp = True
Initializing ycbv dataset from ./data/ycb/matches_data_test/
Using BOP dataset format. Total dataset: 4123
Using PointNet Dataset
Initializating test dataset ['u', 'v', 'H_diff', 'S_diff', 'V_diff', 'D_diff', 'norm_cos']
dim_agg: 0 dim_point: 7
############ BOP test set: 1 ##############
No loss on the best hypotheses
PointNet2: extra_bottleneck_dim = 0
mask: [] xyz: [0, 1] points: [2, 3, 4, 5, 6]
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
Traceback (most recent call last):
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 346793) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mona/zephyr/python/zephyr/test.py", line 59, in <module>
    main(args)
  File "/home/mona/zephyr/python/zephyr/test.py", line 53, in main
    trainer.test(model, boptest_loader)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1064, in test
    self.fit(model)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
    self.dp_train(model)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
    self.run_pretrain_routine(model)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 982, in run_pretrain_routine
    self.run_evaluation(test_mode=True)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 377, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 346793) exited unexpectedly
Testing:   0%|          | 0/4123 [00:01<?, ?it/s]
monajalal commented 1 year ago

If I run it with num_workers=0, I still get an error:

(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python test.py     --model_name pn2     --dataset_root ./data/ycb/matches_data_test/     --dataset_name ycbv     --dataset HSVD_diff_uv_norm     --no_valid_proj --no_valid_depth     --loss_cutoff log     --exp_name final     --resume_path ./ckpts/final_ycbv.ckpt --num_workers=0
exp_name: pn2_HSVD_diff_uv_norm_final
args.icp = True
Initializing ycbv dataset from ./data/ycb/matches_data_test/
Using BOP dataset format. Total dataset: 4123
Using PointNet Dataset
Initializating test dataset ['u', 'v', 'H_diff', 'S_diff', 'V_diff', 'D_diff', 'norm_cos']
dim_agg: 0 dim_point: 7
############ BOP test set: 1 ##############
No loss on the best hypotheses
PointNet2: extra_bottleneck_dim = 0
mask: [] xyz: [0, 1] points: [2, 3, 4, 5, 6]
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
Aborted (core dumped)
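
A note on why num_workers=0 changes the message but not the outcome: a failed C/C++ assertion calls abort(), which raises SIGABRT and terminates the process, so no Python try/except can intercept it. With worker processes, the parent only sees that a worker was "killed by signal: Aborted"; with num_workers=0 the main process itself aborts. The sketch below reproduces this with the standard library only. It assumes a POSIX system, and run_and_report is a hypothetical helper name, not part of zephyr or PyTorch.

```python
import signal
import subprocess
import sys


def run_and_report(code):
    """Run a Python snippet in a child process and describe how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stderr=subprocess.DEVNULL,  # keep the child's own messages out of the way
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        return "killed by " + signal.Signals(-proc.returncode).name
    return "exited with code {}".format(proc.returncode)


if __name__ == "__main__":
    # os.abort() does what a failed C assert does: it raises SIGABRT.
    print(run_and_report("import os; os.abort()"))
    # An ordinary Python exit, for comparison.
    print(run_and_report("raise SystemExit(3)"))
```

The practical consequence is that the PyTorch and Lightning tracebacks are only the messenger here; the line to act on is the Eigen assertion printed just before the abort.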
monajalal commented 1 year ago

I figure the problem is this Eigen assertion:

Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.

Do you know how this can be fixed for your repo?
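
For context on the assertion quoted above: variable_if_dynamic<T, 3> stores a dimension that is fixed to 3 at compile time, and the assertion fires when a runtime size other than 3 is passed for it. That usually points to an array whose shape does not match what the compiled code expects. A minimal sketch of a guard that could run on the Python side before calling into the extension follows. Which array is involved and which layout zephyr_c expects are not stated in this thread, so the (N, 3) layout and the name check_fixed_columns are assumptions for illustration only.

```python
def check_fixed_columns(shape, expected_cols=3):
    """Return N if shape is (N, expected_cols); raise ValueError otherwise.

    The (N, 3) layout is an assumption, not something confirmed by zephyr.
    """
    shape = tuple(shape)
    if len(shape) != 2 or shape[1] != expected_cols:
        raise ValueError(
            "expected shape (N, {}), got {}".format(expected_cols, shape)
        )
    return shape[0]


if __name__ == "__main__":
    print(check_fixed_columns((1024, 3)))  # a well-formed point array
    try:
        check_fixed_columns((3, 1024))  # transposed layout
    except ValueError as err:
        print("rejected:", err)
```

With a NumPy array this would be called as check_fixed_columns(arr.shape), turning a hard abort into an ordinary Python exception that names the offending shape.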

georgegu1997 commented 1 year ago

Hi,

Can you try running the testing script on the LM-O dataset or running the notebook? Let me know if the bug still happens there.

Also, the zephyr_c module only implements a normal computation function. It could be replaced with an existing Python library to avoid the bug.
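
As a rough illustration of that idea, here is a minimal NumPy sketch of surface normal estimation. It is not the zephyr_c implementation: the input layout (an organized H x W x 3 point map, for example a back-projected depth image) and the name normals_from_point_map are assumptions, and the normals are not re-oriented toward the camera.

```python
import numpy as np


def normals_from_point_map(points):
    """Estimate unit normals for an organized point map of shape (H, W, 3).

    Assumed layout: points[r, c] is the 3D point seen at pixel (r, c).
    """
    points = np.asarray(points, dtype=np.float64)
    if points.ndim != 3 or points.shape[2] != 3:
        raise ValueError("expected shape (H, W, 3), got {}".format(points.shape))
    # Finite-difference tangent vectors along image columns and rows.
    d_col = np.gradient(points, axis=1)
    d_row = np.gradient(points, axis=0)
    # The normal is perpendicular to both tangents.
    normals = np.cross(d_col, d_row)
    lengths = np.linalg.norm(normals, axis=2, keepdims=True)
    # Avoid division by zero where the tangents are degenerate.
    return normals / np.maximum(lengths, 1e-12)


if __name__ == "__main__":
    # A flat plane at z = 1: x follows the column index, y the row index.
    cols, rows = np.meshgrid(np.arange(4.0), np.arange(3.0))
    plane = np.stack([cols, rows, np.ones_like(cols)], axis=2)
    print(normals_from_point_map(plane)[0, 0])
```

For unorganized point clouds, a neighborhood-based estimator from an existing point cloud library would be the closer equivalent; the grid version is shown only because it needs nothing beyond NumPy.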

monajalal commented 1 year ago

@georgegu1997 Thanks a lot for your response. I tried it with LM-O and here are the results. I still get the assertion failure from Eigen3.

(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python test.py     --model_name pn2     --dataset_root ./data/lmo/matches_data_test/     --dataset_name lmo     --dataset HSVD_diff_uv_norm     --no_valid_proj --no_valid_depth     --loss_cutoff log     --exp_name final     --resume_path ./ckpts/final_lmo.ckpt
exp_name: pn2_HSVD_diff_uv_norm_final
args.inconst_ratio_th = 100
Initializing lmo dataset from ./data/lmo/matches_data_test/
Using BOP dataset format. Total dataset: 1445
Using PointNet Dataset
Initializating test dataset ['u', 'v', 'H_diff', 'S_diff', 'V_diff', 'D_diff', 'norm_cos']
dim_agg: 0 dim_point: 7
############ BOP test set: 1 ##############
No loss on the best hypotheses
PointNet2: extra_bottleneck_dim = 0
mask: [] xyz: [0, 1] points: [2, 3, 4, 5, 6]
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
Traceback (most recent call last):
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 40319) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mona/zephyr/python/zephyr/test.py", line 59, in <module>
    main(args)
  File "/home/mona/zephyr/python/zephyr/test.py", line 53, in main
    trainer.test(model, boptest_loader)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1064, in test
    self.fit(model)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
    self.dp_train(model)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
    self.run_pretrain_routine(model)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 982, in run_pretrain_routine
    self.run_evaluation(test_mode=True)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 377, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 40319) exited unexpectedly
Testing:   0%|          | 0/1445 [00:00<?, ?it/s]

I will report back the notebook result shortly.

georgegu1997 commented 1 year ago

Hi @monajalal

Thanks for the description! I reproduced the results in the notebook successfully just now and I did not encounter the issue you described.

I followed the procedure to set up the environment as described here. One change I made is that I used conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia to install the newest version of pytorch as I ran it on an RTX 3090.

I think this is an issue with Eigen or another non-Python dependency, possibly caused by a version discrepancy. Could you try setting up the environment using these instructions?
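
To make a version comparison between two machines easier, a small report like the following could be pasted into the thread from each side. This is only a sketch: collect_versions is a hypothetical helper, and the package names in the example are guesses based on the tracebacks above (torch, pytorch-lightning) plus numpy as an assumption.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return a dict with the interpreter, platform and package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Package names are assumptions; adjust to the actual environment.
    for key, value in collect_versions(
        ["torch", "pytorch-lightning", "numpy"]
    ).items():
        print("{}: {}".format(key, value))
```

The Eigen version is not visible from Python. For the system headers under /usr/include/eigen3 shown in the log, it is defined by the EIGEN_WORLD_VERSION, EIGEN_MAJOR_VERSION and EIGEN_MINOR_VERSION macros in Eigen/src/Core/util/Macros.h.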

monajalal commented 1 year ago

https://github.com/r-pad/zephyr/issues/14#issuecomment-1579472005