monajalal closed this issue 1 year ago
If I run it with num_workers=0, I still get the error:
(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python test.py --model_name pn2 --dataset_root ./data/ycb/matches_data_test/ --dataset_name ycbv --dataset HSVD_diff_uv_norm --no_valid_proj --no_valid_depth --loss_cutoff log --exp_name final --resume_path ./ckpts/final_ycbv.ckpt --num_workers=0
exp_name: pn2_HSVD_diff_uv_norm_final
args.icp = True
Initializing ycbv dataset from ./data/ycb/matches_data_test/
Using BOP dataset format. Total dataset: 4123
Using PointNet Dataset
Initializating test dataset ['u', 'v', 'H_diff', 'S_diff', 'V_diff', 'D_diff', 'norm_cos']
dim_agg: 0 dim_point: 7
############ BOP test set: 1 ##############
No loss on the best hypotheses
PointNet2: extra_bottleneck_dim = 0
mask: [] xyz: [0, 1] points: [2, 3, 4, 5, 6]
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
Aborted (core dumped)
I figure the problem is this Eigen assertion:
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
Do you know how this can be fixed for your repo?
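For context, this assertion lives in the constructor of Eigen's variable_if_dynamic, and it fires when a dimension fixed at compile time (here 3) is given a different size at run time. Because the process aborts inside native code, no Python traceback is printed, so it is not obvious which Python call triggered it. One way to narrow that down is Python's standard faulthandler module, which dumps the Python stack when the process receives SIGABRT. The sketch below only demonstrates the mechanism: the child script is a stand-in that calls os.abort(), and `compute_normals` is a hypothetical name, not a function taken from zephyr.

```python
import signal
import subprocess
import sys
import textwrap

# Stand-in for the crashing script: os.abort() raises SIGABRT, the same
# signal a failed C/C++ assert() raises. This is NOT zephyr code.
CHILD = textwrap.dedent("""
    import faulthandler, os
    faulthandler.enable()          # dump the Python stack on fatal signals

    def compute_normals():         # hypothetical name for the native call
        os.abort()

    compute_normals()
""")


def run_child():
    """Run the stand-in script and return (return code, stderr text)."""
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          capture_output=True, text=True)
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    code, err = run_child()
    print("exit code:", code)      # negative signal number on POSIX
    print(err)                     # includes the Python frames at abort time
```

The same dump can be requested without editing any code by starting the interpreter with `python -X faulthandler test.py` followed by the usual arguments. With num_workers=0 the data loading runs in the main process, so the printed frames should point at the Python line that calls into the compiled module.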
Hi,
Can you try running the testing script on the LM-O dataset or running the notebook? Let me know if the bug still happens there.
Also, our zephyr_c module only implements a normal computation function. It could be replaced with an existing Python library to avoid the bug.
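As a rough sketch of what such a replacement could look like, NumPy alone may be enough. This assumes the normals are wanted per pixel for an organized (H, W, 3) point map; the actual inputs and outputs of the zephyr_c function are not shown in this thread, so `normals_from_point_map` is a hypothetical name and interface, not the repo's API.

```python
import numpy as np


def normals_from_point_map(points):
    """Estimate per-pixel unit normals from an organized (H, W, 3) point map."""
    points = np.asarray(points, dtype=np.float64)
    # Reject bad shapes in Python instead of letting native code abort.
    if points.ndim != 3 or points.shape[2] != 3:
        raise ValueError(f"expected shape (H, W, 3), got {points.shape}")
    # Finite differences along image rows (axis 0) and columns (axis 1).
    d_row = np.gradient(points, axis=0)
    d_col = np.gradient(points, axis=1)
    # The normal is perpendicular to both tangent directions.
    normals = np.cross(d_col, d_row)
    length = np.linalg.norm(normals, axis=2, keepdims=True)
    # Degenerate pixels (zero-length cross product) stay as zero vectors.
    return np.divide(normals, length, out=np.zeros_like(normals),
                     where=length > 0)


if __name__ == "__main__":
    # A flat plane at z = 5: every normal should be parallel to the z axis.
    ys, xs = np.mgrid[0:4, 0:5]
    plane = np.stack([xs, ys, np.full_like(xs, 5)], axis=2)
    print(normals_from_point_map(plane)[0, 0])
```

The sign of the normals depends on the axis conventions of the point map, so they may need flipping to face the camera. A malformed input raises a Python ValueError here rather than tripping an assertion in compiled code.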
@georgegu1997 thanks a lot for your response. I tried it with LM-O, and here are the results. I still get the assertion failure from Eigen3.
(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python test.py --model_name pn2 --dataset_root ./data/lmo/matches_data_test/ --dataset_name lmo --dataset HSVD_diff_uv_norm --no_valid_proj --no_valid_depth --loss_cutoff log --exp_name final --resume_path ./ckpts/final_lmo.ckpt
exp_name: pn2_HSVD_diff_uv_norm_final
args.inconst_ratio_th = 100
Initializing lmo dataset from ./data/lmo/matches_data_test/
Using BOP dataset format. Total dataset: 1445
Using PointNet Dataset
Initializating test dataset ['u', 'v', 'H_diff', 'S_diff', 'V_diff', 'D_diff', 'norm_cos']
dim_agg: 0 dim_point: 7
############ BOP test set: 1 ##############
No loss on the best hypotheses
PointNet2: extra_bottleneck_dim = 0
mask: [] xyz: [0, 1] points: [2, 3, 4, 5, 6]
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
Traceback (most recent call last):
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/queues.py", line 113, in get
if not self._poll(timeout):
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 40319) is killed by signal: Aborted.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/mona/zephyr/python/zephyr/test.py", line 59, in <module>
main(args)
File "/home/mona/zephyr/python/zephyr/test.py", line 53, in main
trainer.test(model, boptest_loader)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1064, in test
self.fit(model)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
self.dp_train(model)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
self.run_pretrain_routine(model)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 982, in run_pretrain_routine
self.run_evaluation(test_mode=True)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 377, in run_evaluation
eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
for batch_idx, batch in enumerate(dataloader):
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
idx, data = self._get_data()
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
success, data = self._try_get_data()
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 40319) exited unexpectedly
Testing: 0%| | 0/1445 [00:00<?, ?it/s]
I will report back the notebook result shortly.
Hi @monajalal
Thanks for the description! I reproduced the results in the notebook successfully just now and I did not encounter the issue you described.
I followed the procedure to set up the environment as described here. The one change I made was installing the newest version of PyTorch, since I ran it on an RTX 3090:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
I think this is an issue with Eigen or another non-Python dependency, possibly caused by a version discrepancy. Could you try setting up the environment using these instructions?
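To compare Eigen versions between two machines, one option is to read the version macros from the installed headers. The sketch below assumes the system Eigen sits under /usr/include/eigen3 (the path that appears in the assertion message) and that Macros.h defines EIGEN_WORLD_VERSION, EIGEN_MAJOR_VERSION and EIGEN_MINOR_VERSION, as Eigen 3.x releases do; other versions may keep them elsewhere. `parse_eigen_version` is a hypothetical helper written for this example.

```python
import re
from pathlib import Path

# Directory taken from the assertion message in the log; adjust if needed.
MACROS_H = Path("/usr/include/eigen3/Eigen/src/Core/util/Macros.h")


def parse_eigen_version(header_text):
    """Return 'world.major.minor' from Eigen's Macros.h text, or None."""
    parts = []
    for name in ("WORLD", "MAJOR", "MINOR"):
        match = re.search(
            rf"^\s*#\s*define\s+EIGEN_{name}_VERSION\s+(\d+)",
            header_text,
            re.MULTILINE,
        )
        if match is None:
            return None  # macro missing: not the header layout assumed here
        parts.append(match.group(1))
    return ".".join(parts)


if __name__ == "__main__":
    if MACROS_H.exists():
        print("Eigen", parse_eigen_version(MACROS_H.read_text()))
    else:
        print("Eigen header not found at", MACROS_H)
```

Running it on both the working and the failing machine would show whether the compiled module was built against different Eigen headers.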
I am having a problem with DataLoader. Could you please help me with the fix?