I am trying to reproduce the PointPillar model on the nuScenes trainval dataset; however, I am getting the following error:
python3 -m torch.distributed.launch --nproc_per_node=2 train.py --launcher pytorch --cfg_file cfgs/nuscenes_models/cbgs_pp_multihead.yaml
...
...
2020-12-01 19:35:20,097 INFO **********************Start training nuscenes_models/cbgs_pp_multihead(default)**********************
epochs: 0%| | 0/20 [20:22<?, ?it/s, loss=0.86, lr=0.000101]
/home/nsirons/OpenPCDet_fork/pcdet/datasets/nuscenes/nuscenes_dataset.py:78: RuntimeWarning: invalid value encountered in less | 1370/10299 [13:47<1:29:13, 1.67it/s, total_it=1370]
mask = ~((np.abs(points[:, 0]) < center_radius) & (np.abs(points[:, 1]) < center_radius))
/home/nsirons/OpenPCDet_fork/pcdet/datasets/augmentor/augmentor_utils.py:76: RuntimeWarning: overflow encountered in multiply
points[:, :3] *= noise_scale
epochs: 0%| | 0/20 [20:31<?, ?it/s, loss=nan, lr=0.000101]WARNING:root:NaN or Inf found in input tensor. | 1386/10299 [13:57<1:27:45, 1.69it/s, total_it=1386]
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
I tried training the same model on nuscenes-mini, and no problems occurred.
Moreover, I ran an eval on the pre-trained model and got the same results as in the README.
So I am not sure which part exactly is broken.
Python packages versions:
Update: After further investigation, I have noticed that the source of the error is the following line: https://github.com/open-mmlab/OpenPCDet/blob/a7cf5368d9cbc3969b4613c9e61ba4dcaf217517/pcdet/datasets/nuscenes/nuscenes_dataset.py#L82
At some point, np.fromfile() returns an array whose bottom part (roughly the last 10k rows) is filled with incorrect values, including np.nan. I opened each binary file separately with the same line of code and the same Python setup (outside of training), and everything was fine, so this problem happens ONLY during training. Updating NumPy from 1.18.4 to 1.19.4 didn't help. I also tried re-reading the bin file whenever the first read contained np.nan: sometimes the second read comes back without np.nan, but sometimes it still contains np.nan, so reading multiple times is not a good option. An interesting observation: when the same file fails twice, it fails (deviates from the original array) at the same row. It happens whether I use 1 GPU or multiple GPUs.
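For reference, here is a minimal sketch of the kind of check described above: read a file, find the first row containing NaN/Inf, and compare two reads of the same file to see at which row they start to deviate. It assumes the sweep files are flat float32 arrays with 5 values per point (an assumption; adjust `NUM_FEATURES` if the layout differs). The helper names are made up for illustration and are not part of OpenPCDet.

```python
import numpy as np

NUM_FEATURES = 5  # assumption: five float32 values per lidar point


def read_points(path, num_features=NUM_FEATURES):
    """Read a flat float32 .bin file and reshape it to (N, num_features)."""
    data = np.fromfile(str(path), dtype=np.float32)
    return data.reshape(-1, num_features)


def first_bad_row(points):
    """Return the index of the first row holding NaN/Inf, or None if all rows are finite."""
    bad = ~np.isfinite(points).all(axis=1)
    return int(np.argmax(bad)) if bad.any() else None


def first_mismatch_row(a, b):
    """Return the first row where two reads differ, or None if they are identical."""
    n = min(len(a), len(b))
    # Compare raw bits so that a NaN in both reads still counts as equal.
    diff = (a[:n].view(np.uint32) != b[:n].view(np.uint32)).any(axis=1)
    if diff.any():
        return int(np.argmax(diff))
    # The common prefix matches; the reads differ only if their lengths do.
    return n if len(a) != len(b) else None
```

Calling `first_bad_row(read_points(path))` inside the dataset's loading code and logging the path and row index whenever the result is not `None` would show which files are affected and whether the corruption always starts at the same row. Calling `first_mismatch_row` on a suspicious read and a known-good read of the same file gives the row where they diverge.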