np.fromfile returns invalid values on nuscenes trainval dataset

nsirons commented 3 years ago

I am trying to reproduce the PointPillar model on the nuScenes trainval dataset. However, training emits the following warnings, after which the loss becomes NaN:

python3 -m torch.distributed.launch --nproc_per_node=2 train.py --launcher pytorch --cfg_file cfgs/nuscenes_models/cbgs_pp_multihead.yaml
...
...
2020-12-01 19:35:20,097   INFO  **********************Start training nuscenes_models/cbgs_pp_multihead(default)**********************
epochs:   0%| | 0/20 [20:22<?, ?it/s, loss=0.86, lr=0.000101]
| 1370/10299 [13:47<1:29:13,  1.67it/s, total_it=1370]
/home/nsirons/OpenPCDet_fork/pcdet/datasets/nuscenes/nuscenes_dataset.py:78: RuntimeWarning: invalid value encountered in less
  mask = ~((np.abs(points[:, 0]) < center_radius) & (np.abs(points[:, 1]) < center_radius))
/home/nsirons/OpenPCDet_fork/pcdet/datasets/augmentor/augmentor_utils.py:76: RuntimeWarning: overflow encountered in multiply
  points[:, :3] *= noise_scale
epochs:   0%| | 0/20 [20:31<?, ?it/s, loss=nan, lr=0.000101]
| 1386/10299 [13:57<1:27:45,  1.69it/s, total_it=1386]
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
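
For reference, here is a minimal sketch of why corrupted rows get past the filter at nuscenes_dataset.py:78. Any comparison with NaN evaluates to False, so the negated mask keeps those points and they reach the augmentation step. The array contents and the center_radius value below are made up for illustration and are not taken from the config.

```python
import numpy as np

center_radius = 1.0  # illustrative value, not taken from the config
points = np.array([
    [0.2, 0.1, 0.0],        # close to the sensor, should be removed
    [5.0, 3.0, 0.5],        # normal point, should be kept
    [np.nan, np.nan, 0.0],  # corrupted row
], dtype=np.float32)

# Same expression as in the warning above; errstate only silences the RuntimeWarning
with np.errstate(invalid="ignore"):
    mask = ~((np.abs(points[:, 0]) < center_radius) & (np.abs(points[:, 1]) < center_radius))

kept = points[mask]  # still contains the NaN row
```

So the filter itself is not the source of the NaNs. It only passes along whatever the loader returned.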

Training the same model on nuscenes-mini works without any problems.

Moreover, I ran an evaluation of the pre-trained model and got the same results as in the README.

python3 -m torch.distributed.launch --nproc_per_node=2 test.py --launcher pytorch --cfg_file cfgs/nuscenes_models/cbgs_pp_multihead.yaml  --ckpt pp_multihead_nds5823_updated.pth

So I am not sure which part exactly is broken.

Python package versions:

nuscenes-devkit        1.0.5
pcdet                  0.3.0+22c78cc
spconv                 1.2.1
torch                  1.5.0+cu101

Update: After further investigation, I noticed that the source of the error is the following line: https://github.com/open-mmlab/OpenPCDet/blob/a7cf5368d9cbc3969b4613c9e61ba4dcaf217517/pcdet/datasets/nuscenes/nuscenes_dataset.py#L82

At some point, np.fromfile() returns an array whose bottom part (roughly the last 10k rows) is filled with incorrect values, including np.nan. I tried opening each binary file separately with the same line of code and the same Python setup (outside of training), and everything was fine, so the problem occurs ONLY during training. Updating NumPy from 1.18.4 to 1.19.4 did not help. I also tried re-reading the .bin file whenever the first read contained np.nan: the second read is sometimes clean, but sometimes it still contains np.nan, so reading multiple times is not a reliable workaround. An interesting observation: when the same file fails twice, it fails (deviates from the original array) at the same row.
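
A check along these lines can locate the first bad row during training. It is only a sketch: load_points_checked is a hypothetical helper, not part of OpenPCDet, and it assumes the .bin files hold float32 values with 5 features per point. Note that np.isfinite only catches NaN/Inf, not values that are wrong but finite.

```python
import os
import numpy as np

NUM_FEATURES = 5  # assumption: 5 float32 values per point

def load_points_checked(path):
    """Read a lidar .bin file and report the first non-finite row, if any."""
    raw = np.fromfile(str(path), dtype=np.float32)

    # Compare the number of values read against the size of the file on disk
    expected = os.path.getsize(path) // np.dtype(np.float32).itemsize
    if raw.size != expected:
        raise IOError(f"{path}: read {raw.size} floats, expected {expected}")

    points = raw.reshape(-1, NUM_FEATURES)

    # Indices of rows that contain at least one NaN or Inf
    bad_rows = np.flatnonzero(~np.isfinite(points).all(axis=1))
    first_bad = int(bad_rows[0]) if bad_rows.size else None
    return points, first_bad
```

Logging the path together with first_bad makes it easy to confirm whether a given file always deviates at the same row.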

It happens whether I use one GPU or multiple GPUs.
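
To narrow down whether np.fromfile() itself or the underlying file read is at fault, one option is to read the raw bytes with plain file I/O, hash them, and decode with np.frombuffer. This is a diagnostic sketch, not a confirmed fix: read_points_via_bytes is a hypothetical helper, and the 5-features-per-point float32 layout is an assumption.

```python
import hashlib
import numpy as np

def read_points_via_bytes(path, num_features=5):
    """Read the whole file with plain file I/O, then decode it with np.frombuffer."""
    with open(path, "rb") as f:
        data = f.read()

    # Digest of the exact bytes that were read, for comparison across reads
    digest = hashlib.md5(data).hexdigest()

    # frombuffer returns a read-only view; copy so in-place augmentation still works
    points = np.frombuffer(data, dtype=np.float32).reshape(-1, num_features).copy()
    return points, digest
```

If the digest of a file read during training differs from the digest of the same file read offline, the bytes are already wrong before NumPy decodes them, which would point at storage or the file system rather than at np.fromfile().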

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 years ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

np.fromfile returns invalid values on nuscenes trainval dataset #393