ucla-mobility / V2V4Real

[CVPR2023 Highlight] The official codebase for paper "V2V4Real: A large-scale real-world dataset for Vehicle-to-Vehicle Cooperative Perception"

IndexError: list index out of range, during training #13

Closed eddyhkchiu closed 1 year ago

eddyhkchiu commented 1 year ago

Hi,

So far I am able to run the test command and reproduce the object detection AP numbers using the pre-trained models. But I get an IndexError related to the data loader when training the model from scratch.

When running the distributed training command, I got an IndexError immediately at the first step of training:

| distributed init (rank 1): env://
| distributed init (rank 2): env://
| distributed init (rank 0): env://
| distributed init (rank 3): env://
-----------------Dataset Building------------------
Traceback (most recent call last):
  File "opencood/tools/train.py", line 207, in <module>
    main()
  File "opencood/tools/train.py", line 49, in main
    shuffle=False)
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/distributed.py", line 91, in __init__
    self.num_samples = math.ceil(len(self.dataset) / self.num_replicas)  # type: ignore[arg-type]
  File "/home/eddy/V2V4Real/opencood/data_utils/datasets/basedataset.py", line 104, in __len__
    return self.len_record[-1]
IndexError: list index out of range

When running the single-GPU training command, training completes one epoch. After that, I got a similar IndexError when the validation data loader is used right after that first epoch:

Training start
learning rate 0.0002000
[epoch 0][1776/1776], || Loss: 2.3861 || Conf Loss: 0.4103 || Loc Loss: 1.9757: 100%|█| 1776/1776 [3
Traceback (most recent call last):
  File "opencood/tools/train.py", line 207, in <module>
    main()
  File "opencood/tools/train.py", line 189, in main
    for i, batch_data in enumerate(val_loader):
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1086, in __init__
    self._reset(loader, first_iter=True)
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1119, in _reset
    self._try_put_index()
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1353, in _try_put_index
    index = self._next_index()
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 642, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 237, in __iter__
    sampler_iter = iter(self.sampler)
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 76, in __iter__
    return iter(range(len(self.data_source)))
  File "/home/eddy/V2V4Real/opencood/data_utils/datasets/basedataset.py", line 104, in __len__
    return self.len_record[-1]
IndexError: list index out of range
[epoch 0][1776/1776], || Loss: 2.3861 || Conf Loss: 0.4103 || Loc Loss: 1.9757: 100%|█| 1776/1776 [3

I wonder whether there are some problems related to the data loader or the reinitialize() function? Thanks!

DerrickXuNu commented 1 year ago

Hi, I tested the code on several of my workstations and it works fine. Are you able to run vis_data_sequence.py on both the train and val datasets?

eddyhkchiu commented 1 year ago

Hi, yes, I am able to run vis_data_sequence.py on both the train and val datasets. I can see the point clouds and bounding boxes in Open3D. The train set has 7105 samples and the val set has 1994 samples. vis_data_sequence.py does not terminate by itself because of the while True infinite loop at https://github.com/ucla-mobility/V2V4Real/blob/main/opencood/visualization/vis_utils.py#L695. So I think my datasets and the data paths in the related config YAML files are correct. Thanks!

onepeachbiubiubiu commented 1 year ago

In the config YAML file, the value of "validate_dir" should be "test" instead of "validate".
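
A quick pre-flight check can catch this kind of path misconfiguration before training starts. The sketch below is not part of V2V4Real: check_dataset_dirs is a hypothetical helper, and it assumes the config has already been loaded into a dict whose directory entries ("validate_dir" from this thread, plus "root_dir", which is an assumed key name) should point at folders containing one sub-folder per scenario.

```python
import os


def check_dataset_dirs(params, keys=("root_dir", "validate_dir")):
    """Return a list of human-readable problems found in the dataset paths.

    `params` is the already-loaded config dict. The key names and the
    "one sub-folder per scenario" layout are assumptions for illustration.
    """
    problems = []
    for key in keys:
        path = params.get(key)
        if path is None:
            problems.append(f"{key}: not set in config")
        elif not os.path.isdir(path):
            problems.append(f"{key}: {path!r} is not a directory")
        else:
            # Count only sub-directories; stray files do not count as scenarios.
            scenarios = [
                name for name in os.listdir(path)
                if os.path.isdir(os.path.join(path, name))
            ]
            if not scenarios:
                problems.append(f"{key}: {path!r} contains no scenario folders")
    return problems
```

Calling check_dataset_dirs(params) right after loading the YAML and aborting when the returned list is non-empty would likely surface a wrong "validate_dir" at startup rather than after the first epoch.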

eddyhkchiu commented 1 year ago

Hi onepeachbiubiubiu, thank you very much! This solves my problem. My training config had the incorrect validate_dir setting, as you pointed out.
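
For readers hitting the same traceback, the failure mode can be reproduced in isolation. The sketch below is a simplified stand-in, not the actual basedataset.py: only the "return self.len_record[-1]" line comes from the traceback above, while the class name TinyDataset and the way len_record is filled (a running total of files per scenario folder) are assumptions. If the configured directory exists but contains no scenario folders, the list stays empty and len() raises the same IndexError.

```python
import os


class TinyDataset:
    """Simplified stand-in for a dataset that tracks cumulative frame counts."""

    def __init__(self, data_dir):
        # Assumed layout: data_dir/<scenario>/<frame files>.
        self.len_record = []
        scenarios = sorted(
            name for name in os.listdir(data_dir)
            if os.path.isdir(os.path.join(data_dir, name))
        )
        total = 0
        for name in scenarios:
            # Append the running total of frames seen so far.
            total += len(os.listdir(os.path.join(data_dir, name)))
            self.len_record.append(total)

    def __len__(self):
        # Same expression as in the traceback; raises IndexError
        # when no scenario folder was found and len_record is empty.
        return self.len_record[-1]
```

Under these assumptions, a directory with no scenario sub-folders builds without complaint and only fails later, when a sampler or data loader first asks for the dataset length. That would be consistent with the error appearing at sampler construction in the distributed run and at the first validation pass in the single-GPU run.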

YuJiXYZ commented 1 year ago

Hi eddyhkchiu, I hope I'm not disturbing you. How did you solve the Open3D visualization problem mentioned in question 10? Looking forward to your reply. Thank you~

eddyhkchiu commented 1 year ago

Hi YuJiXYZ,

Originally I was using my Mac to SSH into a Google Cloud Compute Engine instance, with XQuartz for visualization. This approach does not work for Open3D, which requires OpenGL 4.1, whereas the Mac only supports OpenGL 2.1 in general.

My solution was to set up a remote desktop on the Google Cloud Compute Engine instance by following https://cloud.google.com/architecture/chrome-desktop-remote-on-compute-engine .

Hope this helps.

YuJiXYZ commented 1 year ago

Ok, thank you. (●'◡'●)

hongzhitao commented 1 month ago

Hi @eddyhkchiu, when I run the visualization program, I encounter the "list index out of range" issue. I am using the OPV2V-format V2V4Real dataset, and I have also changed the path in the YAML file to the absolute path of the test dataset. The error occurs at this line of code: https://github.com/ucla-mobility/V2V4Real/blob/main/opencood/visualization/vis_utils.py#L695. During debugging, I found that the variable aabbs is an empty list at this point in the code. I am not sure why this happens. I hope to get your help!