motional / nuplan-devkit

The devkit of the nuPlan dataset.
https://www.nuplan.org

Failed to compute features for scenario token | EOFError: Ran out of input #319

Open tinkei opened 1 year ago

tinkei commented 1 year ago

Describe the bug

Training is interrupted by the error Failed to compute features for scenario token XXX in log XXX Error: Ran out of input. Looking at the closed issues, I thought this had already been fixed back in v0.4.

Setup

Please share your setup with us, the more detail the better. For example, type of machine (laptop, cluster instance), linux distribution, no. of cpu, no. of gpus, RAM, VRAM, cuda version, conda environment, nuplan-devkit release version.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Run command python nuplan/planning/script/run_training.py ...
  2. Interrupt training
  3. Run command python nuplan/planning/script/run_training.py ... again, specifying the same cache directory
  4. Sometimes this error appears (rather annoying when it's towards the end of an epoch...)

Stack Trace

(nuplan) tk@tk-ubuntu:~/nuplan/nuplan-devkit$ python nuplan/planning/script/run_training.py     group=/home/tk/nuplan/my_experiments/experiment_v014_resnet_more     cache.cache_path=/home/tk/nuplan/my_experiments/cache     experiment_name=training_raster_experiment     job_name=train_default_raster     py_func=train     +training=training_raster_model     scenario_builder=nuplan_mini     scenario_filter.limit_total_scenarios=32000     lightning.trainer.params.accelerator=ddp     lightning.trainer.params.max_epochs=4     lightning.trainer.checkpoint.resume_training=true     data_loader.params.batch_size=80     data_loader.params.num_workers=8     logger_level=warning     optimizer.lr=8e-5     lr_scheduler=multistep_lr     lr_scheduler.milestones=[1,2,4,8,12,16]     warm_up_lr_scheduler=linear_warm_up     worker.threads_per_node=8
Global seed set to 0
2023-05-26 17:33:46,871 INFO worker.py:1625 -- Started a local Ray instance.
Ray objects: 100%|████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.01it/s]
/train_default_raster/2023.05.26.17.15.46/checkpoints/epoch=0.ckpt
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type        | Params
--------------------------------------
0 | model | RasterModel | 25.3 M
--------------------------------------
17.0 M    Trainable params
8.3 M     Non-trainable params
25.3 M    Total params
101.105   Total estimated model params size (MB)
/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/callback_hook.py:307: LightningDeprecationWarning: `Callback.on_load_checkpoint` signature has changed in v1.3. `trainer` and `pl_module` parameters have been added. Support for the old signature will be removed in v1.5
  rank_zero_deprecation(
Restored states from the checkpoint file at /home/tk/nuplan/my_experiments/experiment_v014_resnet_more/training_raster_experiment/train_default_raster/2023.05.26.17.15.46/checkpoints/epoch=0.ckpt
2023-05-26 17:34:06,712 ERROR {/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py:104}  Failed to compute features for scenario token 64217a7437a55598 in log 2021.08.17.17.17.01_veh-45_02314_02798
Error: Ran out of input
Epoch 1:   0%|                                                                                  | 0/326 [00:12<?, ?it/s]
Traceback (most recent call last):
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py", line 93, in compute_features
    all_features, all_feature_cache_metadata = self._compute_all_features(scenario, self._feature_builders)
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py", line 122, in _compute_all_features
    feature, feature_metadata_entry = compute_or_load_feature(
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/utils/utils_cache.py", line 83, in compute_or_load_feature
    feature = storing_mechanism.load_computed_feature_from_folder(file_name, builder.get_feature_type())
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/utils/feature_cache.py", line 88, in load_computed_feature_from_folder
    data = pickle.load(f)
EOFError: Ran out of input
/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Epoch 1:   1%|▋                                                       | 4/326 [00:15<20:47,  3.87s/it, loss=383, v_num=]
Error executing job with overrides: ['group=/home/tk/nuplan/my_experiments/experiment_v014_resnet_more', 'cache.cache_path=/home/tk/nuplan/my_experiments/cache', 'experiment_name=training_raster_experiment', 'job_name=train_default_raster', 'py_func=train', '+training=training_raster_model', 'scenario_builder=nuplan_mini', 'scenario_filter.limit_total_scenarios=32000', 'lightning.trainer.params.accelerator=ddp', 'lightning.trainer.params.max_epochs=4', 'lightning.trainer.checkpoint.resume_training=true', 'data_loader.params.batch_size=80', 'data_loader.params.num_workers=8', 'logger_level=warning', 'optimizer.lr=8e-5', 'lr_scheduler=multistep_lr', 'lr_scheduler.milestones=[1,2,4,8,12,16]', 'warm_up_lr_scheduler=linear_warm_up', 'worker.threads_per_node=8']
Traceback (most recent call last):
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/script/run_training.py", line 64, in main
    engine.trainer.fit(model=engine.model, datamodule=engine.datamodule)
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
    self.dispatch()
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
    self.accelerator.start_training(self)
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
    return self.run_train()
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
    self.train_loop.run_training_epoch()
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 491, in run_training_epoch
    for batch_idx, (batch, is_last_batch) in train_dataloader:
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/profiler/profilers.py", line 112, in profile_iterable
    value = next(iterator)
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py", line 534, in prefetch_iterator
    for val in it:
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py", line 464, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py", line 478, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/utilities/apply_func.py", line 85, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
    return self._process_data(data)
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 5.
Original Traceback (most recent call last):
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py", line 93, in compute_features
    all_features, all_feature_cache_metadata = self._compute_all_features(scenario, self._feature_builders)
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py", line 122, in _compute_all_features
    feature, feature_metadata_entry = compute_or_load_feature(
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/utils/utils_cache.py", line 83, in compute_or_load_feature
    feature = storing_mechanism.load_computed_feature_from_folder(file_name, builder.get_feature_type())
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/utils/feature_cache.py", line 88, in load_computed_feature_from_folder
    data = pickle.load(f)
EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/data_loader/scenario_dataset.py", line 48, in __getitem__
    features, targets, _ = self._feature_preprocessor.compute_features(scenario)
  File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py", line 106, in compute_features
    raise RuntimeError(msg)
RuntimeError: Failed to compute features for scenario token 64217a7437a55598 in log 2021.08.17.17.17.01_veh-45_02314_02798
Error: Ran out of input
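For reference, pickle.load raises exactly this EOFError ("Ran out of input") when it is handed an empty file, which is consistent with a cache entry whose write never completed. A minimal standalone reproduction, independent of the devkit:

```python
# Unpickling an empty (zero-byte) file, e.g. a cache entry whose write was
# interrupted, raises EOFError: Ran out of input.
import pickle
import tempfile

with tempfile.NamedTemporaryFile(suffix=".bin") as empty_file:
    try:
        with open(empty_file.name, "rb") as f:
            pickle.load(f)
    except EOFError as e:
        print(e)  # prints: Ran out of input
```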


Additional context

It appears that this error arises after I keyboard-interrupt (Ctrl+C) a previous run of python nuplan/planning/script/run_training.py ... and then start a new experiment that shares the same cache directory.
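A minimal sketch for locating the truncated cache entries an interrupted run can leave behind, assuming they show up as zero-byte files under cache.cache_path; the helper name find_empty_cache_files and the hard-coded path are illustrative, not devkit API:

```python
# Hypothetical helper (not part of the devkit): list zero-byte files left in the
# feature cache after an interrupted run, so they can be deleted before resuming.
from pathlib import Path

def find_empty_cache_files(cache_root: str) -> list[Path]:
    """Return all zero-byte files under the cache directory; these are likely truncated features."""
    return [p for p in Path(cache_root).rglob("*") if p.is_file() and p.stat().st_size == 0]

if __name__ == "__main__":
    for path in find_empty_cache_files("/home/tk/nuplan/my_experiments/cache"):
        print(path)  # delete these before rerunning training with the same cache directory
```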

michael-motional commented 1 year ago

Hey @tinkei, thanks for reporting. The issue isn't fully resolved, but I'll make some changes to make it more robust. Two questions:

1) To sanity-check, are you using the pre-cached features? You should have arguments like

cache.cache_path={CACHE_PATH} cache.use_cache_without_dataset=True

2) I think what's happening is that a feature cache file is being created but isn't fully written before training stops. We could fix this by writing to a temp file and then moving the result into place once the feature has been computed and written (a sketch of this is below). Can you confirm this is the case by checking whether an empty file exists for the feature that fails?
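A minimal sketch of that temp-file-then-rename idea, assuming a plain pickle payload; save_feature_atomically and its arguments are illustrative names, not the devkit's actual feature-cache API:

```python
import os
import pickle
import tempfile
from pathlib import Path

def save_feature_atomically(feature_data, target_path: Path) -> None:
    """Pickle to a temp file in the target directory, then atomically move it into place."""
    target_path.parent.mkdir(parents=True, exist_ok=True)
    # Keep the temp file on the same filesystem so os.replace is a single atomic rename.
    fd, tmp_name = tempfile.mkstemp(dir=target_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(feature_data, f)
        os.replace(tmp_name, target_path)  # readers never observe a partially written file
    except BaseException:
        # If the write was interrupted, remove the partial temp file instead of leaving it behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With this pattern, an interrupted run at worst leaves a stray .tmp file in the cache, and a later run that loads only the final file names never sees a truncated pickle.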

I'll also try to replicate. Thanks!

atakandag commented 1 year ago

Hello, is there an update on this? The same problem happens if you interrupt caching and then rerun it with the same cache directory.