`KeyError` in `epoch_best_postprocessing_or_default()`

cifkao commented 3 months ago
I'm trying to run the benchmark, but it crashes on the `dcase2016_task2` task. After training for what seems like 229 epochs, I get a `KeyError` at the prediction stage when the code tries to access the postprocessing parameters for epoch 240:
predict - dcase2016_task2 - 2024-08-01 09:19:18,874 - 874 -  result: [0.1666666716337204, 29, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_di
m": 1024, "hidden_layers": 2, "hidden_norm": "<class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_uniform_ at 0x7fd89f389830>", "lr": 0.0032, "max_epochs": 500, "norm_after_activation": false, "optim":
 "<class 'torch.optim.adam.Adam'>", "patience": 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.19771863520145416, 39, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<c
lass 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_normal_ at 0x7fd89f3898c0>", "lr": 0.00032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience":
20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.19354838132858276, 59, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<c
lass 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_uniform_ at 0x7fd89f389830>", "lr": 0.00032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience":
 20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.1901140660047531, 269, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 1, "hidden_norm": "<c
lass 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_normal_ at 0x7fd89f3898c0>", "lr": 0.00032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience":
20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.18285714089870453, 139, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 1, "hidden_norm": "<
class 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_normal_ at 0x7fd89f3898c0>", "lr": 0.0001, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience":
20}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.1807909607887268, 69, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 1, "hidden_norm": "<cl
ass 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_normal_ at 0x7fd89f3898c0>", "lr": 0.001, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20}
, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.1732580065727234, 29, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<cl
ass 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_uniform_ at 0x7fd89f389830>", "lr": 0.001, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 20
}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.1666666716337204, 29, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<cl
ass 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_uniform_ at 0x7fd89f389830>", "lr": 0.0032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 2
0}, [["median_filter_ms", 250], ["min_duration", 125]]]
Grid Point Summary: [0.16030533611774445, 19, {"batch_size": 1024, "check_val_every_n_epoch": 10, "dropout": 0.1, "embedding_norm": "<class 'torch.nn.modules.linear.Identity'>", "hidden_dim": 1024, "hidden_layers": 2, "hidden_norm": "<c
lass 'torch.nn.modules.batchnorm.BatchNorm1d'>", "initialization": "<function xavier_normal_ at 0x7fd89f3898c0>", "lr": 0.0032, "max_epochs": 500, "norm_after_activation": false, "optim": "<class 'torch.optim.adam.Adam'>", "patience": 2
0}, [["median_filter_ms", 250], ["min_duration", 125]]]
grid: 8it [1:59:58, 899.87s/it]
predict - dcase2016_task2 - 2024-08-01 09:19:18,874 - 874 - Best Grid Point Validation Score: 0.19771863520145416  Grid Point HyperParams: {'batch_size': 1024, 'check_val_every_n_epoch': 10, 'dropout': 0.1, 'embedding_norm': <class 'tor
ch.nn.modules.linear.Identity'>, 'hidden_dim': 1024, 'hidden_layers': 2, 'hidden_norm': <class 'torch.nn.modules.batchnorm.BatchNorm1d'>, 'initialization': <function xavier_normal_ at 0x7fd89f3898c0>, 'lr': 0.00032, 'max_epochs': 500, '
norm_after_activation': False, 'optim': <class 'torch.optim.adam.Adam'>, 'patience': 20}
split: 0it [00:00, ?it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 84000/84000 [00:00<00:00, 140181.00it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 84000/84000 [00:01<00:00, 59876.66it/s]
Getting embeddings for split ['test'], which has 84000 instances.███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏          | 79197/84000 [00:01<00:00, 60485.99it/s]
You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, r
ead https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Restoring states from the checkpoint path at logs/embeddings/mymodel/dcase2016_task2-hear2021-full/lightning_logs/version_4/checkpoints/epoch=39-step=10320.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [3]
Loaded model weights from checkpoint at logs/embeddings/mymodel/dcase2016_task2-hear2021-full/lightning_logs/version_4/checkpoints/epoch=39-step=10320.ckpt
/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:229: PossibleUserWarning: The dataloader, test_dataloader 0, does not have many workers which may be a bottleneck.
Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  category=PossibleUserWarning,
  0%|                                                                                                                                                                                                               | 0/6 [2:00:03<?, ?it/s]
Traceback (most recent call last):
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/runner.py", line 181, in <module>
    runner()
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/runner.py", line 148, in runner
    logger=logger,
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/task_predictions.py", line 1411, in task_predictions
    in_memory=in_memory,
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/task_predictions.py", line 1106, in task_predictions_test
    ckpt_path=grid_point.model_path, dataloaders=test_dataloader
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 795, in test
    self, self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1188, in _run_stage
    return self._run_evaluate()
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1228, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
    output = self.on_run_end()
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 180, in on_run_end
    self._evaluation_epoch_end(self._outputs)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 288, in _evaluation_epoch_end
    self.trainer._call_lightning_module_hook(hook_name, output_or_outputs)
  File "/home/ondrej/mambaforge/envs/heareval/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/task_predictions.py", line 305, in test_epoch_end
    self._score_epoch_end("test", outputs)
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/task_predictions.py", line 467, in _score_epoch_end
    postprocessing_cached = self.epoch_best_postprocessing_or_default(epoch)
  File "/home/ondrej/proj/sandbox/heareval/src/heareval/heareval/predictions/task_predictions.py", line 431, in epoch_best_postprocessing_or_default
    return self.epoch_best_postprocessing[epoch]
KeyError: 240
Testing DataLoader 0: 100%|██████████| 83/83 [00:02<00:00, 34.43it/s]
I'm using a conda environment. I have pytorch-lightning==1.9.5, torch==1.13.1 and scikit-learn==1.0.2.
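The last two frames of the traceback reduce to a plain dictionary lookup with a key that was never stored. Below is a minimal sketch of that failure mode. It uses a hypothetical stand-in dict, not the real heareval predictor. The epoch 39 and the postprocessing values are taken from the log above; that the dict only holds entries for epochs at which validation ran is an assumption.

```python
# Hypothetical stand-in for the predictor's epoch_best_postprocessing dict.
# Assumption: it only has entries for epochs at which validation was scored.
epoch_best_postprocessing = {
    39: (("median_filter_ms", 250), ("min_duration", 125)),
}


def lookup(epoch):
    """Mirror the failing line: a direct index with no fallback."""
    return epoch_best_postprocessing[epoch]


def try_lookup(epoch):
    """Return the stored postprocessing, or None when the key is missing."""
    try:
        return lookup(epoch)
    except KeyError:
        return None


print(try_lookup(39))   # the entry stored for epoch 39
print(try_lookup(240))  # None, because nothing was stored for epoch 240
```

Printing `sorted(epoch_best_postprocessing)` at the failing line in the real code would show which epochs actually have entries.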
theMoro commented 3 months ago
I have the same problem. Have you already solved it? :)
theMoro commented 3 months ago
I have now found the problem and a solution to it.
The code is meant to set the `current_epoch` attribute of the PyTorch Lightning `Trainer` on this line: https://github.com/hearbenchmark/hear-eval-kit/blob/855964977238e89dfc76394aa11c37010edb6f20/heareval/predictions/task_predictions.py#L1102
To get the intended result, change that line to `trainer.fit_loop.epoch_progress.current.completed = grid_point.epoch`. This actually changes the value you get when calling `self.current_epoch` in `_score_epoch_end` (line 464).
Another solution would probably be to store the epoch in a new attribute on the trainer and read that attribute wherever the epoch is needed.
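The second suggestion could look roughly like the sketch below. Everything in it is a hypothetical stand-in: `StandInTrainer`, `StandInPredictor`, `best_epoch` and `scoring_epoch` are illustrative names, not part of PyTorch Lightning or hear-eval-kit, and the value 39 plays the role of `grid_point.epoch`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Postprocessing = Tuple[Tuple[str, int], ...]


@dataclass
class StandInTrainer:
    """Hypothetical stand-in for the Lightning Trainer (not the real class)."""
    current_epoch: int = 240          # whatever the trainer reports at test time
    best_epoch: Optional[int] = None  # hypothetical extra attribute, set before testing


@dataclass
class StandInPredictor:
    """Hypothetical stand-in for the module that scores the test split."""
    trainer: StandInTrainer
    epoch_best_postprocessing: Dict[int, Postprocessing] = field(default_factory=dict)

    def scoring_epoch(self) -> int:
        # Prefer the explicitly stored epoch; otherwise use the trainer's counter.
        if self.trainer.best_epoch is not None:
            return self.trainer.best_epoch
        return self.trainer.current_epoch

    def postprocessing_for_test(self) -> Postprocessing:
        return self.epoch_best_postprocessing[self.scoring_epoch()]


trainer = StandInTrainer()
predictor = StandInPredictor(
    trainer,
    {39: (("median_filter_ms", 250), ("min_duration", 125))},
)
trainer.best_epoch = 39  # plays the role of grid_point.epoch
print(predictor.postprocessing_for_test())
```

Keeping the epoch in a dedicated attribute avoids depending on Lightning's internal progress counters, at the cost of touching both the place where the epoch is set and the place where it is read.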
cifkao commented 3 months ago
Thanks @theMoro, that fixed the problem for me!
hearbenchmark / hear-eval-kit

`KeyError` in `epoch_best_postprocessing_or_default()` #432