vocalpy / vak

A neural network framework for researchers studying acoustic communication
https://vak.readthedocs.io
BSD 3-Clause "New" or "Revised" License

BUG: Running vak 1.0.0a1 with device set to CPU crashes #687

Closed NickleDave closed 1 year ago

NickleDave commented 1 year ago

Describe the bug: Running vak train with version 1.0.0a1 and the device set to 'cpu' causes a crash, as described in this forum post: https://forum.vocalpy.org/t/errors-in-training-and-predicting-when-the-newest-version-of-vak-is-installed/71

It produces the following traceback:

$ vak train data/configs/TeenyTweetyNet_train_audio_cbin_annot_notmat.toml
2023-08-09 10:37:15,262 - vak.cli.train - INFO - vak version: 1.0.0a1
2023-08-09 10:37:15,263 - vak.cli.train - INFO - Logging results to ../vak-vocalpy/tests/data_for_tests/generated/results/train/audio_cbin_annot_notmat/TeenyTweetyNet/results_230809_103715
2023-08-09 10:37:15,263 - vak.core.train - INFO - Loading dataset from .csv path: ../vak-vocalpy/tests/data_for_tests/generated/prep/train/audio_cbin_annot_notmat/TeenyTweetyNet/032312_prep_230809_103332.csv
2023-08-09 10:37:15,266 - vak.core.train - INFO - Size of timebin in spectrograms from dataset, in seconds: 0.002
2023-08-09 10:37:15,266 - vak.core.train - INFO - using training dataset from ../vak-vocalpy/tests/data_for_tests/generated/prep/train/audio_cbin_annot_notmat/TeenyTweetyNet/032312_prep_230809_103332.csv
2023-08-09 10:37:15,266 - vak.core.train - INFO - Total duration of training split from dataset (in s): 50.046
2023-08-09 10:37:15,380 - vak.core.train - INFO - number of classes in labelmap: 12
2023-08-09 10:37:15,380 - vak.core.train - INFO - no spect_scaler_path provided, not loading
2023-08-09 10:37:15,380 - vak.core.train - INFO - will normalize spectrograms
2023-08-09 10:37:15,449 - vak.core.train - INFO - Duration of WindowDataset used for training, in seconds: 50.046
2023-08-09 10:37:15,460 - vak.core.train - INFO - Total duration of validation split from dataset (in s): 15.948
2023-08-09 10:37:15,461 - vak.core.train - INFO - will measure error on validation set every 50 steps of training
2023-08-09 10:37:15,467 - vak.core.train - INFO - training TeenyTweetyNet
Traceback (most recent call last):
  File "/home/pimienta/miniconda3/envs/vak-env/bin/vak", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/__main__.py", line 48, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/cli/cli.py", line 49, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/cli/cli.py", line 8, in train
    train(toml_path=toml_path)
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/cli/train.py", line 67, in train
    core.train(
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/core/train.py", line 358, in train
    trainer = get_default_trainer(
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/vak/trainer.py", line 66, in get_default_trainer
    trainer = lightning.Trainer(
              ^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/utilities/argparse.py", line 69, in insert_env_defaults
    return fn(self, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 398, in __init__
    self._accelerator_connector = _AcceleratorConnector(
                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 140, in __init__
    self._check_config_and_set_final_flags(
  File "/home/pimienta/miniconda3/envs/vak-env/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 221, in _check_config_and_set_final_flags
    raise ValueError(
ValueError: You selected an invalid accelerator name: `accelerator=None`. Available names are: auto, cpu, cuda, hpu, ipu, mps, tpu.

As reported in the forum post, this probably affects all CLI commands besides prep (predict, eval, and learncurve), though I did not verify that.
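The crash boils down to the name check Lightning performs on the accelerator argument: only a fixed set of names is accepted, so passing accelerator=None fails. A minimal sketch of that check (names simplified; this is an illustration, not Lightning's actual code):

```python
# Sketch of the validation in Lightning's accelerator connector:
# anything outside the known set of names, including None, is rejected.
VALID_ACCELERATORS = {"auto", "cpu", "cuda", "hpu", "ipu", "mps", "tpu"}

def check_accelerator(accelerator):
    """Mimic the check that raises in the traceback above."""
    if accelerator not in VALID_ACCELERATORS:
        raise ValueError(
            f"You selected an invalid accelerator name: `accelerator={accelerator}`. "
            f"Available names are: {', '.join(sorted(VALID_ACCELERATORS))}."
        )
    return accelerator

print(check_accelerator("cpu"))  # 'cpu' is a valid name

try:
    check_accelerator(None)  # what vak effectively passed
except ValueError as err:
    print(err)
```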

To Reproduce: Steps to reproduce the behavior:

  1. Install vak 1.0.0a1 e.g. with conda
    $ mamba create -n vak-env python vak -c pytorch -c conda-forge
  2. Create a TOML configuration file with a [TRAIN] table that specifies device = 'cpu' (full file is attached)
    [TRAIN]
    model = "TeenyTweetyNet"
    normalize_spectrograms = true
    batch_size = 4
    num_epochs = 2
    val_step = 50
    ckpt_step = 200
    patience = 3
    num_workers = 2
    device = "cpu"
    root_results_dir = "../vak-vocalpy/tests/data_for_tests/generated/results/train/audio_cbin_annot_notmat/TeenyTweetyNet"
    dataset_path = "../vak-vocalpy/tests/data_for_tests/generated/prep/train/audio_cbin_annot_notmat/TeenyTweetyNet/032312_prep_230809_103332.csv"
  3. Run vak train config.toml
  4. See error

Expected behavior: vak train should run without crashing.


Additional context: What's going on here is that get_default_trainer ends up passing accelerator=None to lightning.Trainer, which Lightning rejects as an invalid accelerator name, as the traceback shows.

NickleDave commented 1 year ago

Quoting myself from the forum post:

I need to think a little more about the right way to fix this. People should be able to say they want to use 'cpu' explicitly, so I don't think it should be either 'cuda' or 'auto'. It might be better to just let people specify the 'accelerator' option directly in the config file, although this forces people to learn about Lightning.

From the traceback above I can see that 'cpu' is actually a valid option for accelerator.
So I think I am now actually +1 on just changing device to accelerator and linking to the Lightning docs, e.g. in docstrings.
We should favor being a lightweight, transparent wrapper whenever possible instead of coming up with new conventions.
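For the transition period, one way the rename could be bridged is a small helper that translates the old user-facing device values into valid Lightning accelerator names. The helper name and mapping below are illustrative assumptions, not vak's actual code:

```python
# Hypothetical transition helper: map the old `device` option to a
# valid Lightning `accelerator` name instead of passing None through.
DEVICE_TO_ACCELERATOR = {
    "cpu": "cpu",
    "cuda": "cuda",
    None: "auto",  # no device given: let Lightning pick
}

def device_to_accelerator(device):
    """Translate a legacy `device` value to an `accelerator` name."""
    try:
        return DEVICE_TO_ACCELERATOR[device]
    except KeyError:
        valid = ", ".join(repr(k) for k in DEVICE_TO_ACCELERATOR)
        raise ValueError(
            f"Unknown device {device!r}; expected one of: {valid}"
        ) from None

print(device_to_accelerator("cpu"))  # prints: cpu
print(device_to_accelerator(None))   # prints: auto
```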

JacquelineGoe commented 1 year ago

Thank you for the nice solution, this will be very easy to implement.

NickleDave commented 1 year ago

Thank you @JacquelineGoe for reporting this bug on the forum!
I will be sure to add you as a contributor when we fix it.
Will move this near the top of the to-do list -- we need to make a new alpha release soon anyway

NickleDave commented 1 year ago

I put in a temporary fix for this in #693. I will need to publish a new release to PyPI before the fix lands in the version installed by pip and conda. The long-term fix is described in #691.

NickleDave commented 1 year ago

@all-contributors please add @JacquelineGoe for bug

allcontributors[bot] commented 1 year ago

@NickleDave

I've put up a pull request to add @JacquelineGoe! :tada:

NickleDave commented 6 months ago

Fixed by #752