athenasyarifa closed this issue 9 months ago
Hi @athenasyarifa, I'm sorry that vak is crashing here.
Thank you for the detailed error report. Based on what I see in the console output and the config file you provided, I don't see a reason to think this is a beginner issue.
My guess is that one of the spectrograms is simply too big to fit on the GPU, and that causes the out-of-memory (OOM) error.
The quickest way to rule this out might be to make the files smaller and see if we still get the crash. I know you said you tried that already. Just to follow up:
- Make a copy of the dataset.
- Run the script below with DATASET_DIR at the top changed to point to the copy you make. What it does is drop all the files from the dataset csv that have a duration >= the file that causes the crash. Since they are no longer in the csv, vak won't use them when you run vak predict -- it iterates through the rows of that csv.
DATASET_DIR = pathlib.Path(
'tests/data_for_tests/generated/prep/predict/audio_cbin_annot_notmat/TweetyNet/032412-vak-frame-classification-dataset-generated-231010_165729/'
)
FILE_NUMBER_THAT_CAUSES_CRASH = 5
metadata = vak.datasets.frame_classification.Metadata.from_dataset_path(DATASET_DIR)
dataset_csv_path = DATASET_DIR / metadata.dataset_csv_filename
df = pd.read_csv(dataset_csv_path)
max_dur = df.loc[FILE_NUMBER_THAT_CAUSES_CRASH, 'duration']
new_df = df[df.duration < max_dur]
assert len(new_df) < len(df) # make sure that worked
assert new_df['duration'].max() < max_dur # make extra extra sure, because we are scientists
- Change dataset_path in your config file to point to the copy of the dataset where you changed the csv with the script, and then see if you still get an OOM error. If you don't get the OOM error then, we can confirm that the issue is file size.
Assuming that is the problem, there's a couple things we could try:
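- make clips of the audio, e.g. in Raven or with a Python script -- I can help with that if we need to
- use different spectrogram parameters to make the spectrogram smaller, e.g. by setting limits on the frequencies using the freq_cutoffs option: https://vak.readthedocs.io/en/latest/reference/config.html#vak.config.spect_params.SpectParamsConfig.freq_cutoffs
  so if you knew that your sound of interest is always between 500-12000 Hz you could do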
[SPECT_PARAMS]
fft_size = 512
step_size = 64
freq_cutoffs = [500, 12000]
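To get a rough sense of how much freq_cutoffs can shrink the spectrogram, here is a back-of-the-envelope sketch. It assumes a 44.1 kHz sampling rate (not stated in this thread) and a plain one-sided FFT with fft_size / 2 + 1 frequency bins. How vak applies the cutoffs internally (for example, whether the edges are inclusive) is also an assumption here, so treat the result as a ballpark only.

```python
def count_freq_bins(fft_size, sample_rate, freq_cutoffs=None):
    """Count the bins of a one-sided FFT, optionally only those within cutoffs."""
    n_bins = fft_size // 2 + 1
    if freq_cutoffs is None:
        return n_bins
    low, high = freq_cutoffs
    bin_width = sample_rate / fft_size  # spacing between bin centers, in Hz
    # Assumption: a bin is kept when its center frequency lies within [low, high].
    return sum(1 for k in range(n_bins) if low <= k * bin_width <= high)


SAMPLE_RATE = 44100  # assumption: not given in the thread
FFT_SIZE = 512       # from the [SPECT_PARAMS] example

all_bins = count_freq_bins(FFT_SIZE, SAMPLE_RATE)
kept_bins = count_freq_bins(FFT_SIZE, SAMPLE_RATE, freq_cutoffs=(500, 12000))
print(f"{kept_bins} of {all_bins} frequency bins kept ({kept_bins / all_bins:.0%})")
```

Under these assumptions the cutoffs keep roughly half of the frequency bins, so the spectrogram passed to the network would be about half as tall for the same duration.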
I wish we had better numbers on the durations / memory use already to give you -- I did some quick tests to give you a ballpark:
Looks like your GPU has 2 GB according to the error from pytorch? Do you see the same thing if you run nvidia-smi in the terminal?
Again, really sorry we don't have a more general solution ready, we know this is an issue -- see for example #514. Ideally we'd estimate based on a user's hardware what we can fit on the GPU and then work with that.
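As a very rough illustration of the kind of estimate meant here, the sketch below computes only the size of the input spectrogram for one file, stored as float32. It assumes a 44.1 kHz sampling rate (not stated in this thread) and the fft_size and step_size from the example config; the function name and the frame-count approximation are hypothetical. It ignores the network's activations, which typically dominate GPU memory use, so it gives a lower bound and not a prediction of whether a file will fit.

```python
def spectrogram_bytes(duration_s, sample_rate, fft_size, step_size, bytes_per_value=4):
    """Approximate size in bytes of one float32 spectrogram (input tensor only)."""
    n_freq_bins = fft_size // 2 + 1
    # Approximation: one time bin per step; exact counts depend on padding.
    n_time_bins = int(duration_s * sample_rate) // step_size
    return n_freq_bins * n_time_bins * bytes_per_value


SAMPLE_RATE = 44100  # assumption: not given in the thread

for duration_s in (10, 30, 60):
    n_bytes = spectrogram_bytes(duration_s, SAMPLE_RATE, fft_size=512, step_size=64)
    print(f"{duration_s:>3} s of audio -> about {n_bytes / 2**20:.1f} MiB of input")
```

The point of the sketch is the scaling: the input grows linearly with file duration, which is why dropping or clipping the longest files is a reasonable first workaround.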
In the meantime I'm happy to work with you to find a workaround.
Please let me know what you figure out about the file size. I can answer more questions too if needed.
Hi @NickleDave, thank you so much for your prompt response!
> Did you happen to work through this tutorial and if so were you able to run predict on all the files in that dataset? That at least tells us you can predict on some files, and also gives us a lower bound on the largest size file you can predict on.
Yes, I tried working through the tutorial today and was able to quickly run every step successfully.
> In the console output you provided, it looks like the crash happened on file 5. What we want to figure out is if we can run predict if you remove that file and all files larger than it. Here is how you might do that.
I tried to do what you suggested here, and indeed I think the problem is the file size. What I did was copy my predict dataset, along with the metadata.json and the predict_prep_240227_171934.csv, into a new troubleshooting folder. Then I ran the script you gave me (I added a couple of other lines for anyone else having the same problem):
import pathlib
import vak
import pandas as pd
DATASET_DIR = pathlib.Path(
'/mnt/c/Users/Lenovo/Documents/GitHub/willowtit-project/bioacoustic/vak_train/vak_troubleshoot/data'
)
FILE_NUMBER_THAT_CAUSES_CRASH = 5
metadata = vak.datasets.frame_classification.Metadata.from_dataset_path(DATASET_DIR)
dataset_csv_path = DATASET_DIR / metadata.dataset_csv_filename
df = pd.read_csv(dataset_csv_path)
max_dur = df.loc[FILE_NUMBER_THAT_CAUSES_CRASH, 'duration']
new_df = df[df.duration < max_dur]
assert len(new_df) < len(df) # make sure that worked
assert new_df['duration'].max() < max_dur # make extra extra sure, because we are scientists
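# overwrite the csv in place; note that to_csv also writes the row index as an extra column unless index=False is passed
new_df.to_csv(dataset_csv_path)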
Then I copied the rewritten csv file into the original predict dataset and reran vak predict. I ran into another problem, which looks like the following:
2024-02-28 11:53:45,427 - vak.cli.predict - INFO - vak version: 1.0.0a3
2024-02-28 11:53:45,427 - vak.cli.predict - INFO - Logging results to /mnt/c/Users/Lenovo/Documents/GitHub/willowtit-project/bioacoustic/vak_train/predict
2024-02-28 11:53:45,444 - vak.predict.frame_classification - INFO - loading SpectScaler from path: /mnt/c/Users/Lenovo/Documents/GitHub/willowtit-project/bioacoustic/vak_train/results_240227_155358/StandardizeSpect
2024-02-28 11:53:45,448 - vak.predict.frame_classification - INFO - loading labelmap from path: /mnt/c/Users/Lenovo/Documents/GitHub/willowtit-project/bioacoustic/vak_train/results_240227_155358/labelmap.json
2024-02-28 11:53:45,466 - vak.predict.frame_classification - INFO - loading dataset to predict from csv path: /mnt/c/Users/Lenovo/Documents/GitHub/willowtit-project/bioacoustic/vak_train/predict/predict-vak-frame-classification-dataset-generated-240227_171934/predict_prep_240227_171934.csv
2024-02-28 11:53:45,494 - vak.predict.frame_classification - INFO - will save annotations in .csv file: /mnt/c/Users/Lenovo/Documents/GitHub/willowtit-project/bioacoustic/vak_train/predict/willowtit_predict.annot.csv
2024-02-28 11:53:45,499 - vak.predict.frame_classification - INFO - Duration of a frame in dataset, in seconds: 0.00145
2024-02-28 11:53:45,575 - vak.predict.frame_classification - INFO - Shape of input to networks used for predictions: torch.Size([1, 257, 176])
2024-02-28 11:53:45,576 - vak.predict.frame_classification - INFO - instantiating model from config:/nTweetyNet
2024-02-28 11:53:45,597 - vak.predict.frame_classification - INFO - loading checkpoint for TweetyNet from path: /mnt/c/Users/Lenovo/Documents/GitHub/willowtit-project/bioacoustic/vak_train/results_240227_155358/TweetyNet/checkpoints/max-val-acc-checkpoint.pt
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2024-02-28 11:53:46,587 - vak.predict.frame_classification - INFO - running predict method of TweetyNet
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting DataLoader 0: 89%|████████▉ | 24/27 [00:05<00:00, 4.46it/s]
Traceback (most recent call last):
File "/home/rifsyy/anaconda3/envs/vak_env/bin/vak", line 8, in <module>
sys.exit(main())
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/vak/__main__.py", line 48, in main
cli.cli(command=args.command, config_file=args.configfile)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/vak/cli/cli.py", line 54, in cli
COMMAND_FUNCTION_MAP[command](toml_path=config_file)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/vak/cli/cli.py", line 22, in predict
predict(toml_path=toml_path)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/vak/cli/predict.py", line 48, in predict
predict_module.predict(
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/vak/predict/predict_.py", line 141, in predict
predict_with_frame_classification_model(
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/vak/predict/frame_classification.py", line 239, in predict_with_frame_classification_model
results = trainer.predict(model, pred_loader)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 864, in predict
return call._call_and_handle_interrupt(
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 903, in _predict_impl
results = self._run(model, ckpt_path=ckpt_path)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
return self.predict_loop.run()
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/loops/prediction_loop.py", line 119, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 127, in __next__
batch = super().__next__()
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 56, in __next__
batch = next(self.iterator)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 326, in __next__
out = next(self._iterator)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 132, in __next__
out = next(self.iterators[0])
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/rifsyy/anaconda3/envs/vak_env/lib/python3.10/site-packages/vak/datasets/frame_classification/frames_dataset.py", line 82, in __getitem__
source_path = self.source_paths[idx]
IndexError: index 24 is out of bounds for axis 0 with size 24
Predicting DataLoader 0: 89%|████████▉ | 24/27 [00:05<00:00, 4.05it/s]
I am not sure where the error came from, but I was able to work around it by rerunning vak prep after deleting the longest wav file from the predict dataset, followed by rerunning vak predict. But then I still ran into another OOM error message, and I had to go through the troubleshooting step you suggested once more. In the end, I successfully ran vak predict with 19 wav files in my predict dataset, ranging from 7 to 36 seconds.
> use different spectrogram parameters to make the spectrogram smaller, e.g. by setting limits on the frequencies using the freq_cutoffs option:
I am checking the annotation output from vak predict as I am writing this now, so I have not yet tried the frequency cutoffs solution you suggested here,
> make clips of the audio, e.g. in Raven or with a Python script -- I can help with that if we need to
nor this solution. This will be my next step after checking what the TweetyNet prediction results for my dataset look like. I happen to have a segmented version of my full dataset (previously segmented with warbleR), which I can try as another predict dataset for vak.
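For the clip-making option, a minimal sketch using only Python's standard-library wave module might look like the following. The function name split_wav, the clip naming scheme, and the 10-second default are all hypothetical choices for illustration. The sketch only handles uncompressed PCM WAV files and cuts at fixed intervals, so a vocalization that straddles a boundary will be split across two clips.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, clip_seconds=10.0):
    """Split a PCM WAV file into consecutive clips of at most clip_seconds each.

    Returns the list of paths of the clips that were written.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    clip_paths = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames_per_clip = int(clip_seconds * params.framerate)
        if frames_per_clip <= 0:
            raise ValueError("clip_seconds is too short for this sampling rate")
        clip_number = 0
        while True:
            # readframes returns fewer frames (or none) at the end of the file
            frames = reader.readframes(frames_per_clip)
            if not frames:
                break
            clip_path = out_dir / f"{src.stem}_clip{clip_number:03d}.wav"
            with wave.open(str(clip_path), "wb") as writer:
                writer.setnchannels(params.nchannels)
                writer.setsampwidth(params.sampwidth)
                writer.setframerate(params.framerate)
                writer.writeframes(frames)
            clip_paths.append(clip_path)
            clip_number += 1
    return clip_paths


# Hypothetical usage: split_wav("recording.wav", "clips", clip_seconds=10.0)
```

The resulting folder of clips could then be given to vak prep as a new predict dataset, in the same way as the segmented version of the dataset mentioned above.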
> Looks like your GPU has 2 GB according to the error from pytorch? Do you see the same thing if you run nvidia-smi in the terminal?
Yes, I saw the same thing when I ran nvidia-smi. I'm ashamed to say that I have a low-spec personal laptop.
Again, many thanks for the help!
Best, Rifa
Hi @NickleDave and everyone, it's me again, sorry. I ran into a torch.cuda.OutOfMemoryError when I was running vak predict on my dataset. Please find below the error message:
and my predict.toml file looks like this:
I saw the similar issue #301; or is it a different case? I tried removing longer files from my predict dataset, but it still did not work. My predict dataset is 27 wav files from 7 to 52 seconds long. Can't help thinking this is probably a beginner issue! How can I fix this? Thanks in advance for your help!
Best, Rifa