Closed: sjaenick closed this issue 1 month ago
Hi @sjaenick,
Thanks for your report. The relevant lines here are:
[2024-02-27 12:39:17.352] [warning] Auto batchsize detection failed. Less than 1GB GPU memory available.
[2024-02-27 12:39:17.355] [info] - set batch size for cuda:0 to 64
[2024-02-27 12:39:17.453] [warning] Caught Torch error 'CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 11.76 GiB total capacity; 89.90 MiB already allocated; 612.12 MiB free; 132.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::OutOfMemoryError'
Dorado is telling you that you have < 1GB of memory available and that it has set the batch size to 64 (which is the minimum batch size possible due to the granularity of the CUDA kernels). It turns out that this still requires > 1GB of memory, which is why this is failing. We can certainly improve things by checking that the selected batch size will not require more memory than is available and providing a less verbose error message than libtorch does in this instance.
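For illustration, a pre-flight check along these lines might be enough. This is a minimal sketch, not Dorado's actual code; `bytes_per_batch_element` is a hypothetical per-batch-element memory estimate:

```cpp
// Sketch of a pre-flight check: query free GPU memory and refuse to proceed
// if even the minimum batch size of 64 would not fit, instead of letting
// libtorch throw later. Not Dorado's real implementation.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <optional>

std::optional<int> select_batch_size(int device,
                                     std::size_t bytes_per_batch_element,  // hypothetical estimate
                                     int min_batch_size = 64) {
    if (cudaSetDevice(device) != cudaSuccess) {
        return std::nullopt;  // could not select the device
    }
    std::size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        return std::nullopt;  // could not query device memory
    }
    // Largest number of batch elements that fits in the memory reported as free.
    const std::size_t max_elems = free_bytes / bytes_per_batch_element;
    if (max_elems < static_cast<std::size_t>(min_batch_size)) {
        std::fprintf(stderr,
                     "Only %zu MiB free on cuda:%d; even the minimum batch size "
                     "of %d does not fit.\n",
                     free_bytes >> 20, device, min_batch_size);
        return std::nullopt;  // caller can exit cleanly with a clear message
    }
    // Round down to the kernel granularity (multiples of the minimum batch size).
    return static_cast<int>((max_elems / min_batch_size) * min_batch_size);
}
```

In practice the per-element estimate depends on the model and chunk size, so this is only meant to show where such a check could live.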
Thanks for the fast reply; I guess any kind of meaningful error message and a non-zero exit code (rather than a segfault) would be fine.
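For example, something like the following, a minimal sketch assuming a hypothetical `run_basecalling()` entry point, would turn the unhandled `c10::OutOfMemoryError` into a clear message and a non-zero exit code:

```cpp
// Sketch only: catch the libtorch out-of-memory exception at the top level
// and exit with a non-zero status instead of letting it reach std::terminate
// ("terminate called after throwing an instance of 'c10::OutOfMemoryError'").
#include <c10/util/Exception.h>
#include <cstdlib>
#include <iostream>

// Hypothetical stand-in for the existing basecalling pipeline; in the real
// binary this is where libtorch would throw c10::OutOfMemoryError.
void run_basecalling() {
    TORCH_CHECK_WITH(OutOfMemoryError, false, "CUDA out of memory (simulated)");
}

int main() {
    try {
        run_basecalling();
    } catch (const c10::OutOfMemoryError& e) {
        // Report concisely instead of letting the exception escape main().
        std::cerr << "dorado: not enough GPU memory to start basecalling "
                     "(is another instance already using the device?)\n"
                  << e.what_without_backtrace() << '\n';
        return EXIT_FAILURE;  // non-zero exit code, no crash
    }
    return EXIT_SUCCESS;
}
```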
Issue Report
Please describe the issue:
With one dorado instance already running and processing data, I attempted to start a second one, expecting that either the GPU would be shared or that I would receive an error indicating the CUDA device was busy. Instead, the second instance segfaulted while the first one continued to process data.
Run environment:
dorado basecaller --emit-fastq --min-qscore 15 --trim primers --sample-sheet sample_sheet_FAX85449_20231220_1156_12b9e8fa.csv sup pod5_pass