jhammery opened 2 years ago
Update: I was able to reproduce exactly the error described in issue #233 by lowering the batch size to 19:
bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4_e8.1_sup@v3.4 --directory /home/domi/bonito_training/ctc-data /home/domi/bonito_training/fine-tuned-model -f --batch 19
[loading data]
[validation set not found: splitting training set]
[loading model]
[using pretrained model dna_r10.4_e8.1_sup@v3.4]
[0/166161]: 0%| | [00:00]
Error - an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/domi/.local/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/cli/train.py", line 97, in main
    trainer.fit(workdir, args.epochs, lr)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 210, in fit
    train_loss, duration = self.train_one_epoch(loss_log, lr_scheduler)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 135, in train_one_epoch
    losses, grad_norm = self.train_one_step(batch)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 99, in train_one_step
    losses = self.criterion(scores, targets, lengths_)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/crf/model.py", line 177, in loss
    return self.seqdist.ctc_loss(scores.to(torch.float32), targets, target_lengths, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/crf/model.py", line 122, in ctc_loss
    logz = logZ_cu(stay_scores, move_scores, target_lengths + 1 - self.state_len)
  File "/home/domi/.local/lib/python3.8/site-packages/koi/ctc.py", line 115, in logZ_cu
    return LogZ.apply(stay_scores, move_scores, target_lengths, _simple_lattice_fwd_bwd_cu, S)
  File "/home/domi/.local/lib/python3.8/site-packages/koi/ctc.py", line 53, in forward
    g = S.dsum(torch.cat([S.mul(alpha[:-1], beta_stay), S.mul(alpha[:-1], beta_move)], dim=2), dim=2)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
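Since the trace ends inside koi's CTC kernels, it may help anyone comparing setups to confirm which builds are actually being picked up inside WSL2. These are standard commands, nothing Bonito-specific:

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
pip list 2>/dev/null | grep -Ei 'torch|koi|bonito'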
Batch sizes smaller than 19 also resulted in the same error; batch sizes greater than 20 resulted in the "RuntimeError: CUDA error: unknown error" described in my original post.
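For anyone else trying to narrow this down, a small sweep over batch sizes makes the two failure modes easy to map. This is only a sketch reusing the paths from my command above; the batch values are just examples:

for b in 16 18 19 20 24 32; do
  echo "=== batch size $b ==="
  bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4_e8.1_sup@v3.4 --directory /home/domi/bonito_training/ctc-data /home/domi/bonito_training/fine-tuned-model-$b -f --batch $b 2>&1 | tail -n 2
done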
I got the same error running bonito train with the following basic command:
bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4_e8.1_sup@v3.4 --directory ctc-data fine-tuned-model
on WSL2 with Ubuntu 20.04.4 LTS (Windows 10 21H2, build 19044.1706). Did you manage to solve this issue?
My CUDA compilation tools are at release V11.7.64, and I am using an NVIDIA GeForce RTX 3080 (10 GB).
@N0toriou5 Unfortunately, I have not been able to solve the problem yet. I am still hoping for a solution from ONT, as I have the feeling that many users are experiencing the same problem.
@jhammery @N0toriou5 This issue is now fixed; see my comment in #275 for instructions on how to get a version of Bonito that solves the problem. You will need to install Bonito from source, but this is simple.
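For reference, installing Bonito from source generally follows the usual clone-and-install pattern below; the authoritative steps are in the repository README and in the #275 comment, so treat this only as the generic shape:

git clone https://github.com/nanoporetech/bonito.git
cd bonito
python3 -m venv venv3 && source venv3/bin/activate
pip install --upgrade pip
pip install -e .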
Hello everybody,
Since Windows 11 and Windows 10 (version 21H2) support PyTorch with NVIDIA CUDA GPU hardware acceleration inside WSL2 (https://docs.microsoft.com/de-de/windows/ai/directml/gpu-cuda-in-wsl), we wanted to test running Bonito. Basecalling worked perfectly with the following command:
bonito basecaller dna_r10.4_e8.1_sup@v3.4 --recursive Fast5_files/ > basecalls.fastq
The Ubuntu 20.04.4 LTS app needed to be opened with administrator privileges to get it started. As a next step, we wanted to train a pre-existing model with our own data. First, we re-basecalled the data using the following command:
bonito basecaller dna_r10.4_e8.1_sup@v3.4 --save-ctc --reference /home/domi/reference_genomes/reference.mmi /home/domi/bonito_training/fast5 > /home/domi/bonito_training/basecalls.sam
This also worked perfectly. However, running the following command resulted in a CUDA error:
bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4_e8.1_sup@v3.4 --directory /home/domi/bonito_training/ctc-data /home/domi/bonito_training/fine-tuned-model
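(As a sanity check on the data itself: assuming the training loader reads chunks.npy, references.npy and reference_lengths.npy from the --directory path, which I have not verified against the source, the saved chunks can be inspected like this:)

ls -lh /home/domi/bonito_training/ctc-data
python3 -c "import numpy as np; a = np.load('/home/domi/bonito_training/ctc-data/chunks.npy', mmap_mode='r'); print(a.shape, a.dtype)"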
Running
CUDA_LAUNCH_BLOCKING=1 bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4_e8.1_sup@v3.4 --directory /home/domi/bonito_training/ctc-data /home/domi/bonito_training/fine-tuned-model
did not resolve the issue. The error is probably connected to issue #233, which was reported earlier.
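For anyone comparing environments, the CUDA stack inside WSL2 can be checked with standard tools (nothing Bonito-specific):

nvidia-smi
nvcc --version
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"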
I am thankful for any suggestions on how to solve this issue.
I am using the following GPU: NVIDIA GeForce RTX 3060 Ti.