nanoporetech / bonito

A PyTorch Basecaller for Oxford Nanopore Reads
https://nanoporetech.com/

Bonito model training using WSL2: "RuntimeError: CUDA error: unknown error" #249

Open jhammery opened 2 years ago

jhammery commented 2 years ago

Hello everybody,

As Windows 11 and Windows 10, version 21H2, support PyTorch with NVIDIA CUDA GPU hardware acceleration inside WSL2 (https://docs.microsoft.com/de-de/windows/ai/directml/gpu-cuda-in-wsl), we wanted to test running bonito. Basecalling worked perfectly with the following command (the Ubuntu 20.04.4 LTS app needed to be opened with administrator privileges to get it started):

```
bonito basecaller dna_r10.4_e8.1_sup@v3.4 --recursive Fast5_files/ > basecalls.fastq
```
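As a sanity check before basecalling, it is worth confirming that PyTorch actually sees the GPU from inside WSL2. A minimal sketch, assuming a standard PyTorch install:

```
# Confirm PyTorch can reach the GPU inside WSL2 (diagnostic sketch,
# not part of the original report).
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
python3 -c "import torch; print(torch.cuda.get_device_name(0))"  # expects a visible CUDA device
```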

As a next step, we wanted to train a pre-existing model with our own data. First we re-basecalled the data using the following command, which also worked perfectly:

```
bonito basecaller dna_r10.4_e8.1_sup@v3.4 --save-ctc --reference /home/domi/reference_genomes/reference.mmi /home/domi/bonito_training/fast5 > /home/domi/bonito_training/basecalls.sam
```
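Training expects the CTC chunks as NumPy arrays in the directory passed to --directory. A quick way to inspect them (a sketch; the file names chunks.npy, references.npy and reference_lengths.npy are what our bonito version wrote and may differ in other versions):

```
# Inspect the CTC training data (file names may vary between bonito versions).
cd /home/domi/bonito_training/ctc-data
python3 -c "
import numpy as np
for name in ('chunks.npy', 'references.npy', 'reference_lengths.npy'):
    print(name, np.load(name).shape)
"
```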

However, running the following command resulted in a CUDA error:

```
bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4_e8.1_sup@v3.4 --directory /home/domi/bonito_training/ctc-data /home/domi/bonito_training/fine-tuned-model
```

```
[loading data]
[validation set not found: splitting training set]
[loading model]
[using pretrained model dna_r10.4_e8.1_sup@v3.4]
[0/166161]: 0%| | [00:00]
Traceback (most recent call last):
  File "/home/domi/.local/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/cli/train.py", line 97, in main
    trainer.fit(workdir, args.epochs, lr)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 210, in fit
    train_loss, duration = self.train_one_epoch(loss_log, lr_scheduler)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 135, in train_one_epoch
    losses, grad_norm = self.train_one_step(batch)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 98, in train_one_step
    scores = self.model(data_)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/crf/model.py", line 166, in forward
    return self.encoder(x)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/nn.py", line 41, in forward
    return super().forward(x)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/nn.py", line 178, in forward
    y, h = self.rnn(x)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 691, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Re-running the same command with CUDA_LAUNCH_BLOCKING=1 set did not solve the issue:

```
CUDA_LAUNCH_BLOCKING=1 bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4_e8.1_sup@v3.4 --directory /home/domi/bonito_training/ctc-data /home/domi/bonito_training/fine-tuned-model
```
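Since the trace ends inside _VF.lstm, the failure may not be specific to bonito at all. A minimal diagnostic sketch (my own, with arbitrary sizes) that exercises a cuDNN LSTM forward and backward pass on the GPU:

```
# Standalone CUDA LSTM test, to separate a general cuDNN-under-WSL2
# problem from a bonito-specific one (diagnostic sketch).
python3 - <<'EOF'
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=512).cuda()
x = torch.randn(100, 32, 512, device='cuda')  # (seq_len, batch, features)
y, _ = lstm(x)
y.sum().backward()
torch.cuda.synchronize()
print('LSTM forward/backward OK')
EOF
```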

The error is probably connected to issue #233, which was reported earlier.

I would be thankful for any suggestions on how to solve this issue.

I am using the following GPU: NVIDIA GeForce RTX 3060 Ti. nvidia-smi reports:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 512.15       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   47C    P8    12W / 220W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
jhammery commented 2 years ago

Update: by empirically lowering the batch size to 19, I was able to reproduce exactly the error described in issue #233:

```
bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4_e8.1_sup@v3.4 --directory /home/domi/bonito_training/ctc-data /home/domi/bonito_training/fine-tuned-model -f --batch 19
```

```
[loading data]
[validation set not found: splitting training set]
[loading model]
[using pretrained model dna_r10.4_e8.1_sup@v3.4]
[0/166161]: 0%| | [00:00]
Error - an illegal memory access was encountered

Traceback (most recent call last):
  File "/home/domi/.local/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/cli/train.py", line 97, in main
    trainer.fit(workdir, args.epochs, lr)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 210, in fit
    train_loss, duration = self.train_one_epoch(loss_log, lr_scheduler)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 135, in train_one_epoch
    losses, grad_norm = self.train_one_step(batch)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 99, in train_one_step
    losses = self.criterion(scores, targets, lengths_)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/crf/model.py", line 177, in loss
    return self.seqdist.ctc_loss(scores.to(torch.float32), targets, target_lengths, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/crf/model.py", line 122, in ctc_loss
    logz = logZ_cu(stay_scores, move_scores, target_lengths + 1 - self.state_len)
  File "/home/domi/.local/lib/python3.8/site-packages/koi/ctc.py", line 115, in logZ_cu
    return LogZ.apply(stay_scores, move_scores, target_lengths, _simple_lattice_fwd_bwd_cu, S)
  File "/home/domi/.local/lib/python3.8/site-packages/koi/ctc.py", line 53, in forward
    g = S.dsum(torch.cat([S.mul(alpha[:-1], beta_stay), S.mul(alpha[:-1], beta_move)], dim=2), dim=2)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Batch sizes smaller than 19 also resulted in the same error; batch sizes greater than 20 resulted in the "RuntimeError: CUDA error: unknown error" described in my original post.
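To map which batch sizes trigger which of the two errors, a simple sweep over the training command can be used (a sketch reusing the paths and flags from above):

```
# Sweep batch sizes to see which values give "illegal memory access"
# and which give "unknown error" (diagnostic sketch).
for b in 8 16 19 20 32 64; do
    echo "=== batch size $b ==="
    bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4_e8.1_sup@v3.4 \
        --directory /home/domi/bonito_training/ctc-data \
        /home/domi/bonito_training/fine-tuned-model -f --batch $b
done
```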

N0toriou5 commented 2 years ago

I got the same error running bonito train with the following basic command:

```
bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4_e8.1_sup@v3.4 --directory ctc-data fine-tuned-model
```

on a WSL2 Ubuntu 20.04.4 LTS installation, Windows 10 21H2, build 19044.1706. Did you manage to solve this issue?

My CUDA compilation tools are updated to release V11.7.64, and I am using the following GPU: NVIDIA GeForce RTX 3080 10 GB.
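(That version string comes from the compiler; a quick way to check it:)

```
# Report the CUDA compilation tools version; prints a line such as
# "Cuda compilation tools, release 11.7, V11.7.64".
nvcc --version
```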

jhammery commented 2 years ago

@N0toriou5 Unfortunately, I have not been able to solve the problem yet. I am still hoping for a solution from ONT, as I have the feeling that many users are experiencing the same problem.

vellamike commented 2 years ago

@jhammery @N0toriou5 this issue is now fixed (see my comment in #275 for instructions on how to get a version of Bonito that solves the problem; you will need to install Bonito from source, but this is simple).
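For reference, a typical install-from-source flow looks roughly like the sketch below; see the bonito README and #275 for the authoritative steps.

```
# Generic install-from-source sketch (exact steps per the README / #275).
git clone https://github.com/nanoporetech/bonito.git
cd bonito
python3 -m venv venv3
source venv3/bin/activate
pip install --upgrade pip
pip install -e .
```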