nanoporetech / bonito

A PyTorch Basecaller for Oxford Nanopore Reads
https://nanoporetech.com/

Chunks.npy and Dataset.py not being generated #385

Open Sgreenfield9 opened 5 months ago

Sgreenfield9 commented 5 months ago

Hello, I'm trying to train a basecaller using DNA that has been run through an RNA pore. When I run the following code:

bonito basecaller dna_r10.4.1_e8.2_260bps_sup@v4.1.0 --min-accuracy-save-ctc 0 --reference /home/remote /data/minknow/PolyA_DNA_SG/PolyA_DNA_SG/20240320_1335_P2S-01618-B_PAU71604_94a542e0/fast5_pass > /home/remote/basecalls.sam

I receive the following output:

> calling: 100%|###########################################9| 8969/8979 [15:08<0
> completed reads: 8979
duration: 0:15:13
samples per second 1.6E+06
done

No errors are thrown, so I assume everything is going fine. The issue arises when I try to run the subsequent bonito train command:

bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4.1_e8.2_260bps_sup@v4.1.0 --directory /home/remote /home/remote/fine-tuned-model

[loading model]
[using pretrained model dna_r10.4.1_e8.2_260bps_sup@v4.1.0]
[loading data]
Traceback (most recent call last):
  File "/home/remote/.local/lib/python3.8/site-packages/bonito/cli/train.py", line 58, in main
    train_loader_kwargs, valid_loader_kwargs = load_numpy(
  File "/home/remote/.local/lib/python3.8/site-packages/bonito/data.py", line 40, in load_numpy
    train_data = load_numpy_datasets(limit=limit, directory=directory)
  File "/home/remote/.local/lib/python3.8/site-packages/bonito/data.py", line 66, in load_numpy_datasets
    chunks = np.load(os.path.join(directory, "chunks.npy"), mmap_mode='r')
  File "/home/remote/.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 405, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/home/remote/chunks.npy'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/remote/.local/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/remote/.local/lib/python3.8/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/remote/.local/lib/python3.8/site-packages/bonito/cli/train.py", line 62, in main
    train_loader_kwargs, valid_loader_kwargs = load_script(
  File "/home/remote/.local/lib/python3.8/site-packages/bonito/data.py", line 31, in load_script
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 844, in exec_module
  File "<frozen importlib._bootstrap_external>", line 980, in get_code
  File "<frozen importlib._bootstrap_external>", line 1037, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/home/remote/dataset.py'

When I look at the directory I wrote my files to (/home/remote), I find that only a .sam file has been generated; chunks.npy has not. Is my chunks.npy file not being written, or is it being written to another location? Any help would be greatly appreciated.
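For anyone checking the same thing, here is a minimal sketch that lists which training files are absent from a directory before running bonito train. Only chunks.npy is confirmed by the traceback above; references.npy and reference_lengths.npy are assumed names for the companion arrays, and missing_training_files is a hypothetical helper, not part of bonito.

```python
import os

# chunks.npy comes from the traceback; the other two names are assumptions
# about which companion arrays the training loader expects.
EXPECTED_FILES = ("chunks.npy", "references.npy", "reference_lengths.npy")


def missing_training_files(directory, expected=EXPECTED_FILES):
    """Return the expected training files that are absent from `directory`."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

Calling missing_training_files("/home/remote") in the situation described here would be expected to report at least chunks.npy, since only the .sam file was written.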

lkwhite commented 5 months ago

Hi all, I am also trying to troubleshoot this issue with Sam, and it's a bit unclear what files should be generated during this step.

bonito basecaller dna_r9.4.1 --save-ctc --reference reference.mmi /data/reads > /data/training/ctc-data/basecalls.sam

Is there a test dataset available that users can work through from the beginning to see what the expected outputs of calling bonito basecaller should be? I see that you have a pre-prepared dataset users can try if they don't have their own reads, but in this case we are trying to prepare our own reads and understand what these errors mean. It's a bit confusing since I do not see any reference to dataset.py in the source code itself.
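On the dataset.py question, the traceback suggests that train.py first tries a NumPy loader (line 58) and, when that fails, falls back to a script loader (line 62) that looks for dataset.py in the same directory. The sketch below only illustrates that fallback pattern and why two chained errors are printed. The function bodies are stand-ins written for this example, not bonito's actual code.

```python
import os


def load_numpy(directory):
    # Stand-in for the first loader: needs a prepared chunks.npy.
    path = os.path.join(directory, "chunks.npy")
    with open(path, "rb"):
        return path


def load_script(directory):
    # Stand-in for the fallback loader: needs a user-supplied dataset.py.
    path = os.path.join(directory, "dataset.py")
    with open(path, "rb"):
        return path


def load_training_data(directory):
    try:
        return load_numpy(directory)
    except FileNotFoundError:
        # An error raised here is chained to the first one, which Python
        # reports as "During handling of the above exception, another
        # exception occurred".
        return load_script(directory)
```

Under this reading, the dataset.py error is a side effect of the missing chunks.npy rather than a separate problem: once the .npy files exist, the fallback is never reached.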

iiSeymour commented 5 months ago

@Sgreenfield9 your reads need to map to your reference for any training data to be created.

Can you confirm your reads map? You can check in the basecalls_summary.csv that is created.
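A different way to check mapping, if the summary file is not at hand, is to count mapped and unmapped records in the SAM output itself. This is a rough sketch that assumes the output is plain-text SAM with standard FLAG bits (0x4 for unmapped, 0x100 and 0x800 for secondary and supplementary alignments); count_mapped_reads is a hypothetical helper, not a bonito function.

```python
def count_mapped_reads(sam_path):
    """Return (mapped, unmapped) counts of primary records in a SAM file."""
    mapped = 0
    unmapped = 0
    with open(sam_path) as handle:
        for line in handle:
            # Skip blank lines and header lines, which start with "@".
            if not line.strip() or line.startswith("@"):
                continue
            flag = int(line.split("\t")[1])
            # Ignore secondary (0x100) and supplementary (0x800) alignments
            # so that each read is counted once.
            if flag & 0x900:
                continue
            if flag & 0x4:
                unmapped += 1
            else:
                mapped += 1
    return mapped, unmapped
```

If the mapped count for /home/remote/basecalls.sam comes back as zero, that would be consistent with no training data having been written.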

Also, note that you seem to be passing /home/remote as --reference which looks like a mistake.
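To illustrate how the original command line is read, here is a small argparse sketch. It is a hypothetical stand-in with only three arguments and does not reproduce bonito's real parser; reference_is_directory is likewise an illustrative helper. It shows that the value right after --reference is taken as the reference, so /home/remote (a directory) lands there and the fast5_pass path becomes the reads directory.

```python
import argparse
import os

# Hypothetical, simplified parser used only to show how the arguments split.
parser = argparse.ArgumentParser(prog="basecaller-sketch")
parser.add_argument("model")
parser.add_argument("reads_directory")
parser.add_argument("--reference")

args = parser.parse_args([
    "dna_r10.4.1_e8.2_260bps_sup@v4.1.0",
    "--reference", "/home/remote",
    "/data/minknow/PolyA_DNA_SG/PolyA_DNA_SG/"
    "20240320_1335_P2S-01618-B_PAU71604_94a542e0/fast5_pass",
])


def reference_is_directory(path):
    """Return True when the reference path points at a directory, not a file."""
    return os.path.isdir(path)


print("reference:", args.reference)
print("reads directory:", args.reads_directory)
```

A reference is normally a file, such as the reference.mmi used in the example command earlier in the thread, so a directory in that position is a sign the argument is missing.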