nanoporetech / bonito

A PyTorch Basecaller for Oxford Nanopore Reads
https://nanoporetech.com/
Other
382 stars 118 forks source link

`> no suitable ctc data to write` when basecalling with `save-ctc` flag #308

Open HanielF opened 1 year ago

HanielF commented 1 year ago

Hi~ I am trying to train a bonito model from scratch.

To obtain my training data, i basecalled the reads with the command as below:

data_path="~/codes/bonito/data"
key="Acinetobacter_baumannii_AYP-A2"

bonito basecaller "dna_r9.4.1_e8.1_fast@v3.4" "${data_path}/train_data/${key}" \
   --save-ctc \
   --reference "${data_path}/train_data/${key}/read_references.fasta" \
   --batchsize 100 \
   --recursive \
   > "${data_path}/train_data/${key}/calls.sam"

It will raise an error, > no suitable ctc data to write, in bonito.io.CTCWriter.run(). It means that none of the reads can pass the checks. To improve the accuracy, I replaced the fast model with dna_r9.4.1_e8.1_sup@v3.3. Several hundred reads passed the checks this time.

Also, it's weird that only 413 reads were saved to CTC data, while a total of 100960 reads were input.

I checked the code in bonito.io.CTCWriter and found that most of the reads are filtered by self.min_accuracy, which is set to 0.99 by default. Here is the statistical results for ctc-data:

......
> ctc results filtering: : 100960it [03:49, 440.28it/s]
Continue! self.min_accuracy: 0.99, acc: 0.8582375478927203 
Continue! self.min_accuracy: 0.99, acc: 0.9616935483870968 
Continue! self.min_accuracy: 0.99, acc: 0.960446247464503 
Continue! self.min_accuracy: 0.99, acc: 0.943952802359882 
Continue! self.min_accuracy: 0.99, acc: 0.9434968017057569 
Continue! self.min_accuracy: 0.99, acc: 0.967741935483871 
Continue! self.min_accuracy: 0.99, acc: 0.9537401574803149 
=== Error analysis ===
seq_err: 0, mapping_err: 38, refseq_err: 0, acc_err: 100501, cov_err: 0
> written ctc training data
  - chunks.npy with shape (413,10000)
  - references.npy with shape (413,1053)
  - reference_lengths.npy shape (413)
> completed reads: 100960
> duration: 0:03:49

Obviously, the CTC training data is not enough even though that is only one of 50 genomes for the whole training set. Do I need to lower the value of the min_accuracy parameter?

andreaswallberg commented 7 months ago

Dear @HanielF and @iiSeymour

We appear to be in a similar position. Several millions of reads as input, but only tens of thousands of reads appear to be used. My expectation was that much more of the raw data could be used.

Any suggestions on how to overcome this?

mark-sor2 commented 1 month ago

Hi @iiSeymour,

I have same issue. I have Pod5 which containes multiplexed reads.

1) In a first experiment I generated susbet pod5 where each one contains reads related to one unique reference sequence (I used the read id after demultiplexing to perform this operation, although some reads were not found in the pod5 files).

bonito basecaller dna_r10.4.1_e8.2_400bps_sup@v4.3.0 --min-accuracy-save-ctc 0.7 --save-ctc --reference ./Fasta/myRef.mmi ./Pod5/myRefPod5/ > ./ctcData/myRef.sam

The basecalling stop after few seconds and doesn't seems to work, I got next result :

reading pod5 outputting aligned sam loading model dna_r10.4.1_e8.2_400bps_sup@v4.3.0 model basecaller params: {'batchsize': 32, 'chunksize': 9996, 'overlap': 492, 'quantize': None} loading reference read scaling: {'strategy': 'pa'} no suitable ctc data to write
completed reads: 61 duration: 0:00:05 samples per second 1.3E+05 done

1) In a second experiment I used all the pod5 (all the multiplexed reads) and provided one reference. I assumed that only the reads related to that reference will remain after the alignment operation. In this case the basecalling seems to works for a few minutes but then stops (at ~15% of the operaiton) and then it says "no suitable ctc data to write". below is the command I used and the resulting output :

bonito basecaller dna_r10.4.1_e8.2_400bps_sup@v4.3.0 --min-accuracy-save-ctc 0.7 --save-ctc --reference ./Fasta/myRef.mmi ./Pod5/allPod5/ > ./ctcData/myRef.sam

reading pod5 outputting aligned sam loading model dna_r10.4.1_e8.2_400bps_sup@v4.3.0 model basecaller params: {'batchsize': 96, 'chunksize': 9996, 'overlap': 492, 'quantize': None} loading reference read scaling: {'strategy': 'pa'} no suitable ctc data to write
completed reads: 78814 duration: 0:06:59 samples per second 1.9E+06 done

My reference sequence contain around 400 bases, and the '.mmi' file was generated using ">minimap2 -x map-ont -d ./Fasta/myRef.mmi ./Fasta/myRef.fasta" command. I do not understand why I cannot perform the basecallinf using --save-ctc flag. Note that the basecalling performs well when not using this flag.

thanks in advance for your answer