nanoporetech / bonito

A PyTorch Basecaller for Oxford Nanopore Reads
https://nanoporetech.com/

Preparing training data for bonito #137

Open ktan8 opened 3 years ago

ktan8 commented 3 years ago

We've noticed some issues with the basecalling of some repeat elements in the PromethION dataset our lab generated. To address this, we plan to tune the original pre-trained model by feeding it additional examples of these sequences.

As highlighted on the main page, the way to prepare the training data for model tuning is to add the arguments `--save-ctc --reference reference.mmi` at the basecalling step.

To understand how the training data is structured/prepared during this step, I had a rudimentary look at the bonito source code. From what I understand (and correct me if I am wrong), this is what happens:

1. Basecalls for the read are generated using the initial Bonito model.
2. The sequence generated by this initial model is then mapped to the provided reference sequence using mappy (the Python bindings for minimap2).
3. The nanopore signal is saved to `chunks.npy` in chunks of 4000 timeframes. At the same time, the correct 'target' sequence is extracted from the reference genome based on the mapping of the Bonito-basecalled sequence. This 'target' sequence is stored in `references.npy` (encoded as 1-4, representing A, C, G, T), and the length of each target sequence is stored in `reference_lengths.npy`.
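If I've read the code right, the target encoding in step 3 can be sketched like this (my own illustration of the on-disk layout, not bonito's actual code; the dtypes and padding width are assumptions):

```python
import numpy as np

# Illustrative sketch of the --save-ctc output described above.
# Bases A, C, G, T are encoded as 1-4, with 0 used for padding.
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode_reference(seq, target_len):
    """Encode a reference sequence as integers, zero-padded to target_len."""
    encoded = np.zeros(target_len, dtype=np.uint8)
    encoded[: len(seq)] = [ENCODING[base] for base in seq]
    return encoded

# A toy example: one 4000-sample signal chunk and its target sequence.
chunk = np.zeros((1, 4000), dtype=np.float32)    # -> chunks.npy
target = encode_reference("ACGT", target_len=8)  # -> references.npy
length = np.array([4], dtype=np.uint16)          # -> reference_lengths.npy

print(target)  # [1 2 3 4 0 0 0 0]
```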

This leads me to a few questions: a) If the sequence of interest is missing from the reference genome/sequences (e.g. centromeres), then these reads will probably not be mapped. As such, am I right to say that no 'target' sequences would be generated for such reads?

b) Similarly, if the basecalls from the initial bonito model are incorrect over a large stretch of the read, then these 'incorrect' regions will not map properly to the reference genome. As such, I'm guessing that these regions will either be unrepresented or represented incorrectly in the training data created with the approach highlighted above?

Thanks for your help!

ktan8 commented 3 years ago

On further inspection, it seems to me that if the read does not map to the reference sequence/genome provided, then it will be absent from the training dataset.

lh3 commented 3 years ago

I want to echo @ktan8's point b). With the Q20 chemistry, Nanopore accuracy is approaching the level of human heterozygosity, so the differences between the reference and the training individual will start to matter. I guess high-quality training data may greatly help push the accuracy further. These days, the best human sample for training purposes is CHM13: it is homozygous and has a near-perfect assembly through telomeres and centromeres, where reads are often mismapped if you use GRCh38.

A practical concern is that CHM13 is, I think, not currently commercially available, but you can acquire DNA from research labs. The T2T consortium is also trying to bank the cell line for wider use.

clive-g-brown commented 3 years ago

Hi, we're looking at moving to training on 'random' synthetic DNA [that has been verified on ILMN]. The synthetics call well so far using a reference model, but the idea is to go in the other direction: once we have this working we won't need references, and the training sets and models can also be made available. They can replace the existing ones or augment them. I would then like to move on to putting modified bases at known positions, and include those in future models.

lh3 commented 3 years ago

That is great, and it should address @ktan8's point a) as well. It would be good to also have enough low-complexity sequences in the training set. These can be STRs, VNTRs, or anything that "looks simple". For example, the following is a real human insertion:

TAGCCCCCTCTCCTTTCTCCTCTCTATCCCCCTCTCCTTTCTCCCTCCATCCCCCTCTCCTTTCTCCTCTCCATCCCCCTCTCCTTTCTCCTCTCCATCCCCCTCTCCTTTCTCCTCTCCATCCCCCTCTCCTTTCTCCCTCTCTACCCCCCCTCCATCCCCCTCTCCTTTCTCCTCCCCACCCCCTCTCCTTTCTCCTCTCCAACCCCCTCTCCTTTCTCCTCTCCATCCTCCTCTTTCTCCCTCTCCATCCCCCTCTCCTTTCTCCTCTCCTTTCTCCTCCCCATCTCCTCTCCATCCCCCTCTCCATCCCCCCTCCATCCCCCTCTCCTTTCTCTTCCCCACCCCCTCTTTCTCCTCTCCATCCCCCTCTCCTTTCTCCCTCTCCACCCCCCTCTCCATCCCCCTCTCCTTTCTCCCTCTCCATCCCCCTCTCCATCCCCCTCTCCTTTCTCCTCTCCATCCTCCTCTTTCTCCCTCTCCATCCCCCTCTCCATCCCCCTCTCCTTTCTCCTCTCCATCCTCCTCTTTCTCCCTCTCCATCC

A few years ago when I looked at Nanopore data, the base accuracy tended to get lower in such regions. Not sure about now.
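Out of curiosity, that insertion is a tandem repeat with a unit of roughly 24 bp. A quick way to see this (my own sketch, unrelated to bonito) is to check how often a base matches the base 24 positions downstream; a random sequence would match only around a quarter to a third of the time, while this one matches far more often. Using just the first ~120 bp of the insertion:

```python
# Excerpt: the first ~120 bp of the insertion quoted above.
seq = (
    "TAGCCCCCTCTCCTTTCTCCTCTCTAT"
    "CCCCCTCTCCTTTCTCCCTCCAT"
    "CCCCCTCTCCTTTCTCCTCTCCAT"
    "CCCCCTCTCCTTTCTCCTCTCCAT"
    "CCCCCTCTCCTTTCTCCTCTCCAT"
)

# Fraction of positions whose base repeats 24 bp downstream.
period = 24
positions = len(seq) - period
matches = sum(seq[i] == seq[i + period] for i in range(positions))
fraction = matches / positions
print(f"{fraction:.0%} of bases repeat at period {period}")
```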

jackwadden commented 3 years ago

@ktan8 I'm facing a similar issue. One possible workaround (hack) might be to just concatenate your ground truth sequences to the input reference. I can't imagine a relatively long read not mapping to the ground truth segment, but for shorter, low-complexity sequences, I see your point. Having a way to easily pass training examples (fast5/fasta tuples) to update an existing high-accuracy model would be nice.
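The concatenation idea can be sketched as follows (file names are placeholders and toy sequences stand in for real data):

```shell
# Toy illustration of the workaround: append the ground-truth sequences
# to the reference before indexing, so reads from those regions have
# somewhere to map. reference.fa / ground_truth.fa are placeholder names.
printf '>chr_example\nACGTACGTACGT\n' > reference.fa
printf '>ground_truth_1\nTTAGGGTTAGGG\n' > ground_truth.fa

cat reference.fa ground_truth.fa > combined.fa  # combined reference
grep -c '^>' combined.fa                        # → 2 records

# Then index the combined reference for bonito's --reference flag:
#   minimap2 -x map-ont -d combined.mmi combined.fa
```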

I'm also trying to work through this methodology, but haven't been successful yet.

Good luck.

iiSeymour commented 3 years ago

@ktan8 yes, that's correct.

@jackwadden you can fine-tune a released model with bonito like so -

$ mkdir my-training-data
$ bonito basecaller dna_r9.4.1 --reference ref.mmi --save-ctc my-reads > my-training-data
$ bonito train --epochs 1 --lr 5e-4 --pretrained dna_r9.4.1 --directory my-training-data my-fine-tuned-model

jackwadden commented 3 years ago

Thanks @iiSeymour. I've found that this works better than using the taiyaki prepare_mapped_reads.py script as a training frontend. This is the pipeline I'm using to fine-tune the dna_r9.4.1 model given a set of known "ground truth" sequences.

# index ground truth sequences
minimap2 -x map-ont -d ground_truth_reads.mmi ground_truth_reads.fa

# basecall and save ctc training data
bonito basecaller dna_r9.4.1 --reference ground_truth_reads.mmi --save-ctc fast5s-for-ground-truth-reads > bonito_calls.sam

# move the training data (*.npy files, written to the working directory) into a new directory
mkdir my-training-data
mv *.npy my-training-data

# fine-tune the original model with the new data
bonito train --epochs 1 --lr 5e-4 --pretrained dna_r9.4.1 --directory my-training-data my-fine-tuned-model
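Before kicking off training, it can be worth sanity-checking the arrays the basecall step wrote (a sketch based on my reading of this thread; the dtypes and the 0-4 encoding are assumptions, not verified against the bonito source):

```python
import numpy as np

def check_training_data(directory):
    """Basic consistency checks on the --save-ctc output arrays."""
    chunks = np.load(f"{directory}/chunks.npy")
    refs = np.load(f"{directory}/references.npy")
    lens = np.load(f"{directory}/reference_lengths.npy")
    assert chunks.shape[0] == refs.shape[0] == lens.shape[0], "row counts differ"
    assert chunks.shape[1] == 4000, "expected 4000-sample chunks"
    assert lens.max() <= refs.shape[1], "a target length exceeds the padded width"
    assert refs.max() <= 4, "targets should be encoded 0-4"
    print(f"{chunks.shape[0]} chunks, max target length {lens.max()}")

# Usage, after the pipeline above:
# check_training_data("my-training-data")
```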

Thanks again.

liushuang1001 commented 7 months ago

Hi, can you tell me whether the chunks.npy, references.npy, and reference_lengths.npy files are generated in the my-training-data folder when I run step 2? When I run it, I get `bash: my-training-data: Is a directory`.