nanoporetech / bonito

A PyTorch Basecaller for Oxford Nanopore Reads
https://nanoporetech.com/

how to best fine tune a model with different references #386

Closed rainwala closed 7 months ago

rainwala commented 8 months ago

Hello,

I'm trying to fine-tune a model for reads containing large repeat expansions (>500 hexamer repeats), where the standard model hallucinates biologically impossible repeats. The issue is that the number of repeats varies across the dataset (i.e. mosaicism). I've had some success taking a subset of the reads that cluster around a specific repeat number, creating a synthetic reference based on that number of repeats, and fine-tuning a standard model as described here: https://github.com/nanoporetech/bonito/issues/137#issuecomment-809739431
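For concreteness, here's a minimal sketch of that synthetic-reference step, assuming a known hexamer unit and flanking sequence (repeat_unit, the flanks, and n_repeats below are placeholders, not values from my data):

repeat_unit = "GGGGCC"             # example hexamer; substitute the real unit
left_flank = "ACGTACGTACGTACGT"    # placeholder upstream flanking sequence
right_flank = "TTGCAATTGCAATTGC"   # placeholder downstream flanking sequence
n_repeats = 550                    # repeat number the read cluster centres on

# write a single-record FASTA to use as the fine-tuning reference
with open("synthetic_ref.fasta", "w") as out:
    out.write(f">synthetic_{n_repeats}_repeats\n")
    out.write(left_flank + repeat_unit * n_repeats + right_flank + "\n")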

However, this approach only uses a subset of the data. I can cluster all the reads around different repeat lengths and create a synthetic reference for each cluster, but what is the best way to fine-tune using that dataset? Should I perform multiple sequential fine-tunings, or is there a way to combine the data and do a single fine-tuning?

rainwala commented 7 months ago

Never mind, I think I've found a way forward.

rainwala commented 7 months ago

Actually, it turns out I've only figured out how to combine the "chunks.npy" files (using numpy.vstack) from several basecaller calls with different references. I couldn't figure out how to combine "references.npy" or "reference_lengths.npy", as they all have different shapes. Any thoughts?

rainwala commented 7 months ago

I figured out how to do it after looking at what references.npy and reference_lengths.npy actually represent.
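For anyone curious, here's roughly what the three --save-ctc outputs hold (shapes as I understand them from inspecting the files; treat the comments as my reading, not official documentation):

import numpy as np

chunks = np.load("chunks.npy")               # (n_chunks, chunk_len): raw signal chunks
refs = np.load("references.npy")             # (n_chunks, max_ref_len): integer-encoded
                                             # target sequences, zero-padded per row
ref_lens = np.load("reference_lengths.npy")  # (n_chunks,): true length of each row
print(chunks.shape, refs.shape, ref_lens.shape)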

alexweisberg commented 6 months ago

Hi @rainwala, I was wondering if you would be willing to share your scripts for combining each of these files from multiple runs? We are running into similar issues. Thanks!

alexweisberg commented 6 months ago

Never mind, I did some sleuthing and figured it out. As you had mentioned, it took some digging into what each of these arrays means. For anyone else trying to do this in the future:

import numpy

call1chunks = numpy.load("./call1/chunks.npy")
call2chunks = numpy.load("./call2/chunks.npy")
call3chunks = numpy.load("./call3/chunks.npy")
chunksmerge = numpy.vstack((call1chunks, call2chunks, call3chunks))
numpy.save("merged/chunks.npy", chunksmerge)

call1ref = numpy.load("./call1/references.npy")
call2ref = numpy.load("./call2/references.npy")
call3ref = numpy.load("./call3/references.npy")

longest = max(call1ref.shape[1], call2ref.shape[1], call3ref.shape[1])
call1refpad = numpy.array([numpy.pad(row, (0, longest - len(row))) for row in call1ref])
call2refpad = numpy.array([numpy.pad(row, (0, longest - len(row))) for row in call2ref])
call3refpad = numpy.array([numpy.pad(row, (0, longest - len(row))) for row in call3ref])
stackpadded = numpy.vstack((call1refpad, call2refpad, call3refpad))
numpy.save("merged/references.npy", stackpadded)

call1reflen = numpy.load("./call1/reference_lengths.npy")
call2reflen = numpy.load("./call2/reference_lengths.npy")
call3reflen = numpy.load("./call3/reference_lengths.npy")
mergedreflen = numpy.concatenate((call1reflen, call2reflen, call3reflen))
numpy.save("merged/reference_lengths.npy", mergedreflen)
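A quick consistency check on the merged arrays (assuming the layout above) can catch mismatched row counts before training:

import numpy

chunks = numpy.load("merged/chunks.npy")
refs = numpy.load("merged/references.npy")
ref_lens = numpy.load("merged/reference_lengths.npy")

# every chunk should have exactly one reference row and one length entry
assert chunks.shape[0] == refs.shape[0] == ref_lens.shape[0]
# no stored length should exceed the padded width
assert int(ref_lens.max()) <= refs.shape[1]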

rainwala commented 6 months ago

Hi @alexweisberg, I had been meaning to get back to you; your original message came in around dinner time for me (I'm in London), and when I woke up I saw that you'd figured it out! I was happy when I figured it out myself too.

For what it's worth, here's my code, to be run after the first bonito step with --save-ctc has already been run into several subdirectories of "base_data_dir":

import numpy as np
import sys
import glob

base_data_dir = sys.argv[1]

## combine the chunk arrays
chunks_filename = 'chunks'
chunks_pattern = f"{base_data_dir}/*/{chunks_filename}.npy"
# sort the glob results so chunks, lengths, and references stay row-aligned
chunks_array = np.vstack([np.load(file_path) for file_path in sorted(glob.glob(chunks_pattern))])
np.save(f"{base_data_dir}/{chunks_filename}.npy", chunks_array)

## combine the reference_lengths arrays
ref_len_filename = 'reference_lengths'
ref_len_pattern = f"{base_data_dir}/*/{ref_len_filename}.npy"
ref_len_array = np.concatenate([np.load(file_path) for file_path in sorted(glob.glob(ref_len_pattern))])
np.save(f"{base_data_dir}/{ref_len_filename}.npy", ref_len_array)

## combine the reference arrays
ref_filename = 'references'
ref_pattern = f"{base_data_dir}/*/{ref_filename}.npy"
old_ref_array = [np.load(file_path) for file_path in sorted(glob.glob(ref_pattern))]
max_ref_len = int(np.max([x.shape[1] for x in old_ref_array]))

ref_array = []
for old_array in old_ref_array:
    # zero-fill the new array to 'max_ref_len' in the second dimension
    new_array = np.zeros((old_array.shape[0], max_ref_len), dtype=old_array.dtype)
    new_array[:, :old_array.shape[1]] = old_array
    ref_array.append(new_array)
ref_array = np.vstack(ref_array)
np.save(f"{base_data_dir}/{ref_filename}.npy", ref_array)

rainwala commented 6 months ago

@alexweisberg what are you working on BTW?

alexweisberg commented 6 months ago

@rainwala thank you! It ended up being fun figuring out what each of the components meant, so it was good practice. Thanks for sharing your code too.

I am working on bacteria, trying to see if we can move to using only Nanopore reads for assembly and SNP calling. The latest basecallers are very close (10-50 assembly errors per genome), but I wanted to see if I could improve on that by fine-tuning a model with one of our datasets that has both Illumina and Nanopore reads.

It was a little surprising that bonito really struggled to basecall a single barcode from a 96-sample P2 Solo run. It had about 1.5 million reads and I had to split it into three parts. We were running it on a machine with an NVIDIA 6000 Ada, but it never used more than ~10 GB of RAM and still crashed with 'Killed'.

alexweisberg commented 6 months ago

@iiSeymour I was wondering if there is a recommended strategy for selecting input data for fine-tuning a preexisting model? I fine-tuned the latest 4.3.0 sup model with my additional training data, however when I basecall another sample with dorado using the fine-tuned model, the error rate is really poor (30%) relative to the standard, un-finetuned model.

The strategy I used was to take a bacterial DNA dataset sequenced with both Nanopore and Illumina. The Nanopore reads have an N50 of 7519 bp. I used the hybrid-assembled (Unicycler) genome assembly as the reference for bonito basecaller, with the recommended options from the README, and the Nanopore reads:

bonito basecaller dna_r10.4.1_e8.2_400bps_sup@v4.3.0 --save-ctc --reference CG440_hybrid.fna ./CG440_nano_reads1/ > calls1/basecalls.sam

The dataset was split into three parts and basecalled separately, then merged using the scripts described above. I then used bonito train to fine-tune the model:

bonito train --epochs 1 --lr 5e-4 --pretrained dna_r10.4.1_e8.2_400bps_sup@v4.3.0 --directory ./merged/ ./dna_fine-tuned

bonito export --format dorado ./dna_fine-tuned

Then I basecalled a separate strain using dorado 0.5.0. The resulting read quality was very poor and the reads could not be assembled, whereas basecalling with the standard model yields reads that assemble into a genome with only 44 errors.

rainwala commented 6 months ago

Hi @alexweisberg, that sounds interesting! I've been working on projects where we use Nanopore reads together with Illumina reads for fungal genome assembly, but never ONT reads by themselves. That said, there are protocols these days even for de novo human assembly using only ONT reads, but they seem to require multiple flowcells.

In terms of fine-tuning the basecaller, perhaps you might have more joy just waiting for the "high-duplex" PromethION flowcells ONT is supposed to be releasing soon, where ~70% of the reads will be duplex vs. the ~10-20% we get on regular flowcells. That should bump up the per-read accuracy significantly.

iiSeymour commented 6 months ago

We were running it on a machine with an NVIDIA 6000 Ada but it never used more than ~10 Gb of RAM and still crashed with 'Killed'.

@alexweisberg 'Killed' is the system out-of-memory killer and not related to the GPU. This is because, when generating training data with --save-ctc, the chunks are simply accumulated in memory and serialised once at the end of the run.
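One workaround, along the lines of what you did, is to split the input reads into batches, run --save-ctc per batch, and merge with the scripts above. A hypothetical sketch (the directory layout and .pod5 extension are assumptions, not from this thread):

import glob
import os

src = "reads"      # directory holding all the read files
n_batches = 3

# round-robin the read files into n_batches batch directories
files = sorted(glob.glob(os.path.join(src, "*.pod5")))
for i, path in enumerate(files):
    batch_dir = f"{src}_batch{i % n_batches}"
    os.makedirs(batch_dir, exist_ok=True)
    # symlink instead of copy to avoid duplicating large signal files
    os.symlink(os.path.abspath(path), os.path.join(batch_dir, os.path.basename(path)))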

Your fine-tuning strategy is correct. I am almost certain your model is just not being loaded/run correctly in dorado; can you basecall some data in bonito first and check the error rate?

Are you using native DNA (not PCR)?

You may already know (but just in case): since you are using Unicycler, you might want to try https://github.com/rrwick/Trycycler and check @rrwick's blog https://rrwick.github.io/

alexweisberg commented 6 months ago

Thank you @iiSeymour. We tried basecalling the test dataset with v4.3.0 and with v4.3.0_finetuned, in both dorado and bonito, and got the same results with both. With the standard v4.3.0 sup model, both dorado and bonito produce reads around Q20. With the fine-tuned model, both produce reads with <Q8 quality, so it doesn't seem to be a bonito vs. dorado thing. We trained the fine-tuned model using an Agrobacterium strain and tested it on a different Agrobacterium strain.

Length vs. quality plots (images in the original thread): bonito v4.3.0 sup; bonito v4.3.0 fine-tuned; dorado v4.3.0 sup; dorado fine-tuned.

Thanks also for the heads-up on Trycycler. It is a powerful tool, but we often work with hundreds of strains, so we wanted something more automated. We tend to use flye (+ polypolish when we have Illumina reads), or unicycler when we have hybrid data and low coverage of Nanopore reads.