nanoporetech / taiyaki

Training models for basecalling Oxford Nanopore reads
https://nanoporetech.com/

Speeding up prepare_mapped_reads #73

Closed mbhall88 closed 4 years ago

mbhall88 commented 4 years ago

Hi,

I have been having trouble getting prepare_mapped_reads.py to run in a "reasonable" amount of time. By "reasonable" I mean before my cluster kills the job after it has run for a week.
Throwing more CPUs at it doesn't seem to have much effect either.

Are you able to suggest a way I could speed this up? I have ~1.3 million reads. Should I be more hardcore with my filtering? Or is there a way I can split this up and merge the resulting HDF5 files at the end?

aevansNP commented 4 years ago

1.3 million reads is a big data set - we haven't observed any improvement in performance beyond tens of thousands of reads.

In our current implementation the whole training set is held in memory during training, so it's not possible to train from a set this large unless you select a small portion of it (using the input_strand_list or limit options).
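One way to select such a portion is to write a random subset of read IDs to a strand-list file. The sketch below assumes a strand list is a plain text file with a read_id header followed by one read ID per line — verify the exact format against your taiyaki version before relying on it.

```python
import random

def write_strand_list(read_ids, path, n, seed=42):
    """Write a random subset of read IDs as a strand-list file.

    Assumes the format is a 'read_id' header plus one read ID per
    line (check your taiyaki version's documentation).
    """
    random.seed(seed)
    subset = random.sample(read_ids, min(n, len(read_ids)))
    with open(path, "w") as fh:
        fh.write("read_id\n")
        fh.writelines(rid + "\n" for rid in subset)
    return subset

# Hypothetical usage: keep 50k of ~1.3M reads for training
# subset = write_strand_list(all_read_ids, "subset_strands.txt", 50_000)
```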

There is a script

taiyaki/misc/merge_mapped_signal_files.py

which can be used to join mapped signal files together, so you can process a large set in batches and join them together before training.
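A batch-and-merge workflow might look like the sketch below: split the read IDs into per-batch strand lists, run prepare_mapped_reads.py once per batch (e.g. as parallel cluster jobs), then join the outputs with misc/merge_mapped_signal_files.py. Only the splitting step is shown as runnable code; the two taiyaki invocations are indicated in comments, and their exact CLI may differ by version.

```python
def split_into_batches(read_ids, batch_size):
    """Yield (batch_index, chunk) pairs of at most batch_size reads."""
    for i in range(0, len(read_ids), batch_size):
        yield i // batch_size, read_ids[i:i + batch_size]

def write_batch_strand_lists(read_ids, batch_size, prefix="batch"):
    """Write one strand-list file per batch; return the file paths.

    Assumes a strand list is a 'read_id' header plus one ID per line
    (verify against your taiyaki version).
    """
    paths = []
    for idx, chunk in split_into_batches(read_ids, batch_size):
        path = f"{prefix}_{idx:03d}.txt"
        with open(path, "w") as fh:
            fh.write("read_id\n")
            fh.writelines(rid + "\n" for rid in chunk)
        paths.append(path)
    return paths

# Then, per batch (these can run as separate cluster jobs):
#   prepare_mapped_reads.py ... --input_strand_list batch_000.txt ... batch_000.hdf5
# and finally (argument order may vary by version):
#   misc/merge_mapped_signal_files.py merged.hdf5 batch_*.hdf5
```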

marcus1487 commented 4 years ago

Not sure if this is your use case, but improvements have been observed using 100s of thousands of reads when your reads are all very short (<200 bases). As @aevansNP notes, for standard nanopore reads (>1000 bps) this scale of data is not necessary to train a robust model.

mbhall88 commented 4 years ago

Interesting. My use case is a Mycobacterium tuberculosis model. The read lengths fall somewhere between your two noted ranges: the median is generally ~3-4 kbp, but we do see quite a lot of reads below 1 kbp.

Even so, it sounds like 1.3 million reads is a bit extreme. I will do some heavy filtering. Also great to know about the merge script - thanks @aevansNP

marcus1487 commented 4 years ago

@mbhall88 An alternative here would be to use megalodon with --outputs signal_mappings. This feature is not yet well documented and will give signal mappings that are quite different from taiyaki's (due to the methods involved), but should be much quicker (not much slower than standard guppy basecalling). See megalodon --help-long for more options controlling the signal_mappings output.

mbhall88 commented 4 years ago

Thanks @marcus1487 .
When you say "signal mappings that are quite different from taiyaki" what do you mean? Less or more accurate?

In the meantime, I have randomly subsampled my reads down to a more manageable amount, and I'll use the rest for evaluation.

marcus1487 commented 4 years ago

Accuracy is hard to measure here. Taiyaki performs a squiggle space Viterbi mapping through the posteriors produced by the basecaller. Megalodon performs mappings by retaining the link between the signal, through basecalls and onto the reference mapping. Thus Megalodon mappings will contain any skipped bases in the base space mapping, while the Viterbi mapping done by Taiyaki will force all bases to be assigned to a segment of signal. These are different, but assessing accuracy of these signal assignment methods is difficult.
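The "force all bases to be assigned" property can be illustrated with a toy monotone (Viterbi-style) alignment — this is only a teaching sketch, not taiyaki's actual implementation. Given per-frame log-posteriors over bases, a monotone path from the first to the last base necessarily assigns every base at least one contiguous signal segment, whereas a basecall-anchored mapping (megalodon's approach) can leave bases with no signal at all.

```python
import numpy as np

def forced_segmentation(logpost):
    """Toy forced alignment: assign every base a contiguous, non-empty
    signal segment, maximising the summed log-posterior.

    logpost: (T frames, B bases) array of log P(base b | frame t).
    Requires T >= B. Returns a length-T list mapping frame -> base.
    """
    T, B = logpost.shape
    NEG = -np.inf
    score = np.full((T, B), NEG)
    back = np.zeros((T, B), dtype=int)
    score[0, 0] = logpost[0, 0]
    for t in range(1, T):
        for b in range(B):
            stay = score[t - 1, b]                       # keep emitting base b
            move = score[t - 1, b - 1] if b > 0 else NEG  # advance to next base
            if stay >= move:
                score[t, b] = stay + logpost[t, b]
                back[t, b] = b
            else:
                score[t, b] = move + logpost[t, b]
                back[t, b] = b - 1
    # Trace back from the last base at the last frame, so the path is
    # forced to start at base 0 and end at base B-1 with unit steps:
    # every base therefore gets at least one frame.
    path = [B - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [int(b) for b in path[::-1]]
```

Because the path moves by steps of 0 or +1 from base 0 to base B-1, no base can be skipped — which is exactly the contrast with a basecall-anchored mapping.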

organic-chemistry commented 3 years ago

Hello, I tried to use your workaround

megalodon h5rep --outputs signal_mappings --reference myref.fa --processes 6 --do-not-use-guppy-server --taiyaki-model-filename ../taiyaki/models/mLstm_flipflop_model_r941_DNA.checkpoint --device cuda:0

It works, but I get a lot of: "Likely out of memory error. See log for details." (on about 10% of my reads). I tried running it with only one process, and on CPU only, but it does not change anything.

The log shows this message, which is a bit strange:

Likely out of memory error: Input type (torch.cuda.DoubleTensor) and weight type (torch.cuda.FloatTensor) should be the same --- ReadWorker000-MainThread backends.py:747 DBG 17:09:54 : 216da95e-5512-45b1-80a9-854879353f17 Failed Likely out of memory error. See log for details
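Note that the quoted message looks like a dtype mismatch rather than actual memory exhaustion: the read's signal reached the network as 64-bit floats (DoubleTensor) while the model weights are 32-bit (FloatTensor), and megalodon appears to report this under its generic "likely out of memory" label. The generic remedy (an assumption, not a megalodon patch) is to cast the input to single precision before inference — in PyTorch that would be `signal_tensor.float()`. A minimal numpy sketch of the same idea:

```python
import numpy as np

# Scaling raw samples in numpy silently promotes them to float64,
# which is how a DoubleTensor ends up meeting float32 weights.
raw_signal = np.arange(4000) / 1234.5       # hypothetical scaled signal
assert raw_signal.dtype == np.float64

# Cast to single precision to match the model weights before inference.
net_input = raw_signal.astype(np.float32)
assert net_input.dtype == np.float32
```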

Do you have any idea? Thank you

organic-chemistry commented 3 years ago

Also, is there a way to do as with prepare_mapped_reads.py and give it a FASTA with a list of reads, each having its sequence already aligned? Best