nanoporetech / duplex-tools

Splitting of sequence reads by internal adapter sequence search

slow pairing process #9

Closed · olawa closed this issue 1 year ago

olawa commented 2 years ago

I am trying to re-basecall a large set of cDNA reads (~200M) with duplex, but it is just too slow. I am guessing I get too many candidates since many reads will be of similar length (1M candidates from 10M reads, LSK109). This is guppy_duplex with Guppy 6.1.1.

A few questions:

1) Are you requiring reads to be from the same pore, or is pairing based on time alone?
2) read_all_fastq seems to be the slowest part, perhaps due to disk I/O. I am not sure exactly what this function does, but could it be sped up, e.g. if reads were split by pore or sorted by genomic location?
3) Some of the runs are barcoded, so pairing needs to be done per sample. Can guppy_duplex be set to split reads by barcode before pairing, or would it perhaps be better to let guppy_barcoder split the fastq and fast5 files before doing the duplex calling?

olawa commented 2 years ago

Ok, it looks like I was using ont_guppy_duplex_pipeline, which appears not to be using the same code as duplex-tools. It looks like it is not multi-threaded at this step. Running duplex-tools now and it is faster at reading the fastq.

cjw85 commented 2 years ago

Hi @olawa,

Unfortunately I can't comment on ont_guppy_duplex_pipeline. That might be one for @fbrennen.

To explain briefly: duplex-tools first finds candidate pairs of reads that were read consecutively in a channel and are about the same length (I think within 10% of the length of the longer read; @ollenordesjo can confirm). The second step performs an alignment of the end of the first read in a putative pair against the start of the second read. If the alignment score is above a threshold, the reads are deemed a true pair and the pair is output for duplex calling.
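Roughly, the second step does something like the following (a sketch only: the aligner, window size and divergence threshold here are illustrative assumptions, not our exact code or defaults):

```python
import edlib  # any pairwise aligner would do; shown here purely for illustration

COMP = str.maketrans("ACGTN", "TACGN")

def looks_like_pair(template_seq, complement_seq, bases_to_align=250, max_divergence=0.2):
    """Align the tail of the template read against the reverse complement of the
    head of the candidate complement read; accept if they are similar enough.
    bases_to_align and max_divergence are illustrative values only."""
    tail = template_seq[-bases_to_align:]
    head_rc = complement_seq[:bases_to_align].translate(COMP)[::-1]
    result = edlib.align(tail, head_rc, mode="NW", task="distance")
    return result["editDistance"] <= max_divergence * max(len(tail), len(head_rc))
```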

This setup was designed around genomic DNA fragments, and I think you are correct in suggesting that it may produce false positive pairs for cDNA reads.

On your third question, there is an option --match_barcodes which requires the barcodes detected by guppy_basecaller/guppy_demultiplexer (as recorded in the sequencing summary file) to match before a pair is created.

ollenordesjo commented 2 years ago

@cjw85, correct: with default settings it is required that the length difference between the paired reads is less than 10% of the longer read or 200 bp, whichever is smaller. It's likely that the 10% number is going to be the more stringent one, so if you happen to end up with a low number of reads, you can adjust this value (https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/pairs_from_summary.py#L310). You may also want to adjust the min_qscore value if the initial basecall was not done with SUP (https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/pairs_from_summary.py#L320).
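In code, the default criterion is roughly the following (a paraphrase of the rule as stated here, not the exact implementation behind those links):

```python
def lengths_compatible(len_template, len_complement,
                       max_rel_diff=0.1, max_abs_diff=200):
    """Length filter as described above: the difference must be below 10% of the
    longer read or 200 bp, whichever is smaller (defaults as stated in this thread)."""
    allowed = min(max_rel_diff * max(len_template, len_complement), max_abs_diff)
    return abs(len_template - len_complement) < allowed

# e.g. a 50 kb template only pairs with reads within 200 bp of its length,
# while a 1 kb template can pair with anything within 100 bp.
assert lengths_compatible(50_000, 49_900)
assert not lengths_compatible(1_000, 1_150)
```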

olawa commented 2 years ago

Thanks @cjw85 and @ollenordesjo, I have tried it a bit more, also with gDNA now, and it is giving some high-quality reads even with old R9.4 data. The pairing is working well with duplex-tools but takes days for the guppy_duplex script in the current guppy release (it is at fastq 1340/1700 after a day, while the same dataset took only minutes to pair with duplex-tools), so I hope guppy_duplex can be updated. Feel free to close this issue.

ollenordesjo commented 2 years ago

> Thanks @cjw85 and @ollenordesjo, I have tried it a bit more, also with gDNA now, and it is giving some high-quality reads even with old R9.4 data. The pairing is working well with duplex-tools but takes days for the guppy_duplex script in the current guppy release (it is at fastq 1340/1700 after a day, while the same dataset took only minutes to pair with duplex-tools), so I hope guppy_duplex can be updated. Feel free to close this issue.

Thanks for the feedback on this @olawa! I'll get in touch with the guys behind guppy_duplex and see if we can speed it up.

Cheers

olawa commented 2 years ago

The guppy_duplex run did finish eventually and produced more duplex reads (~30 Mb vs 25 Mb) for my dataset, so I guess there is more that could be done to optimize the pairing. Regarding the 10%/200 bp hard cut-off: for ultralong reads you may perhaps also want to include cases where the complement is significantly shorter, due e.g. to single-strand nicks in the DNA.

I think the pairing process in duplex_tools is supposed to be multithreaded, but it looks like it is only using a single core. It runs much faster (from HDD) if changed to ThreadPoolExecutor instead of ProcessPoolExecutor (10 min vs ~2 h). Also, working with a concatenated fasta is much faster than using the 5000+ individual fastqs.
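For reference, this is a self-contained sketch of the comparison I mean; the fastq parsing, file layout and worker count are placeholders rather than the actual duplex_tools code:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from pathlib import Path

BASES_TO_KEEP = 250  # illustrative value, not the tool's default

def tail_of_reads(path, n=BASES_TO_KEEP):
    """Return {read_id: last n bases} for one fastq file (minimal 4-line-record parser)."""
    ends, name = {}, None
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:
                name = line[1:].split()[0]
            elif i % 4 == 1:
                ends[name] = line.strip()[-n:]
    return ends

def gather(files, executor_cls=ThreadPoolExecutor, workers=8):
    """The work is I/O-bound, so threads avoid the process start-up and pickling
    overheads that presumably make ProcessPoolExecutor the slower choice here."""
    ends = {}
    with executor_cls(max_workers=workers) as ex:
        for part in ex.map(tail_of_reads, files):
            ends.update(part)
    return ends

if __name__ == "__main__":
    fastqs = sorted(Path("fastq_pass").glob("*.fastq"))  # hypothetical location
    seq_ends = gather(fastqs)  # swap in ProcessPoolExecutor to compare timings
```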

The next bottleneck appears to be finding the raw data; maybe you could use the fact that fast5 and fastq files (should) have the same basename to build an index for the pairs.

ollenordesjo commented 2 years ago

Hi @olawa, yes, I think for ultra-long reads it may be suitable to change the 10% cutoff to something like 25%. I do know that the duplex basecalling works better if the reads are of approximately the same length, so that is one reason to be conservative here. No reason not to play around though; it is definitely possible that being less stringent can recover more data of sufficient quality.

Thanks for noting the multithreading behaviour. I'm wondering if the ThreadPoolExecutor works much faster due to the large number of files? Probably worth taking a look at this.

Regarding finding the raw data, it should be possible to speed things up by first running https://github.com/nanoporetech/ont_fast5_api#fast5_subset with a --read_id_list containing a read_id column to extract the relevant raw data to begin with. Regarding the basename, I agree that this could be a solution, but it would not work in general if reads have been shuffled between fast5s, which is unfortunately always a possibility.
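As a small helper for that route, something like this turns a pair list into the read_id list that fast5_subset expects; the pair file name is an assumption, so check what your duplex_tools version actually writes:

```python
# Hypothetical filenames; the pairs file is assumed to hold two read ids per line.
read_ids = set()
with open("pair_ids_filtered.txt") as fh:
    for line in fh:
        read_ids.update(line.split())

with open("read_id_list.txt", "w") as out:
    out.write("\n".join(sorted(read_ids)) + "\n")
# read_id_list.txt can then be passed to fast5_subset via --read_id_list
```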

olawa commented 2 years ago

Not sure what causes the difference, but it was much faster to run with a single concatenated file. I was able to get good pairings for R9.4 cDNA eventually; not many (~80k from >10M), but the quality is great. I shifted the extracted sequence by 50 bases to avoid adapter sequence and used 400 bp to get reliable pairs. From eyeballing a few reads, the second strand is often truncated at the end, so it might help to look at a larger window, but I guess that should be less of an issue with the new kits.

Thanks, will try the fast5_subset for the promethion runs.

TonyBolger commented 2 years ago

> Ok, it looks like I was using ont_guppy_duplex_pipeline, which appears not to be using the same code as duplex-tools. It looks like it is not multi-threaded at this step. Running duplex-tools now and it is faster at reading the fastq.

I know it's not actually this project, but there's a fairly trivial fix that gives a massive performance improvement in the pair-filtering part of the duplex pipeline (e.g. 89 hours down to 16 minutes for one mid-size PromethION run).

The problem is not the fastq parsing itself, but the read-name lookup, which uses lists rather than sets:

```python
temp_read_ids = list(candidate_pairs["read_id"])
comp_read_ids = list(candidate_pairs["read_id_next"])
```

then

```python
for file in seq_files:
    fastx = pyfastx.Fastx(str(file))
    for name, sequence, _, _ in fastx:
        if name in temp_read_ids:
            seq_ends[(name, 0)] = str(sequence[-bases_to_align:])
        if name in comp_read_ids:  # a read can be in both
            seq_ends[(name, 1)] = reverse_complement(str(sequence[:bases_to_align]))
```

These read_id lists are >100k entries, so lookup time matters.

Here's the diff - candidate_filtering.py.diff.txt
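For a rough sense of scale, a toy micro-benchmark of list vs set membership at ~100k ids (timings are purely indicative and this is not the pipeline code itself):

```python
import random
import string
import time

ids = ["".join(random.choices(string.hexdigits, k=36)) for _ in range(100_000)]
as_list, as_set = list(ids), set(ids)
probe = ids[::97]  # ~1k lookups, all of them hits

t0 = time.perf_counter()
hits_list = sum(name in as_list for name in probe)
t1 = time.perf_counter()
hits_set = sum(name in as_set for name in probe)
t2 = time.perf_counter()
print(f"list: {t1 - t0:.3f}s  set: {t2 - t1:.5f}s  (hits equal: {hits_list == hits_set})")
```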

fbrennen commented 2 years ago

Hi @TonyBolger -- thanks very much for the suggestion. We'll have a look at it right away.

mattloose commented 2 years ago

Hi @fbrennen - this still doesn't appear to be fixed in the pipeline as of today; is it going to be fixed? We reduced a wait time of >9 hours to a few minutes by independently implementing the same fix as @TonyBolger proposes above.

iiSeymour commented 2 years ago

Thanks for flagging this @mattloose - @TonyBolger's fix was taken but we haven't pushed a release (sorry!).

@MarkBicknellONT is working on getting a release sorted.

mattloose commented 2 years ago

Thanks - guppy_basecaller_duplex seems to have a similar issue when it loads up. When given a pairing list of 775k entries, it has been running for at least an hour and hasn't even started calling yet. Are there any similar fixes there? (Version 6.0.6+8a98bbcbd)

MarkBicknellONT commented 2 years ago

Hi @mattloose - the long pause at the beginning of processing in guppy_basecaller_duplex is actually a slightly different process. When the duplex caller starts up, it needs to locate all the read IDs that have been provided in the pair list. It does this by indexing the input FAST5 files to build up a read_id-to-filename mapping. For large datasets this can take a very long time, especially on network storage. We are currently working on integrating pod5 into the Guppy tool suite, which will hopefully bring a significant performance improvement to this step. There are also algorithmic improvements to the indexing operation which we can implement to allow duplex calling to begin as soon as the first pair has been fully located, instead of waiting for the whole pair list to be processed before calling starts.

mattloose commented 2 years ago

Thanks @MarkBicknellONT - for info this is running on the SSD on promethion.

There is no way it should be this slow. All the index information required is in the sequencing_summary files that were used to generate the simplex calls; it should be possible to parse those in a matter of seconds?
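Something along these lines (column names vary between Guppy versions, so treat them as assumptions) would build the whole index in one pass:

```python
import pandas as pd

# Hypothetical path; the fast5 filename column is "filename" in older summaries
# and "filename_fast5" in newer ones, so handle both.
summary = pd.read_csv("sequencing_summary.txt", sep="\t", low_memory=False)
fast5_col = "filename_fast5" if "filename_fast5" in summary.columns else "filename"
read_to_fast5 = dict(zip(summary["read_id"], summary[fast5_col]))

# Resolving the fast5 file for every read in a 775k-pair list is then one dict
# lookup per read instead of a scan over thousands of fast5 files.
print(len(read_to_fast5), "reads indexed")
```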

MarkBicknellONT commented 2 years ago

We could certainly parse the simplex sequencing summary if it is available, which would be much faster than reading the ids from the FAST5 files. There are two (perfectly solvable) issues with this approach:

Anyway - you are quite correct, we can definitely improve things for the common use case here.

mattloose commented 2 years ago

For info - 6504 files containing 775,000 pairs of reads took just under 2 hours to index on the PromethION SSD array. Now just waiting for the basecalling itself to complete :-)

MarkBicknellONT commented 2 years ago

Guppy 6.1.7 has been released this morning, with @TonyBolger's fix to the duplex pairing performance. Thanks for the suggestion! Expect a release note on the community shortly.

olawa commented 1 year ago

Hi @ollenordesjo, I updated duplex_tools and it seems the current version still uses ProcessPoolExecutor. Changing this to ThreadPoolExecutor again gave a significant speedup for fastq reading (f00 files), even from SSD. Is there any reason not to use it?

onordesjo commented 1 year ago

Hi @olawa, I don't think there's any reason to not use it. I'll look into it and get it updated unless we see any issues with it. Thanks for poking!

onordesjo commented 1 year ago

Hi again @olawa, just sorted this in v0.2.17, feel free to try it out!