sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

is `sourmash sketch translate` really slow on long sequences? #1746

Closed: ctb closed this issue 2 years ago

ctb commented 3 years ago

I'm seeing that sourmash sketch translate is taking a Very Long Time - minutes - on a genome-sized data set. But I seem to recall it runs reasonably fast on read data sets. Maybe I misremember? Anyway, putting this here to remind myself to look into it, and/or to get feedback from @bluegenes.

ctb commented 3 years ago

see https://github.com/NCBI-Codeathons/bothie/pull/35/

ctb commented 3 years ago
% time sourmash sketch translate ../bothie/data/genome-s10.fa.gz

== This is sourmash version 4.2.3.dev23+g46ee6550. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

computing signatures for files: ../bothie/data/genome-s10.fa.gz
Computing a total of 1 signature(s).
... reading sequences from ../bothie/data/genome-s10.fa.gz
calculated 1 signatures for 1 sequences in ../bothie/data/genome-s10.fa.gz
saved signature(s) to genome-s10.fa.gz.sig. Note: signature license is CC0.

real    20m8.894s
user    19m58.022s
sys     0m3.855s

20 minutes to run on a 500kb long sequence!?

ctb commented 3 years ago

(file is available in the sourmash source tree as tests/test-data/genome-s10.fa.gz)

ctb commented 2 years ago

Sketching GCF_018389385.1_ASM1838938v1_genomic.fna took something like 16 hours using default options!?

bluegenes commented 2 years ago

Yes, I am noticing that small genomes are slow but manageable, while large genomes are pretty intractable.

(This was at scaled=1.) Of the ~9000 genomes in my comparison set, ~7500 finished in <=4 hours (low queue limit), most in <=30 minutes. I allowed 4 days when running the remaining ones, which I thought would be overkill, but the jobs timed out.

I decided to abandon scaled=1 in favor of more reasonable scaled values, but we're still at >20 hours for most of these genomes.

sizes currently running:

2.8M  GCF_001767235.1_genomic.fna.gz
2.7M GCF_000828075.3_genomic.fna.gz
2.7M GCF_003751225.1_genomic.fna.gz
2.9M GCF_000014905.1_genomic.fna.gz
2.9M GCF_014205215.1_genomic.fna.gz
2.9M GCF_011764445.1_genomic.fna.gz
3.1M GCF_001189295.1_genomic.fna.gz
3.2M GCF_000568255.1_genomic.fna.gz
2.8M GCF_000255295.1_genomic.fna.gz
3.4M GCF_001263205.1_genomic.fna.gz
3.5M GCF_000067165.1_genomic.fna.gz
2.8M GCF_000504285.1_genomic.fna.gz
2.6M GCF_900109545.1_genomic.fna.gz
2.6M GCF_000024805.1_genomic.fna.gz

I've also allowed 4 days ... can post runtime information by genome size when they finish.

These genomes are available in /home/ntpierce/2021-rank-compare/genbank/genomes

bluegenes commented 2 years ago

I'll note that for genomes, using prodigal to find ORFs and then sketching the proteins is actually (much) faster at the moment 😬

luizirber commented 2 years ago

It is SeqToHashes: [profiling screenshot]. I need to take a closer look to see where it is creating too many small allocations.

bluegenes commented 2 years ago

@luizirber, thanks for getting us a bit closer! Helpfully, you left a note at the beginning of the translate code: https://github.com/sourmash-bio/sourmash/blob/13bf0d55b5c9c49b0702f223e22617585eff33bb/src/core/src/signature.rs#L294-L295

@mr-eyes and I looked through the code today, and ran a couple of small checks. We think the issue (or part of the issue) is with lines 314-315 / 333-335 (the same code). After translating a single frame, we hash all protein kmers for that frame and store them in a buffer. For long sequences (genomes), this can be quite a lot of hashes. This intuitively makes sense given what @ctb observed above: read datasets, which have much shorter individual sequences, are relatively fast compared with genomes.

https://github.com/sourmash-bio/sourmash/blob/13bf0d55b5c9c49b0702f223e22617585eff33bb/src/core/src/signature.rs#L314-L316
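
In Python-style pseudocode, the flow described above looks roughly like this (the real code is Rust; translate_frame and hash_kmer are stand-in placeholders, not sourmash functions):

def frame_hash_buffers(seq, ksize, translate_frame, hash_kmer):
    # translate_frame(seq, i) -> protein sequence for reading frame i (placeholder)
    for frame_index in range(6):                      # 6 reading frames total
        frame = translate_frame(seq, frame_index)
        buffer = []
        for i in range(len(frame) - ksize + 1):       # hash every protein kmer in the frame
            buffer.append(hash_kmer(frame[i:i + ksize]))
        yield buffer                                  # hand back one big buffer per frame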

We did a super basic timing test: we ran sketch translate on tests/test-data/genome-s10.fa.gz using the original code (~2 minutes), and using code that added only a single hash instead (instantaneous). Of course, reducing the number of hashes also affects other downstream operations, but this seems like a good place to start. @mr-eyes suggests that for long sequences only, we could process the hashes in chunks, to avoid adding too many hashes to the buffer at once. This seems like a relatively straightforward/small change! Thoughts @luizirber @ctb?

...and now some conjecture/additional thoughts (I ended up thinking about this some more after our chat, Mo!).

I went spelunking in both the DNA and protein portions of SeqToHashes, and in both cases it does seem that we return an iterator over the hashes. For protein, we just keep track of our position with kmer_index; for DNA we handle forward and reverse with dna_last_position_check (storing the reverse-complement sequence so we can iterate through both at once), and return the minimum hash between the forward and rc kmers. The complicating issue with translate is that we have 6 total reading frames to handle, and we're currently iterating through them but returning a buffer of all hashes for a given frame.

To make an iterator over all hashes, we would have to either store all translated frames (seems bad / not worth it), or store an index of the current frame (e.g. frame_index 0-5) plus a kmer_index for where we are in that translated frame, right? Then we could advance these two indices as we move through the sequence, resetting kmer_index to 0 when we run out of kmers on a particular frame and incrementing frame_index. When that happens, we could translate the next frame on the fly, store it as the current frame, and continue iterating kmers across the frame length.
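
To make that concrete, a hypothetical generator version of the same idea in Python-style pseudocode (again, the real code is Rust; translate_frame and hash_kmer are placeholders for whatever does the actual work):

def frame_hashes_lazy(seq, ksize, translate_frame, hash_kmer):
    # keep only (frame_index, kmer_index) plus the current translated frame
    frame_index = 0
    while frame_index < 6:
        frame = translate_frame(seq, frame_index)     # translate the next frame on the fly
        kmer_index = 0
        while kmer_index + ksize <= len(frame):
            yield hash_kmer(frame[kmer_index:kmer_index + ksize])
            kmer_index += 1                           # advance within the current frame
        frame_index += 1                              # frame exhausted; kmer_index resets above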

Thanks for suffering through my understanding of how things are working and my ideas for how to solve it :)

mr-eyes commented 2 years ago

Thanks, Tessa, for writing the after-chat summary! Also, the iterator idea seems good, but I am not sure it will resolve the problem here.

The real issue is the insertion of all the kmers of all the frames and then returning them all at once; I think that is how seqToHashes was expected to work anyway. After rethinking the chunks idea, I think there's a better resolution.

On the Rust side: instead of translating all the frames, modify the function signature to translate a user-defined frame in a user-defined orientation (forward/reverse). That should minimize the number of kmers that need to be inserted in each invocation.

On the Python side: invoke the Rust seqToHashes(..., frame_number, orientation) and then get its kmers. That way we effectively chunk the data without writing an exhaustive if condition to check the chunk size for each kmer on the Rust side. In Python, we can either collect all the returned kmers into one list, OR we can modify the API to be like this:

for frame in [1,2,3]:
    for orientation in ['F', 'R']:
        seqToHashes(seq, frame, orientation)
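
Purely for illustration, the calling side could then look something like this (seqToHashes(seq, frame, orientation) is the hypothetical signature sketched just above, not an existing sourmash API):

all_hashes = []
for frame in [1, 2, 3]:
    for orientation in ['F', 'R']:
        # each call hashes only a single frame/orientation, so each returned
        # chunk stays small even for very long sequences
        all_hashes.extend(seqToHashes(seq, frame, orientation))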

What do you think?

mr-eyes commented 2 years ago

And I think this can be easily parallelized on the Python side to translate all the frames.

ctb commented 2 years ago

I'm a bit confused - reading the above, it seems to me like you're saying that it's the total number of hashes being processed that's the problem.

But, if you run sourmash sketch translate on a file with many small sequences, it is very fast. If you run it on a file with the same number of total bp, but all in one sequence, it takes a really long time. (I haven't controlled for the total number of hashes, but the number is not that different between the two.)

Is the problem that inserting hashes into the temporary buffer is something like O(N) per insertion, and everything else is working fine? If so, maybe we just need to use a better data structure for temporary hash storage.

ctb commented 2 years ago

Using iterators, I would imagine you could just do six different iterations across the whole sequence, each starting at a different frame, if you wanted to mimic what the other hashing is doing, right?
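
Something like this, roughly (a minimal sketch only; translate_frame and kmer_hashes are placeholders for whatever the real translation and hashing steps are):

from itertools import chain

def hashes_all_frames(seq, ksize, translate_frame, kmer_hashes):
    # six independent iterations, one per reading frame, chained together
    return chain.from_iterable(
        kmer_hashes(translate_frame(seq, f), ksize) for f in range(6)
    )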

ctb commented 2 years ago

> On the Rust side: instead of translating all the frames, modify the function signature to translate a user-defined frame in a user-defined orientation (forward/reverse). That should minimize the number of kmers that need to be inserted in each invocation.
>
> On the Python side: invoke the Rust seqToHashes(..., frame_number, orientation) and then get its kmers. That way we effectively chunk the data without writing an exhaustive if condition to check the chunk size for each kmer on the Rust side. In Python, we can either collect all the returned kmers into one list, OR we can modify the API to be like this:
>
> for frame in [1,2,3]:
>     for orientation in ['F', 'R']:
>         seqToHashes(seq, frame, orientation)
>
> What do you think?

I think I'm basically saying "why not do the double for loop in Rust rather than in Python"? :)

luizirber commented 2 years ago

suggested benchmarks for checking optimization results: https://github.com/sourmash-bio/sourmash/commit/8f3194af6b1fc97dba5b6688b015394195b14d6a

mr-eyes commented 2 years ago

I hope this is fixed now in #1938. Not just this issue, but the whole seqToHashes() functionality should be improved, especially for long sequences. Would you please give it a try, @ctb @bluegenes?

My time benchmark for sourmash sketch translate tests/test-data/genome-s10.fa.gz

was: 2:19.25 now: 0:00.40 :rocket:

I checked the md5sums of both generated sigs, and they match:

96b2d20e28a503d524f5719e1a7dd72f  new_genome-s10.fa.gz.sig
96b2d20e28a503d524f5719e1a7dd72f  old_genome-s10.fa.gz.sig

ctb commented 2 years ago

thank you, @mr-eyes!!

mr-eyes commented 2 years ago

@ctb GCF_018389385.1_ASM1838938v1_genomic.fna now takes around 5 hours. Do you think this is acceptable?

ctb commented 2 years ago

wow, that still seems really long :(. It should take at most about 6 times as long as DNA, right?

mr-eyes commented 2 years ago

> wow, that still seems really long :(. It should take at most about 6 times as long as DNA, right?

I expected that. This happens for super long sequences like this one, which is 4,187,797 bp long. I'll create a new issue and start a PR later with more enhancements.