% time sourmash sketch translate ../bothie/data/genome-s10.fa.gz
== This is sourmash version 4.2.3.dev23+g46ee6550. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
computing signatures for files: ../bothie/data/genome-s10.fa.gz
Computing a total of 1 signature(s).
... reading sequences from ../bothie/data/genome-s10.fa.gz
calculated 1 signatures for 1 sequences in ../bothie/data/genome-s10.fa.gz
saved signature(s) to genome-s10.fa.gz.sig. Note: signature license is CC0.
real 20m8.894s
user 19m58.022s
sys 0m3.855s
20 minutes to run on a 500kb long sequence!?
(The file is available in the sourmash source tree as `tests/test-data/genome-s10.fa.gz`.)
Sketching `GCF_018389385.1_ASM1838938v1_genomic.fna` took something like 16 hours using default options!?
Yes - I am noticing that small genomes are slow but manageable, but large genomes are pretty intractable.
(at scaled=1) Of the ~9000 genomes in my comparison set, ~7500 finished in <=4 hours (a low queue limit), most in <=30 minutes. I allowed 4 days when running the remaining ones, which I thought would be overkill, but the jobs timed out.
I decided to abandon scaled=1 for more reasonable scaled values, but now we're at >20 hours for most of these genomes.
Sizes of the genomes currently running:
2.8M GCF_001767235.1_genomic.fna.gz
2.7M GCF_000828075.3_genomic.fna.gz
2.7M GCF_003751225.1_genomic.fna.gz
2.9M GCF_000014905.1_genomic.fna.gz
2.9M GCF_014205215.1_genomic.fna.gz
2.9M GCF_011764445.1_genomic.fna.gz
3.1M GCF_001189295.1_genomic.fna.gz
3.2M GCF_000568255.1_genomic.fna.gz
2.8M GCF_000255295.1_genomic.fna.gz
3.4M GCF_001263205.1_genomic.fna.gz
3.5M GCF_000067165.1_genomic.fna.gz
2.8M GCF_000504285.1_genomic.fna.gz
2.6M GCF_900109545.1_genomic.fna.gz
2.6M GCF_000024805.1_genomic.fna.gz
I've also allowed 4 days ... I can post runtime information by genome size when they finish.
These genomes are available in /home/ntpierce/2021-rank-compare/genbank/genomes
I'll note that for genomes, using prodigal to find ORFs and then sketching the proteins is actually (much) faster at the moment 😬
It is `SeqToHashes`: need to take a closer look to see where it is creating too many small allocations.
@luizirber, thanks for getting us a bit closer! Helpfully, you left a note at the beginning of the `translate` code: https://github.com/sourmash-bio/sourmash/blob/13bf0d55b5c9c49b0702f223e22617585eff33bb/src/core/src/signature.rs#L294-L295
@mr-eyes and I looked through the code today and ran a couple of small checks. We think the issue (or part of the issue) is with lines 314-315 / 333-335 (the same code in two places). After translating a single frame, we hash all protein k-mers for that frame and store them in a buffer. For long sequences (genomes), this can be quite a lot of hashes. This intuitively makes sense given what @ctb observed above: read datasets, which have much shorter sequences, are relatively fast compared with genomes.
We did a super basic timing test: we ran `sketch translate` on `tests/test-data/genome-s10.fa.gz` using the original code (~2 minutes) and using code that added only a single hash instead (instantaneous). Of course, reducing hashes also affects other downstream operations, but this seems a good place to start. @mr-eyes suggests that for long sequences only, we could process the hashes in chunks, to avoid having to add too many hashes to the buffer. This seems like a relatively straightforward/small change! Thoughts @luizirber @ctb?
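To make the chunking idea concrete, here is a rough Python sketch; the real buffer lives in the Rust `SeqToHashes` code, and `frame_hashes` and `chunk_size` below are illustrative stand-ins rather than sourmash names:

```python
# Illustration only: emit hashes in bounded-size chunks instead of one buffer
# that grows with genome length. `frame_hashes` stands in for the per-frame
# hash stream; `chunk_size` is an arbitrary bound, not a sourmash parameter.
def hashes_in_chunks(frame_hashes, chunk_size=10_000):
    chunk = []
    for h in frame_hashes:
        chunk.append(h)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # flush the final partial chunk
```

Each chunk could then be added to the signature and dropped, so peak memory stays bounded regardless of sequence length.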
...and now some conjecture/additional thoughts (I ended up thinking about this some more after our chat, Mo!).
I went spelunking in both the `dna` and `protein` portions of `SeqToHashes`, and there it does seem that we return an iterator over the hashes. For `protein`, we just keep track with the `kmer_index`; for `dna` we handle forward and reverse with `dna_last_position_check` (storing the reverse complement sequence so we can iterate through both at once) and return the minimum hash between the forward and rc k-mers. The complicating issue with `translate` is that we have 6 total reading frames to handle, and we're currently iterating through those but returning a buffer of all hashes for a given frame.
In order to make an iterator over all hashes, we would have to either store all frames (seems bad/not worth it), or we could store an index of translation (e.g. `frame_index` 0-5) plus a `kmer_index` for where we are in the translated frame, right? Then we could move these two indices as we move through the sequence, resetting `kmer_index` to 0 when we run out of k-mers on a particular frame and incrementing the `frame_index`. When that happens, we could translate the next frame on the fly, store it as the current frame, and continue iterating k-mers across the frame length.
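To make that concrete, here is a minimal Python sketch of the two-index iteration; the real code would live in Rust's `SeqToHashes`, and `translate_frame()` plus Python's built-in `hash()` are simplified stand-ins, not the sourmash implementations:

```python
# Minimal sketch of the frame_index / kmer_index iteration described above.
CODON_TABLE = {"ATG": "M", "TGG": "W", "TAA": "*", "TAG": "*", "TGA": "*"}  # truncated for illustration
REVCOMP = str.maketrans("ACGT", "TGCA")

def translate_frame(seq, frame_index):
    """Translate one of six reading frames: 0-2 forward, 3-5 on the reverse complement."""
    if frame_index >= 3:
        seq = seq.translate(REVCOMP)[::-1]
    offset = frame_index % 3
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def translated_hashes(seq, ksize):
    """Lazily yield one hash at a time: frame_index walks the six frames,
    kmer_index walks the k-mers within the current translated frame."""
    for frame_index in range(6):
        frame = translate_frame(seq, frame_index)  # translate the next frame on the fly
        for kmer_index in range(len(frame) - ksize + 1):
            yield hash(frame[kmer_index:kmer_index + ksize])  # stand-in for the real hash function
```

The point is just that `frame_index` and `kmer_index` are enough state to resume the iteration anywhere, so only the single translated frame currently in use is held in memory rather than a buffer of every hash.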
Thanks for suffering through my understanding of how things work and my ideas on how to solve it :)
Thanks, Tessa, for writing the after-chat summary! The iterator idea seems good, but I am not sure it will resolve the problem here.
The real issue is inserting all the k-mers of all the frames and then returning them all at once. I think that is how `seqToHashes` was expected to work anyway. After some rethinking about chunks, I think there's a better resolution.
On the Rust side: instead of translating all the frames, modify the function signature to translate a user-defined frame in a user-defined orientation (forward/reverse). That should minimize the number of k-mers that need to be inserted in each invocation.
On the Python side: invoke the Rust `seqToHashes(..., frame_number, orientation)` and then get its k-mers. That way we actually chunk the data, but without writing an exhaustive if-condition to check the chunk size on each k-mer on the Rust side. In Python, we can either yield all the returned k-mers into one list OR we can modify the API to be like this:
for frame in [1, 2, 3]:
    for orientation in ['F', 'R']:
        seqToHashes(seq, frame, orientation)
What do you think?
And I think this can be easily parallelized on the Python side to translate all the frames.
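For illustration, the Python side might look roughly like the sketch below, assuming a hypothetical per-frame binding `seq_to_hashes(seq, frame, orientation)` (this is not the current sourmash API):

```python
# Sketch of the proposed usage. seq_to_hashes(seq, frame, orientation) is the
# hypothetical per-frame Rust binding suggested above, not an existing function.
def all_translated_hashes(seq):
    """Yield hashes one (frame, orientation) pair at a time, so each Rust call
    only ever buffers the hashes of a single frame."""
    for frame in (1, 2, 3):
        for orientation in ('F', 'R'):
            yield from seq_to_hashes(seq, frame, orientation)
```

Since the six (frame, orientation) calls are independent, they could also be dispatched to a process or thread pool if parallelism turns out to be worthwhile.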
I'm a bit confused - reading the above, it seems to me like you're saying that it's the total number of hashes being processed that's the problem.
But, if you run `sourmash sketch translate` on a file with many small sequences, it is very fast. If you run it on a file with the same number of total bp, but all in one sequence, it takes a really long time. (I haven't controlled for the total number of hashes, but the number is not that different between the two.)
Is the problem that the buffer is something like O(N) for inserting hashes for temporary storage, and everything else is working fine? If so, maybe we just need a better data structure for temporary hash storage.
Using iterators, I would imagine you could just do six different iterations across the whole sequence, each starting at a different frame, if you wanted to mimic what the other hashing is doing, right?
> On the Rust side: instead of translating all the frames, modify the function signature to translate a user-defined frame in a user-defined orientation (forward/reverse). That should minimize the number of k-mers that need to be inserted in each invocation.
>
> On the Python side: invoke the Rust `seqToHashes(..., frame_number, orientation)` and then get its k-mers. That way we actually chunk the data, but without writing an exhaustive if-condition to check the chunk size on each k-mer on the Rust side. In Python, we can either yield all the returned k-mers into one list OR we can modify the API to be like this: `for frame in [1,2,3]: for orientation in ['F', 'R']: seqToHashes(seq, frame, orientation)`
>
> What do you think?
I think I'm basically saying "why not do the double for loop in Rust rather than in Python"? :)
suggested benchmarks for checking optimization results: https://github.com/sourmash-bio/sourmash/commit/8f3194af6b1fc97dba5b6688b015394195b14d6a
I hope this is fixed now in #1938, and not only this issue but the whole `seqToHashes()` functionality, especially for long sequences.
Would you please give it a try, @ctb @bluegenes?
My time benchmark for `sourmash sketch translate tests/test-data/genome-s10.fa.gz`: was 2:19.25, now 0:00.40 :rocket:
Checked md5sum for both generated sigs:
96b2d20e28a503d524f5719e1a7dd72f new_genome-s10.fa.gz.sig
96b2d20e28a503d524f5719e1a7dd72f old_genome-s10.fa.gz.sig
thank you, @mr-eyes!!
@ctb `GCF_018389385.1_ASM1838938v1_genomic.fna` now takes around 5 hours. Do you think this is acceptable?
wow, that still seems really long :(. It should take at most about 6 times as long as DNA, right?
> wow, that still seems really long :(. It should take at most about 6 times as long as DNA, right?
I expected that. This happens for super-long sequences like this one, which is 4,187,797 bp long. I will create a new issue and start a PR later for more enhancements.
I'm seeing that `sourmash sketch translate` is taking a Very Long Time - minutes - on a genome-sized data set. But ISTR it runs reasonably fast on read data sets. Maybe I misremember? Anyway, putting this here to remind myself to look into it, and/or to get feedback from @bluegenes.