betatim opened this issue 7 years ago
> Python function calls are notoriously slow,

Speed is a relative thing, isn't it? `hashfun` is called in batches, but the gain is less significant because the relative time spent in other parts of the code increases.
(Note: I also currently have a tiny bug in mashing-pumpkins that is inflating the performance of my code for the smaller batch sizes - https://github.com/lgautier/mashing-pumpkins/issues/2 - working on the fix, should be pushed today.)
Just a note that I fixed the bug and confirm that it was inflating the performance a bit for the smaller batch sizes. The speedup appears to range from 2.5x to 3.5x. I have not looked at memory usage.
edit: as pointed out by @betatim, the alphabet check was not performed on the reverse complement of the sequence. With it, the speedup is down to about 1.6x under similar conditions (1MB, 1,000 hash values in the sketch).
A simple parallelization strategy (map-reduce) appears to give it back an edge (while performing the alphabet check and adding the reverse complement, and while benchmarking against the latest Cythonized version in https://github.com/dib-lab/sourmash/commit/a8dd31b16d803908871007a93ca65ba1ac536cf6).
Not unexpected, but the gain seems too good to be true: over 10x faster for 1MB, k=21, maxsize=1000.
The correctness of the output is tested (as part of the unit tests - coverage is well over 90% and tries to cover edge cases), and, assuming these tests are themselves correct, this would mean that the time measurement is off (although the same method is used everywhere in the benchmark).
In summary, more verification is needed, but in the meantime parallelization looks promising.
nice!
If this holds, and I/O and parsing aside, this would put building a sketch with a maxsize of 20,000 hash values for a full MiSeq v3 run at under 12 minutes on a 2-core machine (with the larger sizes there might be a gain from using more cores - I'll have to check).
Figure from the notebook. The red lines (one line per buffer size) are for the parallel version with 2 child processes:
note: this was edited. Twice. Performance claims were dialed down each time after fixing the benchmark code (to anyone interested: don't use `time.clock()` with multiprocessing).
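For anyone who hits the same trap, here is a minimal sketch of why `time.clock()` misleads with multiprocessing: it measured CPU time of the current process only, so work done in child processes was invisible and the parallel version looked spuriously fast. A wall-clock timer such as `time.perf_counter()` counts the children too (toy workload, timings indicative only):

```python
import time
from multiprocessing import Pool

def work(n):
    # CPU-bound toy task that runs in a child process
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # time.clock() (removed in Python 3.8) measured CPU time of *this*
    # process only, so the children's work would not be counted and the
    # parallel run would look almost instantaneous.
    start = time.perf_counter()  # wall-clock: includes the children
    with Pool(2) as pool:
        pool.map(work, [5_000_000] * 4)
    print(f"wall time: {time.perf_counter() - start:.2f}s")
```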
n.b. sourmash only adds the smaller of the sequence or the revcomp, not both.
@betatim

> n.b. sourmash only adds the smaller of the sequence or the revcomp, not both.

Not sure it makes much of a difference in practice.
Hashing is essentially a sampling strategy, and the odds that, for the same sub-sequence of length k, the hash values of both the direct sequence and its reverse complement end up among the maxsize smallest hash values are limited... assuming independence between the rank of the hash value for a sub-sequence and the rank for its reverse complement (since hashing is hoped to be uniform over the space of possible sequences, and the space of possible hash values >> maxsize). This is obviously not going to be completely true (note: one of the ideas I'd like to explore is related to that - I'd love to discuss it), but should this happen it would be restricted to (very) rare sub-sequences / k-mers.
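A back-of-envelope version of that argument (my numbers, purely illustrative): under the independence and uniformity assumptions above, a given k-mer's hash lands among the maxsize smallest with probability roughly maxsize / n for n distinct k-mers, so both strands make it with probability roughly (maxsize / n)^2:

```python
# Illustrative estimate under the independence/uniformity assumptions
# discussed above; n and s are made-up example numbers.
n = 1_000_000  # distinct k-mers observed (hypothetical)
s = 1_000      # maxsize of the bottom-sketch
p_one = s / n           # P(a given k-mer's hash is among the s smallest)
p_both = p_one ** 2     # P(k-mer AND its reverse complement both make it)
print(f"per k-mer: {p_both:.1e}; expected over all k-mers: {n * p_both:.1f}")
```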
If requiring the assurance that this could never happen, it would (presumably) still be faster, for example by adding a bookkeeping `dict` on insertion into the sketch (O(1) to check whether a sub-sequence is already in the sketch, O(1) + O(maxsize) to replace it).
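A minimal sketch of that bookkeeping idea (my illustration, not mashing-pumpkins' actual implementation): a heap holds the maxsize smallest hash values, and a dict gives the O(1) membership check; the `mmh3` package stands in for a MurmurHash3 binding.

```python
import heapq
import mmh3  # assumption: the mmh3 MurmurHash3 binding

class BottomSketch:
    """Bottom-sketch of the `maxsize` smallest hash values, with a
    bookkeeping dict for O(1) 'already in the sketch?' checks."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._heap = []   # max-heap of the kept values (negated)
        self._seen = {}   # hash value -> k-mer: the O(1) lookup

    def add(self, kmer):
        h = mmh3.hash64(kmer, signed=False)[0]
        if h in self._seen:              # O(1): already in the sketch
            return
        if len(self._heap) < self.maxsize:
            heapq.heappush(self._heap, -h)
            self._seen[h] = kmer
        elif h < -self._heap[0]:         # smaller than the largest kept
            evicted = -heapq.heappushpop(self._heap, -h)  # O(maxsize) in the worst case
            del self._seen[evicted]
            self._seen[h] = kmer
```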
I just wanted to point it out since it means you could save the time of computing the hash for the other sequence.
Ah. I see.
In that case, and because I don't have anything else in C than hash values computed in small batches, branching might be more trouble than it is worth: the benchmarks for MurmurHash3 and XXHash I have read are in the GB/s range, which I consider essentially free compared to the rest.
That said, you might be right, and it should be tested in order to be sure...
edit: I am just thinking that the relative cost of hashing will increase with the `k` in k-mers, so this is not as free as I initially thought. There might actually be visible benefits. I need to think a little about how to implement it without having branching offset these possible benefits, while keeping Python-level flexibility...
re-edit: Scratch that. Hash values must be computed for both in order to know which one has the smallest value. The potential speed gain would be on insertion into the minhash, but that gain will be proportional to the frequency at which such insertions occur... and if these are very rare events the potential gain is not very much.
Wrapping this up:
I have hacked together an adapter to sourmash to benchmark things end-to-end: I am able to measure a 2x to 3x speedup compared to sourmash_lib/master (from FASTA.gz or FASTQ.gz to signature in JSON). It should be possible to go faster from there, but it would likely require something like PyPy, or moving more to C (Cython or C/C++).
The benchmark is here: https://gist.github.com/lgautier/fb9e18636cd1a497ecc4fad8c440aaec (requires mashing-pumpkins/master on Python >= 3.5 - if this is too much trouble I have added it to the following Docker container: https://hub.docker.com/r/lgautier/dnasnout-client/).
(Before you ask: I am checking that the signatures are the same as the ones from sourmash, with `sourmash search`.)
Using your docker container I get:

```
$ time sourmash compute --force --dna chr1.fa
46s
$ time python3 -m mashingpumpkins.demo.cmdline --format=FASTA --ncpu=2 --maxsize=500 chr1.fa
28s
```
(If you use the .gz, mashingpumpkins slows down by ~2s but sourmash doesn't. Noise?)
With ncpu=2 it seems mashingpumpkins only really uses one CPU? There are two processes, but one of them seems to use hardly any CPU time. Does that look right to you?
Another observation is that sourmash uses more memory (~1.2GB) compared to "hundreds of MB" for mashingpumpkins.
> (If you use the .gz, mashingpumpkins slows down by ~2s but sourmash doesn't. Noise?)

I am using Python's `gzip.open` for gzip-compressed files, and `open` when there is no compression. Either noise, or there is room for tuning in what I am doing (the fraction of time spent reading/parsing is likely higher with that demo script than with sourmash - more on this below).
> With ncpu=2 it seems mashingpumpkins only really uses one CPU? There are two processes, but one of them seems to use hardly any CPU time. Does that look right to you?

The sketches have a pretty interesting property for map-reduce types of parallelization:

```
minhash(seq_a + seq_b) = minhash(seq_a) + minhash(seq_b)
 ^-- reduce task          ^-- map task     ^-- map task
```
`ncpu - 1` child processes are spawned to handle the map tasks, while the main/parent process reads/parses the input files (FASTA or FASTQ), feeds the child processes with chunks of parsed data as they become free, and "reduces" the minhash sketches returned by the child tasks. Depending on the load represented by each task (parse, compute minhash, merge minhashes), the system can appear to use less than the total number of CPUs/cores allocated. Here I think that parsing is the bottleneck and the child processes spend a fraction of their time waiting. Future directions would be to 1) optimize parsing and, if significant improvements can be reached, 2) design a more elaborate parallelization architecture where the reduce step is moved to its own process.
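A condensed sketch of that architecture (toy code, my simplification: a bottom-sketch is represented as a sorted list of its smallest hash values, and `zlib.crc32` stands in for the real hash function): the parent dispatches parsed chunks to a pool and merges the partial sketches as they come back.

```python
from multiprocessing import Pool
import zlib

MAXSIZE, K = 1000, 21

def map_task(chunk):
    # build a partial bottom-sketch for one chunk of parsed sequence
    hashes = {zlib.crc32(chunk[i:i + K]) for i in range(len(chunk) - K + 1)}
    return sorted(hashes)[:MAXSIZE]

def reduce_task(a, b):
    # merge step, using minhash(seq_a + seq_b) == minhash(seq_a) + minhash(seq_b)
    return sorted(set(a) | set(b))[:MAXSIZE]

if __name__ == "__main__":
    chunks = [b"ACGT" * 10_000, b"TTGCAA" * 8_000]  # stand-in for parsed reads
    sketch = []
    with Pool(2) as pool:  # ncpu - 1 children do the map tasks; the parent reduces
        for partial in pool.imap_unordered(map_task, chunks):
            sketch = reduce_task(sketch, partial)
```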
> Another observation is that sourmash uses more memory (~1.2GB) compared to "hundreds of MB" for mashingpumpkins.

I was not particularly aggressive about memory usage, so this comes as good news (I am paying attention to avoiding unnecessary copies in a few places, but this is for speed reasons). On the other hand, I am not surprised by the generally good memory profile of the approach taken, because of past work (about 4 years ago) where we demoed that streaming through FASTQ files to select k-mers or reads could be done on an Android tablet (with a server identifying the genomic content - that's also why I am using the Docker container defined at https://bitbucket.org/lgautier/dnasnout-client).
> Future directions would be to 1) optimize parsing and, if significant improvements can be reached, 2) design a more elaborate parallelization architecture where the reduce step is moved to its own process.

I measured the throughput of the parsing alone. On a gzip-compressed test FASTQ of size ~700MB I get:

- 5.5MB/s for `ngs_plumbing` (the parsing library used by `mashingpumpkins.demo.cmdline`)
- 3.5MB/s for `screed` (the parsing library used by `sourmash`)
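The measurement itself is straightforward (a sketch of the method, not the exact benchmark code; `parse` stands for either library's record iterator):

```python
import os
import time

def throughput_mb_per_s(path, parse):
    # drive the parser over the whole file and relate wall time to the
    # compressed file size on disk
    size_mb = os.path.getsize(path) / 1e6
    start = time.perf_counter()
    with open(path, "rb") as fh:
        for _record in parse(fh):
            pass
    return size_mb / (time.perf_counter() - start)
```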
While the building of the minhash is measurably faster with mashing-pumpkins than with sourmash_lib in a variety of use cases, parsing appears to be a significant limiting factor, and the speedup observed on the command line may owe a lot to the different parser used.
The parser is definitely a bottleneck. I pushed performance a little further by putting together a faster FASTQ parser (https://github.com/lgautier/fastq-and-furious). Building a bottom-sketch (1,000 hash values, k=31) for a FASTQ of size ~700MB now takes about 1'30" (4 to 6 times faster than current sourmash master).
The benchmark code was updated: https://gist.github.com/lgautier/fb9e18636cd1a497ecc4fad8c440aaec
To copy what mashingpumpkins does we need a hashtable implementation for C++ that offers O(1) lookup. Google's densehash, or a simple homemade hashtable using open addressing and quadratic probing, would do the job. One thing to figure out for our own implementation of a hashmap is what value to use as the empty value and the tombstone value. Two ideas:
Not sure I like either right now.
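To make the empty/tombstone bookkeeping concrete, here is a toy open-addressing table in Python (my stand-in for the C++ table being discussed, no resizing; using sentinel *objects* sidesteps the question of which hash *values* to reserve, which is exactly what a C++ implementation has to decide):

```python
EMPTY, TOMBSTONE = object(), object()  # a C++ table must encode these as reserved values

class OpenAddressingSet:
    def __init__(self, capacity=1024):      # capacity must be a power of two
        self._slots = [EMPTY] * capacity

    def _probe(self, h):
        # triangular-number quadratic probing: visits every slot when the
        # capacity is a power of two
        mask = len(self._slots) - 1
        i, step = h & mask, 0
        while True:
            yield i
            step += 1
            i = (i + step) & mask

    def add(self, h):
        target = None
        for i in self._probe(h):
            slot = self._slots[i]
            if slot is EMPTY:
                # insert, reusing the first tombstone seen along the chain
                self._slots[target if target is not None else i] = h
                return
            if slot is TOMBSTONE:
                target = target if target is not None else i
            elif slot == h:
                return  # already present

    def discard(self, h):
        for i in self._probe(h):
            slot = self._slots[i]
            if slot is EMPTY:
                return
            if slot == h:
                self._slots[i] = TOMBSTONE  # EMPTY here would break probe chains
                return
```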
The fasta/q parser used in sourmash comes from https://github.com/dib-lab/screed/blob/master/screed/fasta.py - speeding that up would also help khmer, as this is shared code.
It's true!
A few notes about screed:

- supports streaming I/O
- supports autodetection of FASTA, FASTQ, gz, and bz2 formats, in streaming mode as well
@betatim

> To copy what mashingpumpkins does we need a hashtable implementation for C++ that offers O(1) lookup. (...)
> Not sure I like either right now.

mashing-pumpkins is a library (with a permissive license), and the code in `mashingpumpkins.demo.cmdline` shows one way to use it as a drop-in replacement to build sourmash JSON signatures. Maybe a better transition strategy?
Beside that, its speed is believed to just be a nice aside*. The most effort went into its design, to achieve flexibility when working with bottom-sketches (see https://github.com/marbl/Mash/issues/45#issuecomment-274665746 - I could implement an alternative way to build sketches for double-stranded genomes and compare performances in very little code).
(*: The design allows incremental speed-ups by porting functions to C/C++/Cython.)
> mashing-pumpkins is a library (with a permissive license), and the code in `mashingpumpkins.demo.cmdline` shows one way to use it as a drop-in replacement to build sourmash JSON signatures. Maybe a better transition strategy?

While fine in theory, adding dependencies to projects is always a difficult question. Each dependency adds complexity with a lot of potential unpredictability, and this is especially true when adding young projects as dependencies. So if I (we?) seem hesitant, that's why :).
Simply including the code in the sourmash repo is an alternative that we should also consider.
@ctb

> A few notes about screed:
> - supports streaming I/O
> - supports autodetection of FASTA, FASTQ, gz, and bz2 formats, in streaming mode as well

If going for refactoring things a little, getting the autodetection bit out into its own functions would be awesome. Streaming is chain-able, and this would mean two functions (sketched below):

- detect the compression scheme
- detect the format
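Something like this, perhaps (a sketch of the two functions with hypothetical names, not screed's actual API: compression is sniffed from the magic bytes, the format from the first record marker):

```python
import bz2
import gzip
import io

def open_decompressed(path):
    # detect the compression scheme from the magic bytes and return a
    # decompressing text stream (chain-able with format detection)
    with open(path, "rb") as raw:
        magic = raw.read(2)
    if magic == b"\x1f\x8b":                 # gzip
        return io.TextIOWrapper(gzip.open(path))
    if magic == b"BZ":                       # bz2
        return io.TextIOWrapper(bz2.open(path))
    return open(path)

def detect_format(stream):
    # detect the sequence format from the first record marker
    first = stream.read(1)
    stream.seek(0)
    return {">": "FASTA", "@": "FASTQ"}.get(first, "unknown")
```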
> While fine in theory, adding dependencies to projects is always a difficult question. Each dependency adds complexity with a lot of potential unpredictability, and this is especially true when adding young projects as dependencies. So if I (we?) seem hesitant, that's why :).

That code is definitely young (about 3 weeks old), but short and specialized. It is a low-level library for building top and bottom sketches (MinHash & MaxHash) and derived structures (e.g., CountMinHash), and for using them (Jaccard index, Bray-Curtis for count sketches, etc.), and it is relatively well tested (coverage well over 90%).
edit: The code is also for Python >= 3.5 only. Possibly a limiting factor if the user base is stuck with Python 2 (although hard to believe in an age where there is Docker), but not radical (check the projects dropping support for Python 2 - http://www.python3statement.org/), and it leads to both a simpler code base and better performance.
> Simply including the code in the sourmash repo is an alternative that we should also consider.

Just saying that before embarking on a move to C++ just to copy something existing (with the likely technical debt that monolithic designs represent in research projects), one could get a sense of whether the immediate possible pain point (that is, the building of sketches) can be relieved. I hammered out a release for the very purpose of allowing that to happen, and the license was chosen so as to allow inclusion (the way you are already including code for MurmurHash3).
@ctb, @betatim
I cleaned up the code a bit and documented the parsing proof-of-concept I wrote for the latest (fastest-running) iteration of the other proof-of-concept (mashing-pumpkins). Modularity is again part of the design, and you may get an instant speed-up of the current sourmash with an adapter for screed (if my benchmarks for the minhash build are about accurate, you should get a 1.5x speedup with the 3-4 lines of code of the adapter). All this is here: https://github.com/lgautier/fastq-and-furious
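For illustration, the adapter could look something like this (hypothetical: I am guessing at a fastq-and-furious entry point and at the minimal record interface sourmash uses from screed; check the repo for the actual API):

```python
# Hypothetical adapter: expose fastq-and-furious parsing through a
# minimal screed-like record interface (dict with 'name'/'sequence').
# The entry-point name and signature below are assumptions, not the
# documented fastq-and-furious API.
from fastqandfurious import fastqandfurious as fqf  # assumed module layout

def screed_like_iter(fh, bufsize=65536):
    for entry in fqf.readfastq_iter(fh, bufsize):   # assumed entry point
        header, sequence = entry[0], entry[1]
        yield {"name": header.decode("ascii"),
               "sequence": sequence.decode("ascii")}
```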
On Mon, Jan 30, 2017 at 06:56:29AM -0800, Laurent Gautier wrote:

> @ctb
>
> > A few notes about screed:
> > - supports streaming I/O
> > - supports autodetection of FASTA, FASTQ, gz, and bz2 formats, in streaming mode as well
>
> If going for refactoring things a little, getting the autodetection bit out into its own functions would be awesome. Streaming is chain-able, and this would mean two functions:
> - detect the compression scheme
> - detect the format

Good idea! Not currently a priority (that's for the 1.0 release), but we will put it on the list.
One last thing before I have to jump out of this thread. @peterjc pointed out that Biopython can be quite fast at parsing through FASTQ (see https://github.com/lgautier/fastq-and-furious/issues/2)... faster than most parsers I tried. fastq-and-furious is a proof-of-concept, so you might want to consider Biopython if a long-established codebase is preferred.
Good thread on Apache Arrow - https://twitter.com/fmic_/status/1570817872477261827 - we should look into conbench - https://github.com/conbench/conbench
Branching off of #103, a place to discuss ideas, what worked and what didn't work regarding making sourmash fast again.
@lgautier's "python all the way" notebook: https://github.com/lgautier/mashing-pumpkins/blob/master/doc/notebooks/MinHash%2C%20design%20and%20performance.ipynb
Python function calls are notoriously slow, so we should see an improvement by calling `hashfun` less often (in batches). Do you think that murmur hash also gains on top of that by computing things in batches?
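As a rough illustration of the batching intuition (a toy stand-in for `hashfun`, with the builtin `hash` instead of murmur; timings indicative only): the win comes from crossing the Python function-call boundary once per batch instead of once per k-mer.

```python
import time

kmers = [b"ACGTACGTACGTACGTACGTA"] * 1_000_000  # toy k-mer list (k=21)

def hashfun(kmer):                 # one Python call per k-mer
    return hash(kmer)

t0 = time.perf_counter()
per_item = [hashfun(k) for k in kmers]
t1 = time.perf_counter()
batched = list(map(hash, kmers))   # the loop stays at C level
t2 = time.perf_counter()
print(f"per-item: {t1 - t0:.2f}s  batched: {t2 - t1:.2f}s")
```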