sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Performance ponies #104

Open betatim opened 7 years ago

betatim commented 7 years ago

Branching off of #103, a place to discuss ideas, what worked and what didn't work regarding making sourmash fast again.

@lgautier's "python all the way notebook": https://github.com/lgautier/mashing-pumpkins/blob/master/doc/notebooks/MinHash%2C%20design%20and%20performance.ipynb

Python function calls are notoriously slow, so we should see improvement by calling hashfun less often (in batches). Do you think that murmur hash also gains on top of that by computing things in a batch?
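The batching idea can be sketched in pure Python; here the built-in `hash` stands in for MurmurHash, and the function name is illustrative, not the mashing-pumpkins API:

```python
def hash_kmers_batched(seq, k, hashfun=hash, batch_size=1000):
    """Hash every k-mer of seq, materializing k-mers a batch at a time
    and hashing each batch with map() to cut per-call Python overhead."""
    out = []
    n = len(seq) - k + 1
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # slice one batch of k-mers, then hash them in a tight loop
        batch = [seq[i:i + k] for i in range(start, stop)]
        out.extend(map(hashfun, batch))
    return out
```

With a C-level hash function that accepts a whole batch, the single per-batch call replaces thousands of per-k-mer Python calls, which is where the gain comes from.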

lgautier commented 7 years ago

Python function calls are notoriously slow,

Speed is a relative thing, isn't it ?

hashfun is called in batches, but the gain is less significant because the relative time spent in other parts of the code is increasing.

(Note: I also currently have a tiny bug in mashing-pumpkins that is inflating the performance of my code for the smaller batch sizes - https://github.com/lgautier/mashing-pumpkins/issues/2 - working on the fix, should be pushed today).

lgautier commented 7 years ago

Just a note that I fixed the bug and can confirm it was inflating the performance a bit at smaller batch sizes. The speedup appears to range from 2.5x to 3.5x. I have not looked at memory usage.

edit: as pointed out by @betatim, the alphabet check on the sequence's reverse complement was not being performed. The speedup is down to about 1.6x under similar conditions (1MB, 1,000 hash values in the sketch).

lgautier commented 7 years ago

Update:

A simple parallelization strategy (map-reduce) appears to give it back an edge (while doing alphabet check and adding the reverse complement, and while benchmarking using the latest Cythonized in https://github.com/dib-lab/sourmash/commit/a8dd31b16d803908871007a93ca65ba1ac536cf6 ).

Not unexpected, but the gain seems too good to be true: over 10x faster for 1MB, k=21, maxsize=1000.

The correctness of the output is tested (part of the unit tests - coverage is well over 90% and tries to cover edge cases), so, assuming those tests are themselves correct, this would mean that the time measurement is off (although the same timing method is used everywhere in the benchmark).

In summary, more verifications are needed but in the meantime parallelization looks promising.

ctb commented 7 years ago

nice!

lgautier commented 7 years ago

If this holds, then I/O and parsing aside, it would put building a sketch with a maxsize of 20,000 hash values for a full MiSeq v3 run under 12 minutes on a 2-core machine (with larger sizes there might be a gain from using more cores - I'll have to check).

Figure from the notebook. The red lines (one line per buffer size) are for the parallel version with 2 child processes:

(figure: benchmark timings from the notebook)

note: this was edited. Twice. Performance claims dialed down each time after fixing benchmark code (to anyone interested: don't use time.clock() with multiprocessing)
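The time.clock() pitfall is worth spelling out: it measured CPU time of the calling process only, so time spent waiting on child processes vanished from the measurement (the function was deprecated and eventually removed in Python 3.8). A minimal demonstration with the modern timers:

```python
import time

t0_wall = time.perf_counter()   # wall-clock timer
t0_cpu = time.process_time()    # CPU time of *this* process only
time.sleep(0.2)                 # stands in for waiting on child processes
wall = time.perf_counter() - t0_wall
cpu = time.process_time() - t0_cpu
# The wall clock sees the ~0.2s; the CPU clock sees almost nothing,
# which is how a CPU-time clock under-reports multiprocessing runs.
```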

betatim commented 7 years ago

n.b. sourmash only adds the smaller of the sequence or the revcomp, not both.
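The rule described here can be sketched as follows (assuming "smaller" means lexicographic order; a minimal illustration, not the sourmash implementation):

```python
def revcomp(s):
    """Reverse complement of a DNA string (A/C/G/T alphabet only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(s))

def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement, so a
    k-mer and its revcomp always map to the same hashed string."""
    rc = revcomp(kmer)
    return kmer if kmer < rc else rc
```

Because both strands map to one canonical string, only one hash value needs to be stored per k-mer.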

lgautier commented 7 years ago

@betatim

n.b. sourmash only adds the smaller of the sequence or the revcomp, not both.

Not sure it makes much of a difference in practice.

Hashing is essentially a sampling strategy, and the odds that, for a given sub-sequence, the hash values of both its direct and its reverse-complement form land among the maxsize smallest hash values are limited... assuming independence between the rank of a sub-sequence's hash value and that of its reverse complement (hashing is hoped to be uniform over the space of possible sequences, and the space of possible hash values >> maxsize). This is obviously not going to be completely true (note: one of the ideas I'd like to explore is related to that - I'd love to discuss it), but should it happen it would be restricted to (very) rare sub-sequences / k-mers.

If the assurance that this can never happen is required, it would (presumably) still go fast, for example by adding a bookkeeping dict on insertion into the sketch (O(1) to check whether a sub-sequence is already present, O(1) + O(maxsize) to replace it).
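The bookkeeping idea can be sketched with a max-heap plus a set (names are illustrative; this is not the sourmash or mashing-pumpkins implementation):

```python
import heapq

class BottomSketch:
    """Toy bottom-k sketch keeping the maxsize smallest distinct hash
    values, with a set for O(1) membership checks on insertion."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._heap = []     # max-heap via negation: largest kept value on top
        self._kept = set()  # O(1) "already in the sketch?" bookkeeping

    def add(self, h):
        if h in self._kept:                 # O(1) duplicate check
            return
        if len(self._heap) < self.maxsize:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:            # beats the largest kept value
            evicted = -heapq.heapreplace(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def values(self):
        return sorted(-v for v in self._heap)
```

The set stays in lockstep with the heap, so duplicate insertions are rejected in constant time while replacement remains O(log maxsize) via the heap.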

betatim commented 7 years ago

I just wanted to point it out as it means you could save the time for computing the hash for the other sequence.

lgautier commented 7 years ago

Ah. I see.

In that case, and because the only thing I have in C is the hash values computed in small batches, branching might be more trouble than it is worth: the benchmarks for MurmurHash3 and XXHash I have read are in the GB/s range, which I consider essentially free compared to the rest.

That said, you might be right, and it should be tested to be sure...

edit: on reflection, the relative cost of hashing will increase with the k in k-mers, so this is not as free as I initially thought. There might actually be visible benefits. I need to think a little about how to implement it without branching offsetting these possible benefits, and while keeping Python-level flexibility...

re-edit: Scratch that. Hash values must be computed for both strands in order to know which one has the smallest value. The potential speed gain would be on insertion into the minhash, but it would be proportional to the frequency at which such cases occur... and if these are very rare events the potential gain is not much.

lgautier commented 7 years ago

Wrapping this up:

I have hacked together an adapter for sourmash to benchmark things end to end: I measure a 2x to 3x speedup compared to sourmash_lib/master (from FASTA.gz or FASTQ.gz to signature in JSON). It should be possible to go faster from there, but that would likely require something like PyPy, or moving more to C (Cython or C/C++).

The benchmark is here: https://gist.github.com/lgautier/fb9e18636cd1a497ecc4fad8c440aaec (requires mashing-pumpkins/master on Python >= 3.5 - if that is too much trouble, I have added it to the following Docker container: https://hub.docker.com/r/lgautier/dnasnout-client/). (Before you ask: I am checking that the signatures are the same as the ones from sourmash, with sourmash search.)

betatim commented 7 years ago

Using your docker container I get:

$ time sourmash compute --force --dna  chr1.fa
46s
$ time python3 -m mashingpumpkins.demo.cmdline --format=FASTA --ncpu=2 --maxsize=500 chr1.fa
28s

(if you use the .gz mashingpumpkins slows down by ~2s but sourmash doesn't. Noise?)

With ncpu=2 it seems mashingpumpkins only really uses one CPU? There are two processes but one of them seems to use hardly any CPU time. Does that look right to you?

Another observation is that sourmash uses more memory (~1.2gig) compared to "hundreds of MB" for mashingpumpkins.

lgautier commented 7 years ago

(if you use the .gz mashingpumpkins slows down by ~2s but sourmash doesn't. Noise?)

I am using Python's gzip.open for gzip-compressed files, and open when there is no compression. Either noise, or there is room for tuning what I am doing (the fraction of time spent reading/parsing is likely higher with that demo script than with sourmash - more on this below).

With ncpu=2 it seems mashingpumpkins only really uses one CPU? There are two processes but one of them seems to use hardly any CPU time. Does that look right to you?

The sketches have a pretty interesting property for map-reduce type of parallelizations:

minhash(seq_a + seq_b)  =  minhash(seq_a) + minhash(seq_b)
^-- reduce task            ^-- map task     ^-- map task

ncpu-1 child processes are spawned to handle the map tasks, while the main/parent process reads/parses the input files (FASTA or FASTQ), feeds child processes chunks of parsed data as they become free, and "reduces" the minhash sketches returned by the child tasks. Depending on the load represented by each task (parse, compute minhash, merge minhashes), the system can appear to use less than the total number of CPUs/cores allocated. Here I think parsing is the bottleneck and the child processes spend a fraction of their time waiting. Future directions would be to 1) optimize parsing, and, if significant improvements can be reached, 2) design a more elaborate parallelization architecture where the reduce step is moved to its own process.
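The mergeability property is easy to demonstrate on a toy bottom sketch (here over single characters, so chunking introduces no boundary k-mers; real code must also handle k-mers spanning chunk boundaries):

```python
def bottom_sketch(seq, k, maxsize):
    """Toy bottom sketch: the maxsize smallest distinct hash values
    over all k-mers of seq (Python's hash() stands in for MurmurHash)."""
    hashes = {hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:maxsize]

def merge(s1, s2, maxsize):
    """Reduce step: keep the maxsize smallest values of the union."""
    return sorted(set(s1) | set(s2))[:maxsize]
```

Because the bottom-k of a union only ever draws from the bottom-k of each part, the map tasks can sketch chunks independently and the reduce step recovers the sketch of the whole input.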

Another observation is that sourmash uses more memory (~1.2gig) compared to "hundreds of MB" for mashingpumpkins.

I was not particularly aggressive about memory usage, so this comes as good news (I do pay attention to avoiding unnecessary copies in a few places, but that is for speed reasons). On the other hand, I am not surprised by the generally good memory profile of the approach, because of past work (about 4 years ago) where we demoed that streaming through FASTQ files to select k-mers or reads could be done on an Android tablet (with a server identifying the genomic content - that is also why I am using the Docker container defined at https://bitbucket.org/lgautier/dnasnout-client).

lgautier commented 7 years ago

Future direction would be to 1) optimize parsing alone, and if significant improvements can be reached 2) design a more elaborate parallelization architecture where the reduce step is moved to its own process.

I measured the throughput of the parsing alone.

On a gzip-compressed test FASTQ of size ~700MB I get:

While building the minhash is measurably faster with mashing-pumpkins than with sourmash_lib in a variety of use cases, parsing appears to be a significant limiting factor, and the speedup observed on the command line might owe a lot to the different parser used.

lgautier commented 7 years ago

The parser is definitely a bottleneck. I pushed performance a little further by putting together a faster FASTQ parser (https://github.com/lgautier/fastq-and-furious). Building a bottom sketch (1,000 hash values, k=31) for a FASTQ of size ~700MB now takes about 1'30" (4 to 6 times faster than current sourmash master).
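For reference, the inner loop of a minimal FASTQ parser looks like this (assuming strict 4-line records with no wrapped sequences; fastq-and-furious does considerably more than this sketch):

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of
    FASTQ lines, assuming well-formed 4-line records."""
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip()
        next(it)                      # the '+' separator line
        qual = next(it).rstrip()
        yield header.rstrip()[1:], seq, qual
```

Most of the real-world cost is not in this logic but in buffering, decompression, and avoiding per-record string copies, which is where the optimized parsers win.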

The benchmark code was updated: https://gist.github.com/lgautier/fb9e18636cd1a497ecc4fad8c440aaec

betatim commented 7 years ago

To copy what mashingpumpkins does we need a hashtable implementation for C++ that offers O(1) lookup. Google's densehash, or a simple homemade hashtable using open addressing and quadratic probing, would do the job. One thing to figure out for our own implementation of a hashmap is what value to use as the empty value and the tombstone value. Two ideas:

Not sure I like either right now.
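In Python the empty/tombstone question can be dodged with sentinel objects, whereas in C++ actual values must be reserved (or entries wrapped). A toy open-addressing set with triangular (quadratic-style) probing, for illustration only (fixed capacity, no resizing):

```python
EMPTY = object()      # sentinel: slot never used
TOMBSTONE = object()  # sentinel: slot deleted, probing must continue past it

class OpenAddressSet:
    """Toy open-addressing set; capacity kept a power of two so the
    triangular probe sequence (offsets 1, 3, 6, 10, ...) hits every slot."""

    def __init__(self, capacity=64):
        self.slots = [EMPTY] * capacity

    def _slot(self, key, insert=False):
        n = len(self.slots)
        i, step, first_tomb = hash(key) % n, 1, None
        while True:
            s = self.slots[i]
            if s is EMPTY:
                # on insert, reuse the first tombstone seen along the probe path
                return first_tomb if (insert and first_tomb is not None) else i
            if s is TOMBSTONE:
                if first_tomb is None:
                    first_tomb = i
            elif s == key:
                return i
            i = (i + step) % n
            step += 1

    def add(self, key):
        self.slots[self._slot(key, insert=True)] = key

    def __contains__(self, key):
        return self.slots[self._slot(key)] == key

    def discard(self, key):
        i = self._slot(key)
        if self.slots[i] == key:
            self.slots[i] = TOMBSTONE
```

A C++ version (like densehash) cannot use out-of-band sentinels for plain integer keys, which is exactly why the empty/tombstone values have to be chosen up front.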

betatim commented 7 years ago

The FASTA/Q parser used in sourmash comes from https://github.com/dib-lab/screed/blob/master/screed/fasta.py ; speeding that up would also help khmer, as this is shared code.

ctb commented 7 years ago

It's true!

A few notes about screed:

lgautier commented 7 years ago

@betatim

To copy what mashingpumpkins does we need a hashtable implementation for c++ that offers O(1) lookup.(...)

Not sure I like either right now.

mashing-pumpkins is a library (with a permissive license), and the code in mashingpumpkins.demo.cmdline shows one way to use it as a drop-in replacement to build sourmash JSON signatures. Maybe a better transition strategy?

Besides that, its speed is believed to be just a nice aside*. Most of the effort went into a design that achieves flexibility when working with bottom sketches (see https://github.com/marbl/Mash/issues/45#issuecomment-274665746 - I could implement an alternative way to build sketches for double-stranded genomes, and compare performances, in very little code).

( * : The design allows incremental speed-up by porting functions to C/C++/Cython )

ctb commented 7 years ago

mashing-pumpkins is a library (with a permissive license) and the code in mashingpumpkins.demo.cmdline shows one way to use it as drop-in replacement to build sourmash JSON signature. May be a better transition strategy ?

While fine in theory, adding dependencies to projects is always a difficult question. Each dependency adds complexity with a lot of potential unpredictability, and this is especially true when adding young projects as dependencies. So if I (we?) seem hesitant that's why :).

Simply including the code in the sourmash repo is an alternative that we should also consider.

lgautier commented 7 years ago

@ctb

A few notes about screed:

  • supports streaming I/O
  • supports autodetection of FASTA, FASTQ, gz, and bz2 formats, in streaming mode as well

If going for a little refactoring, getting the autodetection bit out into its own functions would be awesome. Streaming is chain-able, and this would mean two functions:

  • detect the compression scheme
  • detect the format

lgautier commented 7 years ago

While fine in theory, adding dependencies to projects is always a difficult question. Each dependency adds complexity with a lot of potential unpredictability, and this is especially true when adding young projects as dependencies. So if I (we?) seem hesitant that's why :).

That code is definitely young (about 3 weeks old), but short and specialized. It is a low-level library for building top and bottom sketches (MinHash & MaxHash) and derived structures (e.g., CountMinHash) and using them (Jaccard index, Bray-Curtis for count sketches, etc.), and it is relatively well tested (coverage well over 90%).

edit: The code is also Python >= 3.5 only. Possibly a limiting factor if the user base is stuck with Python 2 (although hard to believe in an age where there is Docker), but not radical (check the projects dropping support for Python 2 - http://www.python3statement.org/), and it leads to both a simpler code base and better performance.

Simply including the code in the sourmash repo is an alternative that we should also consider.

Just saying that before embarking on a move to C++ just to copy something existing (with the likely technical debt that monolithic designs represent in research projects), one could get a sense of whether the immediate possible pain point (the building of sketches) can be relieved. I hammered out a release for the very purpose of allowing that to happen, and the license was chosen so as to allow inclusion (the way you already include code for MurmurHash3).

lgautier commented 7 years ago

@ctb , @betatim

I cleaned up the code a bit and documented the parsing proof-of-concept I wrote for the latest (fastest-running) iteration of the other proof-of-concept (mashing-pumpkins). Modularity is again part of the design, and you could get an instant speedup in current sourmash with an adapter for screed (if my benchmarks for the minhash build are about accurate, you should get a 1.5x speedup with 3-4 lines of code for the adapter). All this is here: https://github.com/lgautier/fastq-and-furious

ctb commented 7 years ago

On Mon, Jan 30, 2017 at 06:56:29AM -0800, Laurent Gautier wrote:

@ctb

A few notes about screed:

  • supports streaming I/O
  • supports autodetection of FASTA, FASTQ, gz, and bz2 formats, in streaming mode as well

If going for refactoring things a little, getting the autodetection bit out in its own functions would be awesome. Streaming is a chain-able and this would mean two functions:

  • detect the compression scheme
  • detect the format

good idea! not currently a priority (that's the 1.0 release) but we will put it on the list.

lgautier commented 7 years ago

One last thing before I have to jump out of this thread. @peterjc pointed out that Biopython can parse through FASTQ rather fast (see https://github.com/lgautier/fastq-and-furious/issues/2)... faster than most parsers I tried. fastq-and-furious is a proof-of-concept, so you might want to consider Biopython if a long-established codebase is preferred.

ctb commented 1 year ago

good thread on apache arrow - https://twitter.com/fmic_/status/1570817872477261827 - we should look into conbench - https://github.com/conbench/conbench