jamescasbon / PyVCF

A Variant Call Format reader for Python.
http://pyvcf.readthedocs.org/en/latest/index.html

Cythonize parser #25

Closed · arq5x closed this issue 12 years ago

arq5x commented 12 years ago

As expected, looping through the samples in _parse_samples is rather slow for files with many samples (e.g., [1]). The main bottlenecks seem to be Python's split() and the general slowness of looping in pure Python.

One potential solution would be to port the slowest looping bits to Cython. Another would be to work with BCF.

Thoughts? I think dealing with deep and wide files will be crucial for the future.

[1] ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/Omni25_genotypes_2141_samples.b37.vcf.gz
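For context, the hot loop is roughly of this shape (a simplified sketch, not PyVCF's actual implementation):

    # Simplified sketch of the per-sample loop under discussion
    # (illustrative only, not PyVCF's actual code).
    def parse_samples(sample_columns, format_keys):
        samples = []
        for col in sample_columns:       # one iteration per sample: slow in pure Python
            values = col.split(':')      # '0/1:99:25' -> ['0/1', '99', '25']
            samples.append(dict(zip(format_keys, values)))
        return samples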

martijnvermaat commented 12 years ago

Agreed, this is important. But it should be kept in mind that once you go the Cython route, you can't run on PyPy anymore (if I'm not mistaken). Same goes for a hard dependency on pysam.

jamescasbon commented 12 years ago

Yes, I think this is a major concern. I haven't seen wide support for BCF yet, but it may arrive.

I suppose there are several ways to proceed:

[1] http://morepypy.blogspot.com/2011/08/pypy-is-faster-than-c-again-string.html
[2] http://vcftools.sourceforge.net/
[3] https://github.com/ekg/vcflib
[4] http://code.google.com/p/pysam/source/browse/pysam/cvcf.pyx


brentp commented 12 years ago

See also #24: we could make sample parsing lazy, so that samples aren't parsed unless needed. I'm wary of using Cython for this, since right now PyVCF has no dependencies...
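Roughly this kind of thing (a sketch, names hypothetical):

    class LazyRecord:
        """Sketch of lazy sample parsing: hold the raw text, parse on demand."""
        def __init__(self, raw_sample_text, parse_func):
            self._raw = raw_sample_text
            self._parse = parse_func
            self._samples = None

        @property
        def samples(self):
            if self._samples is None:       # parse only on first access
                self._samples = self._parse(self._raw)
            return self._samples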

jamescasbon commented 12 years ago

Just want to point out that the most efficient optimisation is to use a pysam-based fetch to find only the rows of interest.

Also, a few shortcuts would probably speed up the splits. At the moment we have to split on tab to separate samples, on colon to separate fields, and on comma to separate values. Most values do not need the final split.
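Both ideas, sketched (the fetch call assumes a bgzipped, tabix-indexed file, and the exact signature here is from memory):

    import vcf  # PyVCF

    # 1) pysam-backed region query: parse only the rows of interest.
    reader = vcf.Reader(filename='example.vcf.gz')
    for record in reader.fetch('20', 1110695, 1230237):
        pass

    # 2) Shortcut: skip the final comma split when there is no comma,
    #    since most values are scalar.
    def split_value(field):
        return field.split(',') if ',' in field else field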

jamescasbon commented 12 years ago

Just committed a change that halves the number of splits; a minor improvement.

However, there is little point optimising _parse_samples itself, since >50% of the time is spent in _parse_sample.

before:

Fri Feb 17 11:10:58 2012    1kg.prof

         2701745 function calls (2701621 primitive calls) in 7.593 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   239649    4.116    0.000    6.428    0.000 parser.py:518(_parse_sample)
   668478    1.567    0.000    1.567    0.000 parser.py:443(_map)
  1283830    0.776    0.000    0.776    0.000 {method 'split' of 'str' objects}
      381    0.576    0.002    7.439    0.020 parser.py:490(_parse_samples)
   239649    0.397    0.000    0.397    0.000 parser.py:114(__init__)

after:

Fri Feb 17 11:32:43 2012    1kg.prof

         1685560 function calls (1685436 primitive calls) in 6.986 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   239649    4.284    0.000    5.848    0.000 parser.py:518(_parse_sample)
   346906    1.133    0.000    1.133    0.000 parser.py:443(_map)
      381    0.533    0.001    6.831    0.018 parser.py:490(_parse_samples)
   589217    0.463    0.000    0.463    0.000 {method 'split' of 'str' objects}
   239649    0.410    0.000    0.410    0.000 parser.py:114(__init__)
   247269    0.039    0.000    0.039    0.000 {method 'append' of 'list' objects}
      654    0.025    0.000    0.025    0.000 {built-in method decompress}
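(For reference, profiles in this format can be produced with the standard library; the file name and the parse function below are stand-ins.)

    import cProfile
    import pstats

    def run_1kg_parse():
        """Stand-in for the actual parse loop being profiled."""
        import vcf
        for _ in vcf.Reader(filename='1kg.vcf.gz'):
            pass

    cProfile.run('run_1kg_parse()', '1kg.prof')

    # print the top entries ordered by internal time, as in the tables above
    pstats.Stats('1kg.prof').sort_stats('time').print_stats(5)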
jamescasbon commented 12 years ago

I ran 'test/prof.py time' using PyPy and Python 2.7; it parses the 1kg test file.

PyPy runs in 1.8 seconds, CPython in 3.0 seconds. 40% faster is not bad!

arq5x commented 12 years ago

Just an FYI: I am playing around with the following two options.

[3] https://github.com/ekg/vcflib
[4] http://code.google.com/p/pysam/source/browse/pysam/cvcf.pyx

However, my (perhaps naive) understanding is that linking to C++ libraries with ctypes is a bit murky. Is that untrue?

brentp commented 12 years ago

If you decide to go that route, why not use Cython?

arq5x commented 12 years ago

Yeah, that's what I was getting at. My sense at this point is that the two best options are to:

  1. Work together to port vcflib with Cython (I have a local fork of the C++ lib that includes many of the methods I added to PyVCF).
  2. Use pysam's VCF class, which is already written in Cython. I don't quite grok the API yet, though.
jamescasbon commented 12 years ago

An update: @arq5x has cythonized the parser https://github.com/arq5x/cyvcf/blob/master/cyvcf/parser.pyx

I would propose we include this in the next release. If any watchers want to speak up for pure Python (i.e., for not including the cythonized parser), speak now.

peterjc commented 12 years ago

Is having a Cython (or other low-level) module plus the current code as a pure Python fallback viable? That way you can continue to support Jython, PyPy, etc.
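The usual pattern would be a guarded import, something like this sketch (module names are illustrative):

    try:
        # compiled Cython implementation, if the extension was built
        from vcf import cparse as _impl
    except ImportError:
        # pure Python fallback (keeps Jython/PyPy working)
        from vcf import parser as _impl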

jamescasbon commented 12 years ago

@peterjc I think the best route for this would be PyPy's support for Cython: http://mail.python.org/pipermail/cython-devel/2012-February/001930.html

Otherwise, there is a lot of duplication of effort.

peterjc commented 12 years ago

I'm not sure how stable PyPy + Cython is yet, but I certainly hope it will work before too long.

Do you care about Python implementations other than CPython and PyPy? E.g. Jython, IronPython, ...

jamescasbon commented 12 years ago

@peterjc Jython is still on Python 2.5, so I think it is already unsupported. Does anyone actually use IronPython? I've never seen it in the wild.

peterjc commented 12 years ago

Right now I would say PyPy support is more important than Jython's, which in turn is more important than IronPython's (I don't know anyone using it).

arq5x commented 12 years ago

I'm glad to see you are interested in bringing in the Cython version. Is supporting PyPy a high priority? Speaking from an admittedly naive standpoint, my sense is that few people use it. Am I wrong?

Also, Python packaging still baffles/annoys me. Adding Cython as an installation dependency makes things more complex. What is the best way to package things so that only a C compiler (e.g. gcc) is needed, assuming the generated C file (.pyx -> .c) is present in the repo?

Are there better ways to handle this?

jamescasbon commented 12 years ago

@arq5x I agree that PyPy is not currently important; no one really uses it in anger today (I think). But I think it will be important in the medium term.

For the packaging, check pybedtools' setup.py, which uses Cython if available and otherwise builds the extension from the pre-generated C with gcc: https://github.com/daler/pybedtools/blob/master/setup.py
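The pattern there is roughly the following sketch (module and file paths are hypothetical): regenerate the C with Cython when it is installed, otherwise compile the pre-generated C file shipped in the repo.

    from setuptools import setup, Extension

    try:
        from Cython.Build import cythonize
        extensions = cythonize([Extension('vcf.cparse', ['vcf/cparse.pyx'])])
    except ImportError:
        # no Cython installed: build from the pre-generated C source
        extensions = [Extension('vcf.cparse', ['vcf/cparse.c'])]

    setup(
        name='PyVCF',
        ext_modules=extensions,
    )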

arq5x commented 12 years ago

Thanks. Yeah, I actually based cyvcf's setup.py on pybedtools for exactly this reason, but I was unable to get it to behave as expected; I was likely doing something very noob.

Having more expert eyes on that would be welcome.

By the way, the library is really fast now and getting pretty mature. I encourage you to test it out on some VCFs from 1000G. It is close to vcflib and plinkseq in speed, making it much more useful for constructing analysis scripts on large datasets.

jamescasbon commented 12 years ago

Hi @arq5x, I still want to do this, but have scheduled it for 0.6.0, as I wanted to get the 4.1 support out first.

arq5x commented 12 years ago

A rational plan. Let me know if I can assist in any way.

jamescasbon commented 12 years ago

I played around with Cython and got this working: https://github.com/jamescasbon/PyVCF/commit/8ba9bed5ae7664f6e95347b11829d68928fa3a2e

It only cythonizes two functions, around 100 LOC in total. It's ~30% faster than the cyvcf branch (but does less: no aggregate stats).

It's possible to maintain parallel Python and Cython versions of 100 LOC, so this could be a way forward.

We'd need to add @arq5x's aggregation stuff. I can see this makes sense, since you can collect aggregate stats during the parse rather than re-iterating over the whole sample list at each site.

jamescasbon commented 12 years ago

Just added an optional Cython branch. My 1kg profile takes 3.4s, versus 4.7s for cyvcf.

Will probably need to add some logic to setup.py so that it still builds under PyPy, and to ensure that the pure Python code path is tested as well (add it to tox and travis).

jamescasbon commented 12 years ago

PS: that means a code branch, not a git branch.

arq5x commented 12 years ago

Nice. Is there any intent to precompute num_hets, num_hom_ref, etc., as well as lists of gt_types, gt_phases, etc., or do you want to keep these attrs as they are currently, which requires a second pass through all of the genotype info?

One idea would be to have a parameter on the Reader that dictates whether these attributes should be pre-computed or not. Thoughts? I just struggle with the fact that large lists of genotypes need to be looped through at least twice to compute useful metrics. For example, if I have a file with 1000 genotypes and ask for num_hets, num_hom_ref, num_hom_alt, and num_unknown, the genotypes are looped through 5 different times: once during parsing, plus once per attribute.
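A single pass can produce all four counts at once; a sketch, assuming the 0/1/2/None gt_type encoding:

    from collections import Counter

    def genotype_counts(calls):
        """One loop over the calls instead of one loop per statistic."""
        counts = Counter(call.gt_type for call in calls)
        # assumed encoding: 0 = hom_ref, 1 = het, 2 = hom_alt, None = unknown
        return counts[0], counts[1], counts[2], counts[None]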

jamescasbon commented 12 years ago


Absolutely, I have opened #60 for this.

I imagine having some class, CallStats or something, that we populate at parse time and attach to the _Record object. That way it is explicit that these attributes will not cause us to iterate over the calls again.
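A minimal sketch of that idea (the exact shape is hypothetical; #60 tracks the real design):

    from collections import namedtuple

    # hypothetical container, populated once during the parse pass
    CallStats = namedtuple('CallStats',
                           ['num_hom_ref', 'num_het', 'num_hom_alt', 'num_unknown'])

    # at parse time, after a single counting pass over the calls:
    # record.call_stats = CallStats(counts[0], counts[1], counts[2], counts[None])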

jamescasbon commented 12 years ago

Fixed. And, for the record, PyPy is still a bit faster than the cythonized version.

jamescasbon commented 12 years ago

FWIW, I replaced the dict used to store the call data with a namedtuple, and now the pure Python version is as fast as cyvcf.

You could port this change, or start using HEAD, which now has Cython support.
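In sketch form (the field names are just example FORMAT keys): build the namedtuple type once per FORMAT string, then make each call's data an instance of it, so data['GT'] becomes data.GT.

    from collections import namedtuple

    # build the tuple type once per FORMAT definition, not once per call
    CallData = namedtuple('CallData', ['GT', 'GQ', 'DP'])

    data = CallData(GT='0/1', GQ=48, DP=8)
    data.GT   # attribute access replaces the old dict lookup data['GT']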

arq5x commented 12 years ago

Fantastic, thanks for the heads up.

brentp commented 12 years ago

@jamescasbon, nice! Is the getattr as fast as the previous []? I never use namedtuple.