Closed arq5x closed 12 years ago
Agreed, this is important. But it should be kept in mind that once you go the Cython route, you can't run on PyPy anymore (if I'm not mistaken). Same goes for a hard dependency on pysam.
Yes, I think this is a major concern. I haven't seen wide support for BCF yet, but it may arrive.
I suppose there are several ways to proceed:
[1] http://morepypy.blogspot.com/2011/08/pypy-is-faster-than-c-again-string.html
[2] http://vcftools.sourceforge.net/
[3] https://github.com/ekg/vcflib
[4] http://code.google.com/p/pysam/source/browse/pysam/cvcf.pyx
On Tue, Feb 14, 2012 at 5:32 PM, Aaron Quinlan reply@reply.github.com wrote:
As expected, looping through the samples in _parse_samples is rather slow for files with many samples (e.g., [1]). It seems that the main bottlenecks are Python's split() and the fact that looping in Python is known to be dreadfully slow.
One potential solution would be to port the slowest looping bits to Cython. Another would be to work with BCF.
Thoughts? I think dealing with deep and wide files will be crucial for the future.
[1] ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/Omni25_genotypes_2141_samples.b37.vcf.gz
Reply to this email directly or view it on GitHub: https://github.com/jamescasbon/PyVCF/issues/25
James Casbon
See also #24: we could make the sample parsing lazy so that it's not parsed unless needed. I'm wary of using Cython for this, since PyVCF currently has no dependencies...
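A sketch of what the lazy approach could look like (the class and function names here are hypothetical, not PyVCF's actual implementation): keep the raw genotype columns as unsplit strings and only pay the parsing cost when the samples are first accessed.

```python
# Hypothetical sketch of lazy sample parsing. The names (LazyRecord,
# parse_gt) are illustrative stand-ins for PyVCF's _Record/_parse_sample.

class LazyRecord(object):
    def __init__(self, fixed_fields, raw_sample_cols, parse_func):
        self.fixed = fixed_fields       # CHROM, POS, REF, ALT, ...
        self._raw = raw_sample_cols     # unsplit per-sample strings
        self._parse = parse_func
        self._samples = None

    @property
    def samples(self):
        # Parsing happens only on first access, then is cached.
        if self._samples is None:
            self._samples = [self._parse(s) for s in self._raw]
        return self._samples


def parse_gt(raw):
    # Toy stand-in for _parse_sample: split "0/1:12:99" on ':'.
    return raw.split(':')


rec = LazyRecord({'CHROM': '1', 'POS': 100}, ['0/1:12:99', '0/0:8:80'], parse_gt)
# No sample parsing has happened yet; it occurs on this access:
print(rec.samples[0][0])  # -> 0/1
```

Records that are filtered out on fixed fields alone would then never pay the per-sample split cost at all.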
Just want to point out that the most efficient optimisation is to use a pysam-based fetch to find only the rows of interest.
Also, a few shortcuts would probably speed up the splits. At the moment we have to split on tab to separate samples, on colon to separate fields, and on comma to separate values. Most values do not need the final split.
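The comma shortcut described above could look roughly like this (an illustrative sketch, not the committed code): split eagerly on colon, but only perform the comma split when a list value is actually present.

```python
# Sketch of the split shortcut: most FORMAT values are scalar, so the
# comma split is deferred until a ',' is actually seen in the value.
# parse_sample_fields is a hypothetical helper, not PyVCF's API.

def parse_sample_fields(sample_col, format_keys):
    fields = sample_col.split(':')
    out = {}
    for key, val in zip(format_keys, fields):
        # Pay for the final split only when a list is present.
        out[key] = val.split(',') if ',' in val else val
    return out

row = "0/1:35:99:12,23"
print(parse_sample_fields(row, ['GT', 'DP', 'GQ', 'AD']))
# {'GT': '0/1', 'DP': '35', 'GQ': '99', 'AD': ['12', '23']}
```

Only the AD field triggers the comma split here; GT, DP, and GQ skip it entirely.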
Just committed a halving in the number of splits, minor improvement.
However, there is no point improving _parse_samples since >50% of the time is spent in _parse_sample.
before:
Fri Feb 17 11:10:58 2012 1kg.prof
2701745 function calls (2701621 primitive calls) in 7.593 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
239649 4.116 0.000 6.428 0.000 parser.py:518(_parse_sample)
668478 1.567 0.000 1.567 0.000 parser.py:443(_map)
1283830 0.776 0.000 0.776 0.000 {method 'split' of 'str' objects}
381 0.576 0.002 7.439 0.020 parser.py:490(_parse_samples)
239649 0.397 0.000 0.397 0.000 parser.py:114(__init__)
after:
Fri Feb 17 11:32:43 2012 1kg.prof
1685560 function calls (1685436 primitive calls) in 6.986 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
239649 4.284 0.000 5.848 0.000 parser.py:518(_parse_sample)
346906 1.133 0.000 1.133 0.000 parser.py:443(_map)
381 0.533 0.001 6.831 0.018 parser.py:490(_parse_samples)
589217 0.463 0.000 0.463 0.000 {method 'split' of 'str' objects}
239649 0.410 0.000 0.410 0.000 parser.py:114(__init__)
247269 0.039 0.000 0.039 0.000 {method 'append' of 'list' objects}
654 0.025 0.000 0.025 0.000 {built-in method decompress}
I ran 'test/prof.py time' using pypy and python 2.7. It parses the 1kg test file.
pypy runs in 1.8 seconds, cpython in 3.0 seconds. 40% faster is not bad!
Just an FYI: I am playing around with the following two options.
[3] https://github.com/ekg/vcflib
[4] http://code.google.com/p/pysam/source/browse/pysam/cvcf.pyx
However, my (perhaps naive) understanding is that linking to C++ libraries with ctypes is a bit murky. Untrue?
if you decide to go that route, why not use cython?
Yeah, that's what I was getting at. My sense at this point is that the two best options are to:
An update: @arq5x has cythonized the parser https://github.com/arq5x/cyvcf/blob/master/cyvcf/parser.pyx
I would propose we include this in the next release. If any watchers want to speak up for pure python (i.e. not including the cythonized parser), speak now.
Is having a Cython (or other low level) module plus the current code as a pure Python fall back viable? That way you can continue to support Jython and PyPy etc.
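The fallback arrangement @peterjc describes is commonly wired up with a guarded import (module names below are hypothetical): try the compiled extension first, and silently fall back to the pure-Python implementation so PyPy and Jython keep working.

```python
# Common compiled-with-pure-Python-fallback pattern. The module name
# pyvcf._cparse is hypothetical; on interpreters without the compiled
# extension the ImportError branch supplies the pure-Python version.
try:
    from pyvcf._cparse import parse_sample   # compiled Cython version
except ImportError:
    def parse_sample(raw):                   # pure-Python fallback
        return raw.split(':')

print(parse_sample('0/1:99'))  # -> ['0/1', '99']
```

Callers import `parse_sample` from one place and never need to know which implementation they got.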
@peterjc I think the best route for this would be pypy's support of cython http://mail.python.org/pipermail/cython-devel/2012-February/001930.html
Otherwise, there is a lot of duplication of effort.
I'm not sure how stable PyPy + Cython is yet, but I hope it will work before too long.
Do you care about Python implementations other than C Python and PyPy? e.g. Jython, IronPython, ...
@peterjc Jython is still on Python 2.5, so I think it is already unsupported. Does anyone actually use IronPython? I've never seen it alive.
Right now I would say PyPy support was more important than Jython, which in turn is more important than IronPython (I don't know anyone using it).
I'm glad to see you are interested in bringing in the Cython version. Is supporting PyPy a high priority? Speaking from an admittedly naive standpoint, my sense is that few people use it. Am I wrong?
Also, Python packaging still baffles/annoys me. Adding Cython as an installation dependency makes things more complex. What is the best way to package things such that only a C compiler (e.g. GCC) is needed, assuming the resulting C file (.pyx -> .c) is present in the repo?
Are there better ways to handle this?
@arq5x I agree that pypy is not currently important, no-one really uses it in anger today (I think). I think that it will be important in the medium term.
For the packaging, check pybedtools setup.py which uses cython, if available, otherwise builds the extension with gcc https://github.com/daler/pybedtools/blob/master/setup.py
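The core of the pybedtools-style setup.py logic can be sketched as follows (file names are hypothetical): use Cython to regenerate the C when it is installed, otherwise compile the pre-generated .c shipped in the repo, so end users only need a C compiler.

```python
# Sketch of the "Cython if available, else shipped .c" build selection
# used by pybedtools-style setup.py files. pick_build is a hypothetical
# helper; the Extension wiring is shown in comments.

def pick_build(module='cyvcf/parser'):
    try:
        from Cython.Build import cythonize  # noqa: F401
        return module + '.pyx', True        # regenerate C from Cython
    except ImportError:
        return module + '.c', False         # compile shipped C with gcc

source, have_cython = pick_build()
print(source)

# In setup.py this would feed Extension(), roughly:
# from setuptools import setup, Extension
# ext = Extension('cyvcf.parser', [source])
# if have_cython:
#     from Cython.Build import cythonize
#     ext = cythonize([ext])[0]
# setup(name='cyvcf', ext_modules=[ext])
```

The generated .c file is committed to the repo so that a plain `pip install` never needs Cython itself.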
Thanks - yeah, I actually based the setup.py in cyvcf on pybedtools for this reason, but was unable to get it to behave as expected; I was likely doing something very noob.
Having more expert eyes on that would be welcome.
By the way, the library is really fast now and getting pretty mature. I encourage you to test it out on some VCFs from 1000G. It is close in speed to vcflib and plinkseq, making it much more useful for constructing analysis scripts on large datasets.
Hi @arq5x, still want to do this, but I have scheduled it for 0.6.0 as I wanted to get the 4.1 support out first.
A rational plan. Let me know if I can assist in any way.
I played around with cython and got this working https://github.com/jamescasbon/PyVCF/commit/8ba9bed5ae7664f6e95347b11829d68928fa3a2e
It only cythonizes two functions at around 100 LOC. It's ~30% faster than the cyvcf branch (but does less, no aggregate stats).
It's possible to maintain a Python/Cython version of 100 LOC, so this could be a way forward.
We'd need to add @arq5x's aggregation stuff. I can see this makes sense, since you can collect aggregate stats during the parse rather than re-iterating over the whole sample list at each site.
Just added an optional cython branch. My 1kg profile takes 3.4s versus 4.7s for cyvcf.
Will probably need to add some logic to the setup.py so that it still builds under PyPy, and ensure that the pure Python code is tested as well - add to tox and travis.
PS that means code branch, not git branch.
Nice. Is there any intent to precompute num_hets, num_hom_ref, etc., as well as lists of gt_types, gt_phases, etc., or do you want to keep these attrs as they are currently, which requires a second pass through all of the genotype info?
One idea would be to have a parameter for the Reader that dictates whether these attributes should be pre-computed or not. Thoughts? I just struggle with the fact that large lists of genotypes need to be looped through at least twice to compute useful metrics. For example, if I have a file with 1000 genotypes and ask for num_hets, num_hom_ref, num_hom_alt, and num_unknown, the genotypes are looped through 5 different times.
Absolutely, I have opened #60 for this.
I imagine having some class, CallStats or something, that we populate at parse time and attach to the _Record object. That way it is explicit that these are not going to cause us to iterate over the calls again.
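A minimal sketch of that idea (class and field names are hypothetical; the real design lives in #60): accumulate the het/hom counts while the genotype columns are parsed, instead of re-looping over record.samples once per metric.

```python
# One-pass sketch of the CallStats idea: counts are updated as each
# genotype is parsed, so num_het etc. are free at the end of the row.
# The genotype classification here is deliberately simplified (diploid
# GT strings only, no multi-allelic handling).

class CallStats(object):
    def __init__(self):
        self.num_hom_ref = 0
        self.num_het = 0
        self.num_hom_alt = 0
        self.num_unknown = 0

    def add(self, gt):
        alleles = gt.replace('|', '/').split('/')
        if '.' in alleles:
            self.num_unknown += 1
        elif alleles[0] != alleles[1]:
            self.num_het += 1
        elif alleles[0] == '0':
            self.num_hom_ref += 1
        else:
            self.num_hom_alt += 1


stats = CallStats()
for gt in ['0/0', '0/1', '1|1', './.']:
    stats.add(gt)  # would happen inside _parse_sample
print(stats.num_het, stats.num_hom_alt)  # 1 1
```

With 1000 genotypes this turns the five separate passes described above into a single one.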
Fixed, and for the record pypy is still a bit faster than the cythonized version.
FWIW, I replaced the use of a dict to store the call data with a namedtuple and now the pure python version is as fast as cyvcf.
You could port this change, or start using HEAD, which now has cython support.
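The dict-to-namedtuple change is roughly this shape (the FORMAT fields below are just an example): namedtuples avoid a per-instance hash table, so they are cheaper to construct and lighter in memory, which matters when a row carries thousands of calls.

```python
# Sketch of storing call data in a namedtuple instead of a dict.
# The field names GT/DP/GQ are an example FORMAT, not PyVCF's fixed set.
from collections import namedtuple

CallData = namedtuple('CallData', ['GT', 'DP', 'GQ'])

call = CallData(GT='0/1', DP=35, GQ=99)
print(call.GT, call.DP)   # attribute access instead of call['GT']
```

Code that used `call['GT']` switches to `call.GT`; the namedtuple is also immutable, which suits parsed records.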
Fantastic, thanks for the heads up.
@jamescasbon, nice! is the getattr as fast as the previous [] ? I never use namedtuple.
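One quick way to answer that question is a timeit micro-benchmark of attribute access on a namedtuple against key access on an equivalent dict; the results vary by interpreter and version, so this only sketches how to measure rather than asserting a winner.

```python
# Micro-benchmark: namedtuple attribute access vs dict key access.
import timeit
from collections import namedtuple

Call = namedtuple('Call', ['GT', 'DP'])
nt = Call('0/1', 35)
d = {'GT': '0/1', 'DP': 35}

t_nt = timeit.timeit(lambda: nt.GT, number=200000)
t_d = timeit.timeit(lambda: d['GT'], number=200000)
print('namedtuple: %.4fs  dict: %.4fs' % (t_nt, t_d))
```

Even where raw access is comparable, the namedtuple's cheaper construction and smaller footprint can dominate when building hundreds of thousands of call objects per file.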