Rewrite genomedata-load-data in pure Python

EricR86 commented 8 years ago

Original report (archived issue) by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).

I wonder whether it would be useful to rewrite genomedata-load-data in pure Python using h5py. I assume h5py was not available or immature at the time the C implementation was written. I had a stab at this and I don't think it would be too bad.

Advantages

Better cross-platform support
Easier installation, no compiled scripts
Easier to add support for new formats and for corner cases in existing formats (e.g., weird track lines in WIG files)

Disadvantages

Loading would probably be slower, but this could be offset by parallelization (not currently implemented).

Has the loading step ever been profiled? What is the slowest step of loading? If it's reading raw data, the C implementation will likely be faster. If it's writing, I would imagine h5py is heavily optimized and would likely have performance on par with the C interface.
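One way to answer the profiling question for a Python prototype is to run the parsing step under `cProfile`. The following is only a minimal sketch: `parse_fixed_step` and `profile_parse` are hypothetical helpers written for illustration, not part of genomedata, and the parser handles nothing beyond one value per line in fixedStep-style data.

```python
import cProfile
import io
import pstats


def parse_fixed_step(lines):
    """Parse fixedStep-style data lines into a list of floats (hypothetical helper)."""
    values = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and declaration lines such as "track ..." or "fixedStep ..."
        if not line or line.startswith(("track", "fixedStep")):
            continue
        values.append(float(line))
    return values


def profile_parse(lines):
    """Run the parser under cProfile and return (values, report text)."""
    profiler = cProfile.Profile()
    values = profiler.runcall(parse_fixed_step, lines)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the five entries with the largest cumulative time
    stats.sort_stats("cumulative").print_stats(5)
    return values, buffer.getvalue()


if __name__ == "__main__":
    # Synthetic input; a real test would read an actual wiggle file
    synthetic = ["fixedStep chrom=chr1 start=1 step=1"] + ["0.5"] * 100_000
    parsed, report = profile_parse(synthetic)
    print(len(parsed))
    print(report)
```

This only profiles parsing. Reading from disk and writing to HDF5 would have to be wrapped in the same way to see which step dominates.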

EricR86 commented 8 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).

The original version of genomedata-load-data was in pure Python and I couldn't load a dense wiggle file annotating the human genome within a few days. I believe most of the time was spent creating string objects, converting them to floats, etc. This is stuff that is very wasteful in Python given all the objects that need to be created.
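The per-value cost described here can be estimated with a rough timing sketch. This is an assumption-laden illustration, not the original loader: `parse_values` and `seconds_per_value` are hypothetical names, the input is synthetic, and only text-to-float conversion is measured (no file I/O and no HDF5 writes).

```python
import timeit
from array import array


def parse_values(lines):
    """Convert one text value per line into a compact array of doubles."""
    out = array("d")
    for line in lines:
        # Each iteration handles one str object and creates one float object
        out.append(float(line))
    return out


def seconds_per_value(n_values=100_000, repeats=3):
    """Return the best observed conversion time per value, in seconds."""
    lines = ["%.3f\n" % (i * 0.001) for i in range(n_values)]
    best = min(timeit.repeat(lambda: parse_values(lines), number=1, repeat=repeats))
    return best / n_values


if __name__ == "__main__":
    per_value = seconds_per_value()
    # 3.1e9 is roughly the number of bases in the human genome
    hours = per_value * 3.1e9 / 3600
    print(f"{hours:.1f} hours (extrapolated) to convert 3.1e9 values")
```

The extrapolation is linear and ignores everything except conversion, so it gives a lower bound on what a naive pure-Python loader would spend on a dense whole-genome track.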

EricR86 commented 7 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).

I think it would be possible to re-implement the loader in Cython (using cdef type declarations) and have similar performance. I'll do some benchmarking to see if this is a viable route.

Would you take a small performance hit if it improved code portability? We could ditch the gnulib dependency and I hope it would fix the distutils issue that is preventing full py3 compatibility.

EricR86 commented 7 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).

I'd second an attempt at writing the loader in Cython, especially if there isn't a huge performance hit. It would be especially nice if it led to expanding the interface in the future to match the currently unused interface in _load_data.py (chunk size, possible filtering, etc.).

hoffmangroup / genomedata

Rewrite genomedata-load-data in pure Python #19
