hoffmangroup / genomedata

The Genomedata format for storing large-scale functional genomics data.
https://genomedata.hoffmanlab.org/
GNU General Public License v2.0
2 stars 1 forks source link

Change genomedata-load-data from binary to Python with c-extension #57

Closed EricR86 closed 3 years ago

EricR86 commented 3 years ago

This PR replaces genomedata-load-data, which was previously a built c binary through distutils to a python script that calls an entry point into a python extension implemented a c shared library.

This has some immediate benefits:

  1. Development can be done in edit mode now. pip install -e . works and enables debugging on the extension as well.
  2. Removal of all distutils undocumented extensions and hooks to get the binary building through a pip install
  3. Distribution can be done by using wheels which should simplify and speed up installation.
  4. Net removal of code.

Some important notes:

  1. Python 2 support might be tenuous due to usage of the C Python API.
  2. The interface to genomedata-load-data is slightly different in formatting and help text. It also currently does not have -? as an option for --help while the previous binary version did.
  3. The extension must be compiled currently with the additional -UNDEBUG to keep current assert macros from compiling out since they wrap function calls with side effects that also have return values.
  4. This currently is implemented as a drop-in replacement for the previous genomedata-load-data and as such the c-extension function call reads directly from stdin and was marked with an underscore to notify that it's generally not a good python function to call.
  5. Other systems might be able to be supported. Currently there is no way to test this but as far as I can tell the code looks to be mostly platform agnostic but there's no testing capacity. Notably, the gnulib library has not been fully removed though it seems optional to get the extension building on Linux-based systems.

Profiling

Python C extension

$ time ./create_H3K4me1_archive.sh 
+ CELL=DOHH2
+ MARKS=(H3K4me1)
+ ACCESSIONS=(ENCFF509XSM)
+ for i in ${!MARKS[@]}
+ ACCESSION=ENCFF509XSM
+ MARK=H3K4me1
+ genomedata-load --assembly --sequence 'chr*.agp.gz' --assembly-report GCA_000001405.28_GRCh38.p13_assembly_report.txt --track H3K4me1=ENCFF509XSM_DOHH2_H3K4me1.bigWig --maskfile k36_umap_multiread_filtered.bed ENCFF509XSM_DOHH2_H3K4me1_test.genomedata

real    13m8.923s
user    14m0.478s
sys     0m54.084s

C Binary

$ time ./create_H3K4me1_archive.sh 
+ CELL=DOHH2
+ MARKS=(H3K4me1)
+ ACCESSIONS=(ENCFF509XSM)
+ for i in ${!MARKS[@]}
+ ACCESSION=ENCFF509XSM
+ MARK=H3K4me1
+ genomedata-load --assembly --sequence 'chr*.agp.gz' --assembly-report GCA_000001405.28_GRCh38.p13_assembly_report.txt --track H3K4me1=ENCFF509XSM_DOHH2_H3K4me1.bigWig --maskfile k36_umap_multiread_filtered.bed ENCFF509XSM_DOHH2_H3K4me1_test.genomedata

real    13m58.493s
user    14m51.622s
sys     1m3.137s

There does not seem to be a significant change in performance from a cursory profiling test.

EricR86 commented 3 years ago

The Genomedata Docker testing/build image needs to be updated to include the Python development packages.

michaelmhoffman commented 3 years ago

That's awesome.

The extension must be compiled currently with the additional -UNDEBUG to keep current assert macros from compiling out since they wrap function calls with side effects that also have return values.

I think using assert() for this purpose was a mistake (my mistake, obviously). See #47.

Instead of adding -UNDEBUG can you #undef NDEBUG at the top of the C file? I think that will be safer and be more local.

EricR86 commented 3 years ago

It is possible to get this simple python extension working in both Python 2 and 3 but as it is it does not work. https://docs.python.org/2/howto/cporting.html?highlight=pymoduledef

I've disabled the Python 2 build, for now, to verify everything else works.

After responses to feedback and a decision on Python version portability I will clean up.

michaelmhoffman commented 3 years ago

Python 2 has been EOL for 15 months. Avoiding this extra work is a good enough reason to toss Python 2 compatibility aside at this point.

Done with my replies.

EricR86 commented 3 years ago

I will remove Python 2 from the Genomedata docker build image and update the regression test environment accordingly.

EricR86 commented 3 years ago

This PR is now ready for review @michaelmhoffman

Of note, I've bumped the minimum Python 3 version to 3.7 as it is currently the minimum supported version for the current Numpy.

The Python 3 C API has been stable from 3.2 onwards and the previous minimum version we supported was 3.4.