related-sciences / gwas-analysis

GWAS data analysis experiments
Apache License 2.0

PyData prototype IO #23

Closed eric-czech closed 3 years ago

eric-czech commented 4 years ago

This issue will track details related to selection and integration of IO libraries for genetic file formats.

For a discussion on libraries in the Python ecosystem for this, see Python IO libraries.

Any solutions here should also consider what it means to support reads of file collections rather than individual files. This is likely not always as simple as concatenating results from a bunch of individual reads (e.g. what do we do if separate VCF files have different numbers of samples/columns?), so our framework should reflect this. This is analogous to how imread_collection is part of the scikit-image plugin interface.

Rechunking may be an additional concern if files are, for example, separated by chromosome and then stacked into a single dask array.
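
As a rough illustration of the "different numbers of samples" problem above, a collection reader could check sample alignment before stacking anything. This is only a sketch: `check_sample_alignment` and the file names are hypothetical, and it assumes the per-file sample lists have already been read from each header. It does not address rechunking.

```python
from typing import Dict, List, Sequence


def check_sample_alignment(samples_by_file: Dict[str, Sequence[str]]) -> List[str]:
    """Return the shared sample order, or raise if any file disagrees.

    `samples_by_file` maps a file path to the sample IDs found in its header.
    """
    items = list(samples_by_file.items())
    if not items:
        raise ValueError("no files given")
    first_path, first_samples = items[0]
    expected = list(first_samples)
    for path, samples in items[1:]:
        if list(samples) != expected:
            # Report what differs so the user can decide how to reconcile it.
            missing = sorted(set(expected) - set(samples))
            extra = sorted(set(samples) - set(expected))
            raise ValueError(
                f"{path} does not match {first_path}: missing={missing}, extra={extra}"
            )
    return expected


# Hypothetical per-chromosome files with identical sample columns.
shared = check_sample_alignment(
    {"chr1.vcf": ["s1", "s2"], "chr2.vcf": ["s1", "s2"]}
)
```

Failing loudly is only one possible policy; taking the intersection of samples or padding with missing values are alternatives a framework could expose as options.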

eric-czech commented 4 years ago

One consideration for how these should be integrated is how extensible any plugin framework for them is. Here are a few Python examples that span a spectrum of possibilities:

  1. Pandas (simplest plugin system)
    • IO functions like read_parquet (code), read_html, and read_excel all take "engine" or "flavor" arguments that result in use of a different backend package
    • Each type of file read has an abstract base class, and Pandas committers maintain one subclass of it per backend
    • Note: see here for a great example on how to import modules specific to a backend only when needed
    • Global options like io.parquet.engine make it possible to set defaults
    • This isn't really a "plugin" framework, but using importlib to load the module and work with a reference to it makes it similar to one
  2. Scikit-image (somewhat extensible plugins)
    • Plugin docs
    • Plugins allow for overriding the behavior of imread, imsave, imshow, etc. via a plugin argument on each of those functions
    • They are represented as a single module file and an accompanying .ini file
    • The .ini file describes what the plugin supports so this much can be known without needing to import the module
    • There isn't a framework used for the plugins; it's simply a convention for skimage committers to follow when adding new ones
    • The .ini descriptor files DO NOT actually maintain what dependencies the plugin has (so it kind of defeats the purpose of a more formal plugin framework IMO)
    • We may be able to use something like yapsy to avoid re-inventing this wheel
  3. Flask (very extensible plugins)
    • Flask allows for integration of CLI extensions through the setuptools entry_point system
    • For example, Flask uses the entry point flask.commands so that someone can build a PyPI package that is compatible with Flask's dependencies. When both are installed, the plugin functionality is automatically available in Flask
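
To make the entry point option (#3) concrete, here is a minimal sketch of how a host package can discover plugins registered under a group name using the standard library. The function name `discover_plugins` and the group `gwas_analysis.io` are hypothetical; only the `importlib.metadata` calls are real.

```python
import sys
from importlib import metadata


def discover_plugins(group: str) -> dict:
    """Map entry point names to (not yet loaded) entry points in a group."""
    if sys.version_info >= (3, 10):
        # Python 3.10+ supports selecting a group directly.
        eps = metadata.entry_points(group=group)
    else:
        # Python 3.8/3.9 return a dict keyed by group name.
        eps = metadata.entry_points().get(group, [])
    return {ep.name: ep for ep in eps}


# Hypothetical group name; a plugin package would declare it in its metadata.
plugins = discover_plugins("gwas_analysis.io")
# Nothing is imported until a plugin is requested, e.g. plugins["vcf"].load()
```

Because discovery reads only package metadata, the plugin module (and any native dependencies) is imported only when `load()` is called on a chosen entry point.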

My take on this is that because #2 (i.e. what scikit-image does) does not make plugin dependencies a managed part of the framework, the extra abstraction is mostly unnecessary. Realistically, I think the uses of some IO plugin will result from either:

I highly doubt anyone is ever going to publish nice PyPI packages that operate as IO plugins to our project (presumably skallel eventually), so I'm favoring the Pandas model at the moment. This seems like the best way to support a few different backends for the same things. It could also easily be customized by a user in the second situation above if we allow them to pass, as the engine argument, an instance that conforms to our protocol/interface (defined in an abstract class or some other runtime-only hook) rather than a string name.
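
A minimal sketch of that Pandas-style model, assuming a hypothetical `Engine` abstract class and `read_table` function (none of these names come from Pandas or this project). The `engine` argument accepts either a registered string name or a user-supplied instance, and the backend module is imported only when the engine is created.

```python
import importlib
from abc import ABC, abstractmethod
from typing import Union


class Engine(ABC):
    """Hypothetical interface that every IO backend implements."""

    @abstractmethod
    def read(self, path: str) -> list:
        ...


class CsvEngine(Engine):
    """Toy backend; its module is imported only when the engine is created."""

    def __init__(self) -> None:
        self._csv = importlib.import_module("csv")

    def read(self, path: str) -> list:
        with open(path, newline="") as f:
            return list(self._csv.reader(f, delimiter="\t"))


# Built-in engines by name; classes are instantiated on demand.
_ENGINES = {"csv": CsvEngine}


def get_engine(engine: Union[str, Engine]) -> Engine:
    """Resolve a string name or pass through a user-supplied instance."""
    if isinstance(engine, Engine):
        return engine
    try:
        return _ENGINES[engine]()
    except KeyError:
        raise ValueError(
            f"unknown engine {engine!r}; choose from {sorted(_ENGINES)}"
        ) from None


def read_table(path: str, engine: Union[str, Engine] = "csv") -> list:
    return get_engine(engine).read(path)
```

A user with a custom backend would subclass `Engine` and call `read_table(path, engine=MyEngine())` without registering anything globally.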

@hammer I'm sure you've seen this play out a ton of times -- got any words of wisdom/caution?

eric-czech commented 4 years ago

Note: Jeff mentioned that the new Spark IO extension framework may make for good inspiration:

eric-czech commented 4 years ago

On reading file collections:

hammer commented 4 years ago

Note: Jeff mentioned that the new Spark IO extension framework may make for good inspiration

The reason I think this framework may make for good inspiration is that the first version of the extension framework did not allow extensions to pass enough metadata to Spark to allow for query plan optimization.

One other point we discussed in Slack is that not installing IO extensions by default is a good idea, because many of them will make use of native compression and/or encryption libraries that complicate installation. However, it's critical to design a pleasant UX for identifying and downloading extensions so that users are not confused when they cannot work with a file format they perceive to be standard.
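
One way to approach that UX is to turn a missing optional dependency into an actionable message. The sketch below is an assumption-laden illustration: the module name `hypothetical_bgen_backend`, the install target `gwas-io[bgen]`, and `load_extension` are all made up for the example.

```python
import importlib

# Hypothetical mapping: file format -> (backend module, pip install target).
_EXTENSIONS = {
    "bgen": ("hypothetical_bgen_backend", "gwas-io[bgen]"),
}


def load_extension(fmt: str):
    """Import the backend for a format, or explain how to install it."""
    try:
        module_name, install_target = _EXTENSIONS[fmt]
    except KeyError:
        raise ValueError(f"no known extension for format {fmt!r}") from None
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Tell the user what is missing and how to get it.
        raise ImportError(
            f"Reading {fmt!r} files requires an optional extension that is "
            f"not installed. Try: pip install '{install_target}'"
        ) from e
```

Keeping the format-to-extension mapping in the core package means the error can name the right extension even though the extension itself is absent.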

hammer commented 4 years ago

Other sources of inspiration

eric-czech commented 4 years ago

Here is a notebook that sketches a few ideas for this: I/O Notebook.

Some thoughts here:

hammer commented 4 years ago

To test our IO libraries, it would be useful to gather a library of small, ugly files to parse, for example https://discuss.hail.is/t/import-vcf-from-an-ugly-file-format-issues/1404.
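
A small sketch of what checks against such a library of ugly files could look like. `find_vcf_problems` is a hypothetical helper, not part of any existing library, and it checks only two structural properties of VCF text: that a `#CHROM` header line exists and that each data row has as many tab-separated columns as the header.

```python
from typing import List


def find_vcf_problems(text: str) -> List[str]:
    """Return human-readable structural problems in a small VCF-like string."""
    problems = []
    header = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Skip meta-information lines and blank lines.
        if line.startswith("##") or not line.strip():
            continue
        if line.startswith("#CHROM"):
            header = line.split("\t")
            continue
        if header is None:
            problems.append(f"line {lineno}: data before #CHROM header")
            continue
        n = len(line.split("\t"))
        if n != len(header):
            problems.append(
                f"line {lineno}: expected {len(header)} columns, found {n}"
            )
    if header is None:
        problems.append("missing #CHROM header line")
    return problems


HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
# A space-delimited row is a typical "ugly" case: it parses as one column.
ugly = HEADER + "1 100 . A T . . .\n"
print(find_vcf_problems(ugly))
```

Each ugly file in the collection could be paired with the list of problems a reader is expected to report, which makes regressions in error handling easy to spot.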