sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Data ingest #2

Closed · alimanfoo closed this 4 years ago

alimanfoo commented 4 years ago

Raising an issue to begin discussion of data ingest functionality. This may ultimately be too broad and need to be broken up, but it's a starting point.

Some sources of variant call data we may want to ingest:

We assume that the common endpoint for the ingest process is an xarray dataset.

Some questions:

alimanfoo commented 4 years ago

Elaborating a little: for formats like VCF, we usually want to convert to an intermediate representation like Zarr for efficiency reasons. If so, what conversion paths do we want to support?

E.g., for VCF we might do:

Function A could be the vcf_to_zarr function as currently implemented in scikit-allel. If so, we would just need to implement function B, which could be something like the zarrvcf_to_xarray function I prototyped recently.
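To make the two-step path concrete, here's a minimal sketch. Function A uses scikit-allel's real vcf_to_zarr; the shape of function B is purely illustrative (loosely following the zarrvcf_to_xarray prototype mentioned above, with placeholder variable and dimension names):

```python
import allel
import dask.array as da
import xarray as xr
import zarr

# Function A: parse the VCF once and store it as Zarr (scikit-allel, as-is)
allel.vcf_to_zarr("example.vcf", "example.zarr", fields="*")

# Function B: lazily wrap the Zarr store as an xarray dataset
# (hypothetical shape; variable/dimension names are placeholders)
def zarrvcf_to_xarray(store_path):
    root = zarr.open_group(store_path, mode="r")
    return xr.Dataset(
        {
            "call_genotype": (
                ("variants", "samples", "ploidy"),
                da.from_zarr(store_path, component="calldata/GT"),
            ),
        },
        coords={"variant_position": ("variants", root["variants/POS"][:])},
    )

ds = zarrvcf_to_xarray("example.zarr")
```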

I'd like to make a soft proposal that:

Any comments? If there are no objections, as a next step I'd flesh out a proposed API for open_zarrvcf().

alimanfoo commented 4 years ago

Examples of public datasets available as VCF and Zarr-VCF:

eric-czech commented 4 years ago

Makes sense to me to use scikit-allel for as much as possible. Does vcf_to_zarr introduce any dependencies beyond cython/numpy/dask/zarr?

Re: the xarray representation, I put up a few thoughts on the Discourse.

jeromekelleher commented 4 years ago

This sounds like a great starting point @alimanfoo.

Longer term, I'd argue against having functions that parse/process things like vcf/bgen/plink within the core sgkit library. Unless we're going to code these up from scratch ourselves (which is a fair amount of work) this would introduce problematic dependencies. Maybe the best thing to do is to have ancillary packages to support converting these file formats to sgkit xarray. So, sgkit only deals with its own data representation, but there are utilities that we maintain to convert data into sgkit format (e.g., vcf2sg, plink2sg, bgen2sg, ...). Each of these utilities is independent (i.e., lives in its own repo), and can use whatever library is most appropriate (i.e., cyvcf2 for vcf2sg, etc). That way there's a nice clean boundary, and it's clear what goes into sgkit and what doesn't. Adding support for format X means writing a conversion utility from format X into sgkit.
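As a sketch of what that boundary could look like in practice (vcf2sg is a name from this comment, not a real package, and its convert function is an assumption):

```python
import sgkit    # core library: knows only its own xarray representation
import vcf2sg   # hypothetical standalone converter: owns the VCF-parsing dependency

ds = vcf2sg.convert("example.vcf")  # produce an sgkit-style xarray.Dataset
# ...from here on, only sgkit functions touch ds; no VCF parser is imported
```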

We certainly want to be able to represent data that's in these other formats though - that's crucial.

alimanfoo commented 4 years ago

Longer term, I'd argue against having functions that parse/process things like vcf/bgen/plink within the core sgkit library. Unless we're going to code these up from scratch ourselves (which is a fair amount of work) this would introduce problematic dependencies. Maybe the best thing to do is to have ancillary packages to support converting these file formats to sgkit xarray.

+1

So for VCF we could rely on scikit-allel for the time being, but then at some point (maybe further down the line) figure out whether to extract and migrate that VCF ingest code into a new pystatgen repo, or to build a new VCF ingest package using a different dependency like cyvcf2, or directly on htslib.

jeromekelleher commented 4 years ago

So for VCF we could rely on scikit-allel for the time being, but then at some point (maybe further down the line) figure out whether to extract and migrate that VCF ingest code into a new pystatgen repo, or to build a new VCF ingest package using a different dependency like cyvcf2, or directly on htslib.

Sounds like a great plan. Let's get ourselves up and running doing the interesting stuff in sgkit as quickly as we can, and then circle back later to deal with the messy details of ingesting data once we have useful functionality. (Modulo making sure that we can represent data from the different sources reasonably losslessly.)

alimanfoo commented 4 years ago

Does vcf_to_zarr introduce any dependencies beyond cython/numpy/dask/zarr?

I think it's just cython, numpy and zarr.

hammer commented 4 years ago

In my discussions with the Google Cloud team they pointed me to their https://github.com/googlegenomics/gcp-variant-transforms project, which includes some pretty sophisticated code for handling VCF files that we may want to examine and potentially either factor out or vendor into our repo.

I'll also point to the relevant Discourse topic for those that encounter this project on GitHub rather than Discourse.

alimanfoo commented 4 years ago

Looks like gcp-variant-transforms is using pysam for the VCF parsing.

alimanfoo commented 4 years ago

Btw when I wrote the VCF parser currently in scikit-allel, I decided not to go with pysam mainly because it doesn't compile on Windows and I was trying to be Windows-friendly. But VCF parsing code is hard to get right, hard to make go fast, and hard to maintain. So if it makes sense in the long term to rely on pysam or htslib directly or some other library, even if we can't get that to work on Windows, that may be the best trade-off.

jeromekelleher commented 4 years ago

Btw when I wrote the VCF parser currently in scikit-allel, I decided not to go with pysam mainly because it doesn't compile on Windows and I was trying to be Windows-friendly. But VCF parsing code is hard to get right, hard to make go fast, and hard to maintain. So if it makes sense in the long term to rely on pysam or htslib directly or some other library, even if we can't get that to work on Windows, that may be the best trade-off.

Another excellent reason for the converters to live in their own repos! I 100% agree that sgkit should support Windows, but writing code that'll parse VCFs and plink files quickly on Windows is not where we should be putting our effort IMO.

The more I think about this the more strongly I feel about it - sgkit should know nothing at all about bioinfo file formats (other than being able to represent their data), and conversion into sgkit format should be done by standalone conversion utilities. I shouldn't have to struggle with installing pysam if I'm working with bgens, and vice versa for whatever bgen reader we use if I'm converting VCF/BCF.

Conversion is a once-off pain, and I don't think it's that big a burden to say "if you want to import VCF data into sgkit, please run pip install vcf2sg, and vcf2sg myfile.vcf > myfile.sgzarr.

tomwhite commented 4 years ago

+1 to controlling dependency explosion early on, although it does mean we'll have more repos to manage.

We might have an umbrella package (sgkit[all]?) to make it easy for users who don't want to think too hard about how we've structured things, so they can install everything in one go.
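One plausible way to wire that up is with setuptools extras (a sketch only; the converter package names follow the hypothetical vcf2sg/bgen2sg naming from earlier in the thread):

```python
# setup.py (sketch)
from setuptools import setup

setup(
    name="sgkit",
    extras_require={
        "vcf": ["vcf2sg"],
        "bgen": ["bgen2sg"],
        "plink": ["plink2sg"],
        # `pip install sgkit[all]` would pull in every converter at once
        "all": ["vcf2sg", "bgen2sg", "plink2sg"],
    },
)
```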

As an example of how integrating backends is not always clear cut, we currently have two backends for BGEN - one using PyBGEN and one using bgen-reader - since we haven't yet settled on a preferred one. My preference would be to pick one that we use in sgkit, but there is likely to be a bit of work to get to that point, since factors like scaling to large BGEN files are important and need testing. We will also probably have to do work in the upstream library. There's more discussion at https://discourse.pystatgen.org/t/python-io-libraries/13/7.
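For reference, the two candidate backends look roughly like this (based on each library's documented usage at the time; treat the details as approximate):

```python
from bgen_reader import read_bgen
from pybgen import PyBGEN

# bgen-reader: lazy, dask-backed access to per-variant probabilities
bgen = read_bgen("example.bgen", verbose=False)
probs = bgen["genotype"][0].compute()["probs"]  # probabilities for the first variant

# PyBGEN: simple iteration yielding (variant info, dosage) pairs
with PyBGEN("example.bgen") as f:
    for info, dosage in f:
        break
```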

jeromekelleher commented 4 years ago

As an example of how integrating backends is not always clear cut, we currently have two backends for BGEN - one using PyBGEN and one using bgen-reader - since we haven't yet settled on a preferred one. My preference would be to pick one that we use in sgkit, but there is likely to be a bit of work to get to that point, since factors like scaling to large BGEN files are important and need testing. We will also probably have to do work in the upstream library. There's more discussion at https://discourse.pystatgen.org/t/python-io-libraries/13/7.

There's an important strategic point here, I think. When you say "backend", do you think of sgkit working directly with BGEN files, or do you imagine using this backend to convert BGEN data into native sgkit form? I think it would be a fundamental mistake to try to support external file formats as something we work with directly (as opposed to supporting once-off conversion into our own format). In practice, we'll always end up doing a full conversion into our own format first anyway (have you tried getting VCF data into efficient numpy format? I've yet to find a library that'll do it properly). The dependencies for these libraries are pretty horrible on their own, and mixing them all together is a really unenticing prospect. I'd bet good money we'd end up compiling htslib at least twice from a PyPI install if we included libs for parsing a bunch of formats.

Having four or five repos to support conversion for the most important formats isn't a big deal, and it's far preferable to polluting the core sgkit statistical code with nasty dependencies.

jeromekelleher commented 4 years ago

We might have an umbrella package (sgkit[all]?) to make it easy for users who don't want to think too hard about how we've structured things, so they can install everything in one go.

Having a metapackage would be handy all right, but I'm wary about combining the dependencies and the support burden this might imply when users install everything instead of just (say) the VCF parser they actually need. Anyway, this is a long way in the future - right now we just want to get some data into the format we need so that we can start developing some interesting functionality (i.e., not boring format-conversion stuff), right?

tomwhite commented 4 years ago

When you say "backend" do you think of sgkit working directly with BGEN files, or do you imagine using this backend to convert BGEN data into native sgkit form?

IO backends will always convert into the native sgkit format.

hammer commented 4 years ago

I just realized I should link to our prior discussion on this topic at Related Sciences: https://github.com/related-sciences/gwas-analysis/issues/23

I also wanted to point out there are several distinctions here:

I also want to mention that our lab put a lot of energy into writing a somatic mutation caller at https://github.com/hammerlab/guacamole. Ultimately our work was stymied by issues in the underlying third-party IO library, https://github.com/HadoopGenomics/Hadoop-BAM, which we had to fix ourselves by developing https://github.com/hammerlab/spark-bam. Tom eventually tried to unify these efforts at https://github.com/disq-bio/disq.

I mention the above story because I suspect we may end up following a similar path for one or more of our file formats: first, depend upon third party library; second, create first party library to fix deficiencies in the third party library; third, recruit additional developers so that our first party library can ultimately be sustained by a third party.

I want to be sure we balance the development of interesting functionality with robust and performant IO libraries. The lesson from Guacamole, for me, is to invest more energy up front in IO libraries. So while I agree with our current direction of depending upon third party libraries, I just want to prepare everyone for the work we're going to have to put into the IO libraries eventually.

jeromekelleher commented 4 years ago

Sounds good @hammer, all good points. I do want to push back (again) on the idea of doing VCF/plink/bgen IO within sgkit. I'm going to focus on VCF, for simplicity, but the points will be the same for other formats.

VCF/BCF data is not stored in a way that can ever be efficiently transformed into columnar numpy arrays. Therefore, there will always be a conversion step required to go from vcf -> sgkit. The conversion will not be lossless (VCF is far too loose, and we should not try to losslessly convert every quirk of VCF), and there will be decisions to be made, parameters to provide, etc. So there will always be a step sg_dataset = convert_vcf(path), and it makes little sense to redo this (it will be expensive), so we'll always do something like sgkit.dump(sg_dataset, path). From a user's perspective, then, the only difference we're actually talking about is whether they call sgkit.convert_vcf(path) or vcf2sg.convert(path) (or run vcf2sg on the command line as a pre-processing step).
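For concreteness, the two user-facing flows being compared are (all names here are the hypothetical ones from this paragraph, not a real API):

```python
import sgkit

# Option 1: converter bundled into sgkit
sg_dataset = sgkit.convert_vcf("example.vcf")  # hypothetical
sgkit.dump(sg_dataset, "example.sgzarr")       # cache the expensive conversion

# Option 2: standalone converter package (the proposal in this thread)
import vcf2sg                                  # hypothetical
sg_dataset = vcf2sg.convert("example.vcf")
sgkit.dump(sg_dataset, "example.sgzarr")
```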

OK, so we do end up with a handful of converter libraries. This is no biggie, right? All the VCF code ends up in the vcf2sg repo. If we decide that we need to write our own BCF parser, then this is a natural place to put it. It is naturally isolated from both sgkit, and from all the other file IO code. If we don't do this, then sgkit will become a sprawling mess incorporating dozens of quirks and workarounds from all the different formats. Just converting data from VCF is itself a big job, and we will end up having a lot of (difficult to test) code to do this. Let's keep sgkit itself about the actual statistical genetics, and not the miserable wretchedness of the current file format ecosystem.

Re users who will want to install everything - I can see this being handy, but I don't think it's that common a requirement. Most of the time, users are working with either VCFs, or plink, or... not all of them at once. To me, the burden of keeping all of the conversion code for all of the formats, and all of the dependencies that implies (for both end users and CI), makes the minor inconvenience of installing one more package (vcf2sg) overwhelmingly preferable.

Importing VCF etc. is a chore we have to get through to reach the exciting bits: using the pydata ecosystem to do statistical genetics efficiently and conveniently in Python. Let's not compromise the exciting and interesting bit for the sake of the boring chore.

tomwhite commented 4 years ago

@jeromekelleher I think there's general agreement about keeping the complexity of file format code outside the core repo.

right now we just want to get some data into the format we need so that we can start developing some interesting functionality (i.e., not boring format-conversion stuff), right?

+1

hammer commented 4 years ago

If we are going to send IO code out to separate repos, we need to figure out how we will partition code across repos, and how to name each repo.

For example, we could create a repo to match each third-party library upon which we depend. Take BGEN files: we could have separate sgkit-pybgen and sgkit-bgen-reader repos. Or, we could make a single sgkit-bgen repo and put interfaces to multiple third-party libraries in that one repo.

Another issue: how should we handle families of formats, e.g. VCF, BCF, and gVCF? Do we want one repo for each format, or a single repo that handles the whole family?

One more issue: how should we handle graceful degradation in situations when we have optimized (e.g. using native code) and unoptimized (e.g. pure Python) IO libraries for the same file format? Some users may have small VCF files, for example, and will only require a pure Python library, while others may want to read the gnomAD VCF, for which performance is critical.
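One common pattern for the graceful-degradation case is an import-time fallback (a generic sketch, not sgkit code; the pure-Python branch is a placeholder):

```python
try:
    from cyvcf2 import VCF as _FastVCF  # htslib-backed, fast, needs native code
except ImportError:
    _FastVCF = None

def open_vcf(path):
    """Return an iterator over VCF records using the best available backend."""
    if _FastVCF is not None:
        return _FastVCF(path)
    # a pure-Python reader would go here; left as a placeholder
    raise NotImplementedError("no pure-Python VCF backend installed")
```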

jeromekelleher commented 4 years ago

I think sgkit-vcf (e.g.) is probably a good option, and this repo should handle VCF, BCF and gVCF. I'd imagine we'll end up needing to use htslib for this anyway, either through pysam/cyvcf2 or via our own wrapper. I'd imagine using cyvcf2 will be fine initially. A great advantage of using htslib is that it's canonical - we don't have to get into discussions about what is and isn't a valid VCF file.
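For a sense of what the cyvcf2 route looks like (this is the real cyvcf2 API, shown only to illustrate why an htslib-backed reader is attractive):

```python
from cyvcf2 import VCF

for variant in VCF("example.vcf.gz"):  # htslib handles VCF, BCF, and gVCF
    print(variant.CHROM, variant.POS)
    print(variant.gt_types)  # numpy array: 0=HOM_REF, 1=HET, 2=UNKNOWN, 3=HOM_ALT
    break
```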

From a user's perspective, I don't think we should have separate repos like sgkit-pybgen and sgkit-bgen-reader, as I wouldn't know which one to use. Also, I'm not sure why we'd want multiple converters - wouldn't it be better to focus our efforts on one (after an initial prototyping phase to figure out which works better)?

tomwhite commented 4 years ago

Agree with sgkit-vcf.

For BGEN I've been using both PyBGEN and bgen-reader, and while I don't know which one we should use yet, I hope we can pick one and host in sgkit-bgen.

hammer commented 4 years ago

Okay to close this issue now that we have a general approach settled?