sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Read data from VCF files #104

Closed. hammer closed this issue 4 years ago

hammer commented 4 years ago

We can read data from VCF Zarr files with #40.

Ultimately we will need to read data from VCF files directly.

https://github.com/brentp/cyvcf2 looks to be a good option for this purpose.

Understanding our options here will also help us reach a conclusion on #65.

tomwhite commented 4 years ago

I spent some time this week prototyping what this would look like.

The first piece is a function to read a (bgzip compressed) VCF file using cyvcf2 and convert it into sgkit xarray format. It does this by reading a chunk of N variants (N=10,000 for my experiment), creating an xarray dataset for those variants, then calling to_zarr on the dataset with append_dim="variants" to append to a Zarr store on disk. This is very similar to the way scikit-allel's vcf_to_zarr function is structured.
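Roughly, the loop looks like the following sketch (the names here are placeholders rather than the actual prototype code, and it assumes a diploid VCF read with cyvcf2's genotypes accessor):

import numpy as np
import xarray as xr
from cyvcf2 import VCF

def vcf_to_zarr_sketch(vcf_path, zarr_path, chunk_size=10_000):
    # Buffer chunk_size variants at a time, then append them to the Zarr store.
    first = True
    buffer = []
    for variant in VCF(vcf_path):
        # variant.genotypes is a per-sample list of [allele_0, allele_1, phased]
        buffer.append([gt[:2] for gt in variant.genotypes])
        if len(buffer) == chunk_size:
            write_chunk(buffer, zarr_path, first)
            first, buffer = False, []
    if buffer:
        write_chunk(buffer, zarr_path, first)

def write_chunk(buffer, zarr_path, first):
    gt = np.array(buffer, dtype="int8")  # shape (variants, samples, ploidy)
    ds = xr.Dataset({"call_genotype": (("variants", "samples", "ploidy"), gt)})
    # The first chunk creates the store; subsequent chunks append along "variants".
    if first:
        ds.to_zarr(zarr_path, mode="w")
    else:
        ds.to_zarr(zarr_path, mode="a", append_dim="variants")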

So far, everything is serial. To run in parallel, there needs to be a way to i) divide the VCF(s) into parts to be processed, and ii) merge the resulting Zarr stores.

For i), since VCF files are commonly tabix indexed, we could use genomic regions to partition the VCF(s) into roughly equal parts, each of which would be processed independently to create a Zarr store. We'd have to come up with some heuristics for doing a decent job on partitioning (e.g. like this).
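As a rough illustration of the kind of partitioning I mean (not the actual heuristic, and assuming the VCF header declares contig lengths, which cyvcf2 exposes as seqnames/seqlens):

from cyvcf2 import VCF

def partition_into_regions(vcf_path, parts_per_contig=16):
    # Split every contig into equal-length windows; each region string can then
    # be handed to an independent worker, which reads it via VCF(vcf_path)(region).
    vcf = VCF(vcf_path)
    regions = []
    for contig, length in zip(vcf.seqnames, vcf.seqlens):
        step = max(1, length // parts_per_contig)
        for start in range(1, length + 1, step):
            end = min(start + step - 1, length)
            regions.append(f"{contig}:{start}-{end}")
    return regions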

Then, for ii), the problem is that each Zarr store will have a different number of variants, and even though the chunk sizes are the same, they need to be rechunked in the variant dimension.

For the sake of example, if the chunk size was 10, and the first Zarr store had (3, ) variants, and the second had (10, 6) variants, then they would be merged and rechunked as (3+7, 3+6) = (10, 9) variants.
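The same arithmetic can be checked with a toy dask array (illustrative only):

import dask.array as da

a = da.ones(3, chunks=3)                      # first store: chunks (3,)
b = da.ones(16, chunks=10)                    # second store: chunks (10, 6)
merged = da.concatenate([a, b]).rechunk(10)
print(merged.chunks)                          # ((10, 9),)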

This can be achieved in Xarray with

import xarray as xr

# datasets: the per-part datasets opened with xr.open_zarr; chunk_size: the
# target chunk size in the variants dimension; output: the final Zarr store.
ds = xr.concat(datasets, dim="variants")
ds = ds.chunk({"variants": chunk_size})
ds.to_zarr(output, mode="w")

However, there may be some caveats with running at scale, as discussed in https://github.com/dask/dask/issues/5105 (by @alimanfoo!), which may need work (rechunker may be an option too). I haven't looked at performance, but VCF parsing should be very fast with cyvcf2, since it is a thin interface to htslib (see the paper: https://academic.oup.com/bioinformatics/article/33/12/1867/2971439).
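For reference, a rechunker-based alternative would look roughly like this (a sketch only; store paths and chunk sizes are placeholders):

import zarr
from rechunker import rechunk

source = zarr.open("concatenated.zarr")       # merged but unevenly chunked store
plan = rechunk(
    source["call_genotype"],
    target_chunks=(10_000, 1_000, 2),         # (variants, samples, ploidy)
    max_mem="1GB",
    target_store="rechunked.zarr",
    temp_store="rechunked-tmp.zarr",
)
plan.execute()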

This two-step scheme should scale well (using Dask), since it is basically two embarrassingly parallel steps, mediated by a set of intermediate Zarr stores.
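Sketched with dask.delayed, and reusing the hypothetical helpers from the sketches above, the orchestration would look something like this (names and paths are illustrative):

import dask
import xarray as xr

@dask.delayed
def convert_region(vcf_path, region, part_path):
    # Step i): each worker converts one genomic region into its own Zarr store.
    # convert_region_to_zarr is a hypothetical per-region version of the reader above.
    convert_region_to_zarr(vcf_path, region, part_path)
    return part_path

def vcf_to_zarr_parallel(vcf_path, regions, output, chunk_size=10_000):
    tasks = [convert_region(vcf_path, r, f"part-{i}.zarr") for i, r in enumerate(regions)]
    part_paths = dask.compute(*tasks)
    # Step ii): concatenate the intermediate stores, rechunk, and write the final store.
    datasets = [xr.open_zarr(p) for p in part_paths]
    ds = xr.concat(datasets, dim="variants")
    ds = ds.chunk({"variants": chunk_size})
    ds.to_zarr(output, mode="w")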

There are some build issues we'd probably want to address too.

jeromekelleher commented 4 years ago

This sounds like a good plan @tomwhite.

hammer commented 4 years ago

For some very large VCF files, such as those provided by gnomAD, we may need some strategy to parallelize across subjects as well, though I'm not sure that will be possible...

tomwhite commented 4 years ago

I've implemented the design in my previous comment https://github.com/pystatgen/sgkit/issues/104#issuecomment-673454272 here: https://github.com/tomwhite/sgkit-vcf

For partitioning, I use the Tabix linear index to find BGZF block boundaries, and use those to split the compressed VCF file into roughly equal parts. Of course, these parts are compressed and may contain different numbers of variants, but it seems to work quite well in practice.
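For anyone curious, the gist of reading the linear index is sketched below; this follows the published tabix (.tbi) spec rather than being the actual sgkit-vcf code, and the helper name is just illustrative:

import gzip
import struct

def tabix_linear_offsets(tbi_path):
    # Return, per reference sequence, the BGZF (compressed) file offsets recorded
    # in the .tbi linear index. These offsets fall on BGZF block boundaries and
    # can be used as split points for the compressed VCF.
    with gzip.open(tbi_path, "rb") as f:
        data = f.read()
    magic, n_ref = struct.unpack_from("<4si", data, 0)
    assert magic == b"TBI\x01"
    # Skip the six int32 header fields after n_ref, then the reference names blob.
    (l_nm,) = struct.unpack_from("<i", data, 32)
    pos = 36 + l_nm
    offsets = []
    for _ in range(n_ref):
        (n_bin,) = struct.unpack_from("<i", data, pos)
        pos += 4
        for _ in range(n_bin):
            _bin, n_chunk = struct.unpack_from("<Ii", data, pos)
            pos += 8 + 16 * n_chunk      # skip the chunk begin/end virtual offsets
        (n_intv,) = struct.unpack_from("<i", data, pos)
        pos += 4
        ioff = struct.unpack_from("<%dQ" % n_intv, data, pos)
        pos += 8 * n_intv
        # The compressed offset is the top 48 bits of each 64-bit virtual offset.
        offsets.append([v >> 16 for v in ioff])
    return offsets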

Running some profiling on 1000 genomes chr22, it takes 8min 5s running sequentially, versus 4min 1s running in parallel on my 4 core machine. The memory/CPU profiles are here:

https://nbviewer.jupyter.org/github/tomwhite/sgkit-vcf/blob/master/vcf_prof.ipynb

There are two phases, the first 180s is VCF parsing, the rest is rechunking Zarr files. It does a pretty good job of using all the CPU, particularly during the second phase.

I compared with scikit-allel too (which runs sequentially), and it took 5min 29s. Note that it's not strictly a like-for-like comparison since scikit-allel does not provide phasing info, and doesn't store a mask either.

eric-czech commented 4 years ago

Very cool @tomwhite! Does the scikit-allel function subset to a specific set of fields as well or does it try to convert all of them?

tomwhite commented 4 years ago

@eric-czech scikit-allel converts roughly the same set of fields (calls, variants, samples).

eric-czech commented 4 years ago

I see, would you say it's in scope for this reader to also pull other call fields like GQ or AD or other variant info fields? I'm looking at https://nbviewer.jupyter.org/gist/alimanfoo/1240aaccab59567eb6502fd764f57751.

I'm basically wondering if/when we try to tackle supporting extra fields that we may not yet use in any other sgkit functions.

jeromekelleher commented 4 years ago

This is awesome @tomwhite! Splitting on BGZF block boundaries is a great idea. It looks like the current approach depends on an actual tabix index, though, and not all VCFs are tabix indexed. Files produced by more recent versions of htslib have CSI indexes.

Have you tried this code on a BCF file?

(@tomwhite - this is getting technical though, maybe we should take this discussion over to sgkit-vcf?)

jeromekelleher commented 4 years ago

I'm basically wondering if/when we try to tackle supporting extra fields that we may not yet use in any other sgkit functions.

I'd suggest we keep the number of fields we import to a minimum initially, but we'll certainly need to be able to import this stuff at some point. Even if we don't use the fields in sgkit, users will want to be able to access them.

tomwhite commented 4 years ago

I see, would you say it's in scope for this reader to also pull other call fields like GQ or AD or other variant info fields? I'm looking at https://nbviewer.jupyter.org/gist/alimanfoo/1240aaccab59567eb6502fd764f57751.

In the file I tried, scikit-allel pulled in ALT, CHROM, FILTER_PASS, ID, POS, QUAL, and REF.

In the new code using cyvcf2 it would be trivial to store other variant INFO fields, like GQ or AD. So it's in scope for this reader, but not necessarily for the first version.

tomwhite commented 4 years ago

It looks like the current approach depends on an actual tabix index, though, and not all VCFs are tabix indexed. Files produced by more recent versions of htslib have CSI indexes.

cyvcf2 works with CSI indexes, but I'll need to extend the partition-calculation code, which currently reads TBI, to support CSI indexes too. That shouldn't be too hard since CSI is virtually the same as TBI.

Have you tried this code on a BCF file?

Not yet, but CSI-indexed BCF should work.

eric-czech commented 4 years ago

In the new code using cyvcf2 it would be trivial to store other variant INFO fields, like GQ or AD. So it's in scope for this reader, but not necessarily for the first version.

Ok cool. Just to clarify though, is it still trivial for fields like GQ and AD if they have variants and samples dimensions?

An API design question I had with VCF is whether we can simply let users specify the extra fields to fetch, along with their dimensions and data types, rather than trying to manage each field explicitly.
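Purely as an illustration of that idea (not an agreed design; read_vcf and the field-spec format below are hypothetical):

# The caller declares extra fields together with their dimensions and dtypes,
# and the reader stores whatever is requested.
extra_fields = {
    "DP": {"dims": ("variants",), "dtype": "int32"},                   # INFO field
    "GQ": {"dims": ("variants", "samples"), "dtype": "int32"},         # FORMAT field
    "AD": {"dims": ("variants", "samples", "alleles"), "dtype": "int32"},
}
# hypothetical call:
# ds = read_vcf("sample.vcf.gz", extra_fields=extra_fields)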

eric-czech commented 4 years ago

Now that I'm thinking about it, I have a hard time imagining how the code could extract multiple call data fields as zarr without doing multiple passes over the whole VCF file. Do you know if that's how it would have to work @tomwhite?

tomwhite commented 4 years ago

Sorry, I was just referring to variant INFO fields above. GQ is a genotype field (conditional genotype quality), not a variant INFO field as I mistakenly said above. However, it's still possible to retrieve GQ using cyvcf2, so we could store it in xarray just like call_genotype is now. This wouldn't require multiple passes over the VCF file.
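To illustrate (a sketch only, with an illustrative input path; cyvcf2 exposes FORMAT fields via variant.format):

import numpy as np
from cyvcf2 import VCF

gts, gqs = [], []
for variant in VCF("sample.vcf.gz"):
    # Genotypes and GQ are both read from the same record, so one pass suffices.
    gts.append([gt[:2] for gt in variant.genotypes])
    gq = variant.format("GQ")                # per-sample array, or None if absent
    gqs.append(gq[:, 0] if gq is not None else np.full(len(variant.genotypes), -1))

call_genotype = np.array(gts, dtype="int8")  # (variants, samples, ploidy)
call_GQ = np.array(gqs, dtype="int32")       # (variants, samples)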

eric-czech commented 4 years ago

This wouldn't require multiple passes over the VCF file.

Niceee. I was worried there was no good way to write two call data variables like GT and GQ, coming from a non-columnar file format, to Zarr via Xarray without multiple passes.

tomwhite commented 4 years ago

I've extended my branch at https://github.com/tomwhite/sgkit-vcf to support chunking in the samples dimension (all samples still have to be read into memory when reading from VCF). I've also added support for BCF and CSI indexing. (The latter was harder than I thought it was going to be since CSI doesn't have a linear index, but rather only supplies offsets for each bin.)

I think the next step would be to move it into a sgkit-vcf repo here, which means figuring out how to install htslib for CI.

jeromekelleher commented 4 years ago

Wow, awesome work @tomwhite, that was super fast!

I think the next step would be to move it into a sgkit-vcf repo here, which means figuring out how to install htslib for CI.

We could try using the Debian htslib packages, but it'll still be fun trying to get cyvcf2 to actually use them. I think we'll end up just building htslib each time as part of the cyvcf2 install, which I think should mostly work once libcurl is installed.

hammer commented 4 years ago

figuring out how to install htslib for CI.

Yay! And...yikes. Does it make sense to fix this upstream at https://github.com/brentp/cyvcf2/issues/114?

Also as a reminder @horta pointed us to https://github.com/joerick/cibuildwheel as a tool he's used for this purpose.

tomwhite commented 4 years ago

I was thinking of doing it in two stages so we can use it/iterate on it sooner.

  1. Use the Debian htslib packages (like cyvcf2 does): https://github.com/brentp/cyvcf2/blob/master/.travis.yml
  2. Build wheels (thanks for the refs @hammer)

tomwhite commented 4 years ago

Fixed in https://github.com/pystatgen/sgkit-vcf/pull/1