related-sciences / gwas-analysis

GWAS data analysis experiments
Apache License 2.0

Add bgen-reader backend #38

Closed tomwhite closed 4 years ago

tomwhite commented 4 years ago

The code in #36 uses PyBGEN to read BGEN files. This (incomplete) PR adds a bgen-reader backend so we can compare the two libraries. A few differences/comments:

The main problem with this code as it stands is that it doesn't create one of the genetic datasets defined in https://nbviewer.jupyter.org/github/related-sciences/gwas-analysis/blob/master/notebooks/platform/xarray/data_structures.ipynb.

Each variant/sample has three probabilities (AA, AB, BB), but none of the data structures support that, unless I'm missing something. GenotypeProbabilityDataset looks like the closest fit, but it appears to require phased data, since there are four allele/ploidy combinations. Alistair's comment in https://discourse.smadstatgen.org/t/n-d-array-conventions-for-variation-data/16/8 seems relevant:

One way of representing genotype likelihoods would be as a 3D array of floats, where the first dimension was variants, the second dimension was samples, and third dimension was the number of possible genotypes.

I wonder if we should add support for this data representation too.
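To make the quoted layout concrete, here is a minimal NumPy sketch of that 3D representation (not code from this PR; the shape and values are made up for illustration, and the genotype axis assumes a diploid, bi-allelic site):

```python
import numpy as np

# Axis 0 = variants, axis 1 = samples, axis 2 = possible genotypes
# (AA, AB, BB for a diploid, bi-allelic site). Values are illustrative.
n_variants, n_samples, n_genotypes = 4, 3, 3
probs = np.full((n_variants, n_samples, n_genotypes), 1.0 / n_genotypes)

# For each variant/sample, the probabilities along the genotype
# axis should sum to 1.
assert np.allclose(probs.sum(axis=-1), 1.0)
```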

Thoughts @eric-czech?

tomwhite commented 4 years ago

One other thing I noticed that may be useful in the future is that bgen-reader supports variable ploidy and number of alleles, unlike PyBGEN which is restricted to diploid, bi-allelic BGEN files.

eric-czech commented 4 years ago

Very nice! Two options that come to mind for the representation are:

  1. Treat bgen more like VCF: a container of multiple different types of datasets, where neither has a model type whose structure/content we try to assert
    • The genotype probabilities are rarely used directly afaik, and are instead converted to either hard calls (using max probability) or dosages as P(AB) + 2 * P(BB) so the result from your io backend could return all 3. If the transformed arrays are dask arrays then presumably this wouldn't be inefficient if they're never used. This is the approach Hail takes in import_bgen.
  2. Convert GenotypeProbabilityDataset to have 3 dimensions matching what gets stored in BGEN, rather than 4 like I originally proposed. I only did that originally based on this discussion, where I'm saying that less space is needed in a non-combinatorial representation with high ploidy organisms. It's probably easier to just do what everyone else does though and assume there will be elements in the array for each joint probability.
    • Going this route, the conversion to something more useful would be like core.create_genotype_probability_dataset(vars).to.genotype_dosage_dataset()
    • How do you feel about that?
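The two conversions mentioned in option 1 (hard calls via max probability, dosage as P(AB) + 2 * P(BB)) can be sketched like this. The `probs` array is made up for illustration; in the io backend these would presumably be lazy dask arrays rather than NumPy arrays, so unused results cost nothing:

```python
import numpy as np

# Hypothetical (variants, samples, genotypes) probability array; the last
# axis holds P(AA), P(AB), P(BB) for a diploid, bi-allelic site.
probs = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.2, 0.7]],
    [[0.2, 0.5, 0.3], [0.0, 0.0, 1.0]],
])

# Alternate-allele dosage: P(AB) + 2 * P(BB)
dosage = probs[..., 1] + 2 * probs[..., 2]

# Hard calls: the genotype (0, 1, or 2 alt alleles) with the
# highest probability
hard_calls = np.argmax(probs, axis=-1)
```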
tomwhite commented 4 years ago

Thanks for setting out some options @eric-czech. I've tried approach 2 in this latest update.

I added a GenotypeProbabilityAltDataset type of shape (variants, samples, genotypes), i.e. the last dimension is 3. Now when loading using bgen-reader we get the probabilities, and these can be converted to dosages with a single to.genotype_dosage_dataset() call. I think it's best to do it this way, i.e. load what's in the BGEN file (probabilities) and have the user explicitly convert to dosages if they want them. (I need to change the PyBGEN reader to be consistent with this.)

If we go this route, we should probably just change GenotypeProbabilityDataset to be what I've called GenotypeProbabilityAltDataset.

Another thing I noticed is that to.genotype_dosage_dataset() loses variant and sample variables, which seems like a bug.

eric-czech commented 4 years ago

I think it's best to do it this way, i.e. load what's in the BGEN file (probabilities) and have the user explicitly convert to dosages if they want them.

👍

If we go this route, we should probably just change GenotypeProbabilityDataset to be what I've called GenotypeProbabilityAltDataset

Do we want to just do away with those wrappers now? My intention with them was mostly to organize code and attach attributes for later optimizations that aren't really necessary yet (or maybe ever), and certainly not to actually override internal Xarray methods, so perhaps this reader would be a good example for what reading in dosage data or hard calls would look like otherwise?

tomwhite commented 4 years ago

Do we want to just do away with those wrappers now?

Yes, I think so. So in this case the read_bgen function would return a (DIM_VARIANT, DIM_SAMPLE, DIM_GENOTYPE) sized array (as an xarray obviously), containing the probabilities. And there would be a function for converting to dosages. Does that sound about right?
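A rough sketch of what that could look like, assuming hypothetical probs_to_dataarray and genotype_dosage helpers (neither exists in this PR; the dimension names follow the DIM_* constants above, and a real read_bgen would build the array from the file rather than take it as an argument):

```python
import numpy as np
import xarray as xr

DIM_VARIANT, DIM_SAMPLE, DIM_GENOTYPE = "variant", "sample", "genotype"

def probs_to_dataarray(probs: np.ndarray) -> xr.DataArray:
    """Wrap a (variants, samples, genotypes) probability array in the
    shape a read_bgen function could return."""
    return xr.DataArray(probs, dims=(DIM_VARIANT, DIM_SAMPLE, DIM_GENOTYPE))

def genotype_dosage(da: xr.DataArray) -> xr.DataArray:
    """Convert genotype probabilities to alt-allele dosages,
    P(AB) + 2 * P(BB), dropping the genotype dimension."""
    return da.isel({DIM_GENOTYPE: 1}) + 2 * da.isel({DIM_GENOTYPE: 2})
```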

eric-czech commented 4 years ago

Yes, I think so. So in this case the read_bgen function would return a (DIM_VARIANT, DIM_SAMPLE, DIM_GENOTYPE) sized array (as an xarray obviously), containing the probabilities. And there would be a function for converting to dosages. Does that sound about right?

Yep. Some things are probably worth thinking about, though, if the datasets are returned from readers as-is:

tomwhite commented 4 years ago

Should they have coordinates?

If we have them, then I think it's a good idea to add them.

For the other points, I'm not sure what the best option is yet. I suggest we work through them as part of refactoring the API.

I think removing the wrappers is big enough to warrant a separate issue and PR(s). This is basically an additive change, so we could merge it in the meantime.

eric-czech commented 4 years ago

I think removing the wrappers is big enough to warrant a separate issue and PR(s). This is basically an additive change, so we could merge it in the meantime.

👍 , will do. I'll add an issue for it.