related-sciences / gwas-analysis

GWAS data analysis experiments
Apache License 2.0

Add bgen-reader backend #38

Closed tomwhite closed 4 years ago

tomwhite commented 4 years ago

The code in #36 uses PyBGEN to read BGEN files. This (incomplete) PR adds a bgen-reader backend so we can compare the two libraries. A few differences/comments:

The main problem with this code as it stands is that it doesn't create one of the genetic datasets defined in https://nbviewer.jupyter.org/github/related-sciences/gwas-analysis/blob/master/notebooks/platform/xarray/data_structures.ipynb.

Each variant/sample has three probabilities (AA, AB, BB), but none of the data structures support that, unless I'm missing something. GenotypeProbabilityDataset looks like the closest fit, but it appears to require phased data, since there are four allele/ploidy combinations. Alistair's comment in https://discourse.smadstatgen.org/t/n-d-array-conventions-for-variation-data/16/8 seems relevant:

One way of representing genotype likelihoods would be as a 3D array of floats, where the first dimension was variants, the second dimension was samples, and third dimension was the number of possible genotypes.

I wonder if we should add support for this data representation too.
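To make the quoted layout concrete, here is a minimal NumPy sketch of that 3D representation (not code from this PR; the shape and values are made up for illustration, and the genotype axis assumes a diploid, bi-allelic site):

```python
import numpy as np

# Axis 0 = variants, axis 1 = samples, axis 2 = possible genotypes
# (AA, AB, BB for a diploid, bi-allelic site). Values are illustrative.
n_variants, n_samples, n_genotypes = 4, 3, 3
probs = np.full((n_variants, n_samples, n_genotypes), 1.0 / n_genotypes)

# For each variant/sample, the probabilities along the genotype
# axis should sum to 1.
assert np.allclose(probs.sum(axis=-1), 1.0)
```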

Thoughts @eric-czech?

tomwhite commented 4 years ago

One other thing I noticed that may be useful in the future is that bgen-reader supports variable ploidy and number of alleles, unlike PyBGEN which is restricted to diploid, bi-allelic BGEN files.

eric-czech commented 4 years ago

Very nice! Two options that come to mind for the representation are:

  1. Treat bgen more like VCF: a container of multiple different types of datasets, where neither has a model type whose structure/content we try to assert
    • The genotype probabilities are rarely used directly afaik, and are instead converted to either hard calls (using max probability) or dosages as P(AB) + 2 * P(BB) so the result from your io backend could return all 3. If the transformed arrays are dask arrays then presumably this wouldn't be inefficient if they're never used. This is the approach Hail takes in import_bgen.
  2. Convert GenotypeProbabilityDataset to have 3 dimensions matching what gets stored in BGEN, rather than 4 like I originally proposed. I only did that originally based on this discussion, where I'm saying that less space is needed in a non-combinatorial representation with high ploidy organisms. It's probably easier to just do what everyone else does though and assume there will be elements in the array for each joint probability.
    • Going this route, the conversion to something more useful would be like core.create_genotype_probability_dataset(vars).to.genotype_dosage_dataset()
    • How do you feel about that?
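The two conversions mentioned in option 1 (hard calls via max probability, dosage as P(AB) + 2 * P(BB)) can be sketched like this. The `probs` array is made up for illustration; in the io backend these would presumably be lazy dask arrays rather than NumPy arrays, so unused results cost nothing:

```python
import numpy as np

# Hypothetical (variants, samples, genotypes) probability array; the last
# axis holds P(AA), P(AB), P(BB) for a diploid, bi-allelic site.
probs = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.2, 0.7]],
    [[0.2, 0.5, 0.3], [0.0, 0.0, 1.0]],
])

# Alternate-allele dosage: P(AB) + 2 * P(BB)
dosage = probs[..., 1] + 2 * probs[..., 2]

# Hard calls: the genotype (0, 1, or 2 alt alleles) with the
# highest probability
hard_calls = np.argmax(probs, axis=-1)
```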
tomwhite commented 4 years ago

Thanks for setting out some options @eric-czech. I've tried approach 2 in this latest update.

I added a GenotypeProbabilityAltDataset type of shape (variants, samples, genotypes), i.e. the last dimension is 3. Now when loading using bgen-reader we get the probabilities, and these can be converted to dosages with a single to.genotype_dosage_dataset() call. I think it's best to do it this way, i.e. load what's in the BGEN file (probabilities) and have the user explicitly convert to dosages if they want them. (I need to change the PyBGEN reader to be consistent with this.)

If we go this route, we should probably just change GenotypeProbabilityDataset to be what I've called GenotypeProbabilityAltDataset.

Another thing I noticed is that to.genotype_dosage_dataset() loses variant and sample variables, which seems like a bug.

eric-czech commented 4 years ago

I think it's best to do it this way, i.e. load what's in the BGEN file (probabilities) and have the user explicitly convert to dosages if they want them.

👍

If we go this route, we should probably just change GenotypeProbabilityDataset to be what I've called GenotypeProbabilityAltDataset

Do we want to just do away with those wrappers now? My intention with them was mostly to organize code and attach attributes for later optimizations that aren't really necessary yet (or maybe ever), and certainly not to actually override internal Xarray methods, so perhaps this reader would be a good example for what reading in dosage data or hard calls would look like otherwise?

tomwhite commented 4 years ago

Do we want to just do away with those wrappers now?

Yes, I think so. So in this case the read_bgen function would return a (DIM_VARIANT, DIM_SAMPLE, DIM_GENOTYPE) sized array (as an xarray obviously), containing the probabilities. And there would be a function for converting to dosages. Does that sound about right?
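A rough sketch of what that could look like, assuming hypothetical probs_to_dataarray and genotype_dosage helpers (neither exists in this PR; the dimension names follow the DIM_* constants above, and a real read_bgen would build the array from the file rather than take it as an argument):

```python
import numpy as np
import xarray as xr

DIM_VARIANT, DIM_SAMPLE, DIM_GENOTYPE = "variant", "sample", "genotype"

def probs_to_dataarray(probs: np.ndarray) -> xr.DataArray:
    """Wrap a (variants, samples, genotypes) probability array in the
    shape a read_bgen function could return."""
    return xr.DataArray(probs, dims=(DIM_VARIANT, DIM_SAMPLE, DIM_GENOTYPE))

def genotype_dosage(da: xr.DataArray) -> xr.DataArray:
    """Convert genotype probabilities to alt-allele dosages,
    P(AB) + 2 * P(BB), dropping the genotype dimension."""
    return da.isel({DIM_GENOTYPE: 1}) + 2 * da.isel({DIM_GENOTYPE: 2})
```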

eric-czech commented 4 years ago

Yes, I think so. So in this case the read_bgen function would return a (DIM_VARIANT, DIM_SAMPLE, DIM_GENOTYPE) sized array (as an xarray obviously), containing the probabilities. And there would be a function for converting to dosages. Does that sound about right?

Yep. Some things are probably worth thinking about, though, if the datasets are returned from readers as-is:

tomwhite commented 4 years ago

Should they have coordinates?

If we have them, then I think it's a good idea to add them.

For the other points, I'm not sure what the best option is yet. I suggest we work through them as part of refactoring the API.

I think removing the wrappers is big enough to warrant a separate issue and PR(s). This is basically an additive change, so we could merge it in the meantime.

eric-czech commented 4 years ago

I think removing the wrappers is big enough to warrant a separate issue and PR(s). This is basically an additive change, so we could merge it in the meantime.

👍 , will do. I'll add an issue for it.