sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0
24 stars 6 forks source link

Implement mixed_ploidy support #179

Closed jeromekelleher closed 4 months ago

jeromekelleher commented 4 months ago

The main thing we're missing from sgkit's VCF implementation is support for mixed ploidy. Currently we're assuming that any missing call corresponds to a mixed ploidy, but isn't what we want I think.

jeromekelleher commented 4 months ago

I've been going backwards and forwards on whether this is something we want to do in vcf2zarr, as it revolves around the interpretation of partial calls, rather than the actual data representation.

I think bio2zarr is doing the correct thing at the moment in that it is representing the data in the input VCF in a 1-1 way: if it contains partial calls, these are represented by dimension padding, and can be losslessly recovered by VCF output. How partial calls should be interpreted is context specific, depending on the variant caller and the species in question. So, in my view, the VCF Zarr spec (and vcf2zarr) should stay out of the business of interpretation and just focus on losslessly encoding the data in the input. Downstream tools like sgkit can provide functionality to aid in interpretation by computing and storing permanent mask arrays or dealing with partial calls as missing or mixed ploidy on the fly.

Additionally, there's all sorts of thorny issues to deal with when we could have partial genotype calls, but complete GQ values etc. I would rather keep this can of worms closed, and let it be dealt with on a case-by-case basis at analysis time.

What do you think @timothymillar - is this a reasonable line of responsibility to draw? VCF Zarr doesn't have opinions about how you interpret partial calls, it just represents them faithfully?

timothymillar commented 4 months ago

@jeromekelleher I don't quite understand the issue you're describing. VCF explicitly differentiates between partial calls and variable ploidy in a way that can be losslessly represented in Zarr/ndarrays. For example, the following snippet includes a diploid call, a partial tetraploid call, and a complete tetraploid call:

FORMAT  SAMPLE1 SAMPLE2 SAMPLE3
GT  0/1 0/1/./. 0/1/1/1

This can be converted to the call_genotype array:

[[0, 1, -2, -2], [0, 1, -1, -1], [0, 1, 1, 1]]

Infact, I would argue that VCF is inherently a mixed-ploidy format as there is no standard for globally specifying a single ploidy level.

dealing with partial calls as missing or mixed ploidy on the fly

I don't think that partial calls are the issue here, the main issue is that many software packages don't follow the VCF spec with regard to completely missing calls. It's common practice to represent a missing call as a single . irrespective of ploidy level. Under a strict interpretation of the VCF spec, a single . should be interpreted as a haploid with unknown genotype (an unknown diploid call would be ./.). So, defaulting to a strict interpretation of the VCF spec will cause issues in practice.

The current vcf_to_zarr function in sgkit will, by default, convert lower-ploidy calls to partial-calls of the specified ploidy (e.g., 0/1 will be treated as 0/1/./. when ploidy=4). This maintains backwards compatibility (prior to supporting mixed-ploidy), and it also handles . calls in a way that most users would expect (. is treated as ./. when ploidy=2). Specifying mixed_ploidy=True results in a strict (correct) interpretation of the VCF spec. So, the current status quo is that we default to doing the wrong thing because it matches user expectations.

Additionally, when specifying mixed_ploidy=True, we also add a mixed_ploidy=True attribute to the call_genotype array. This was intended as a simple way to differentiate between mixed-ploidy and fixed-ploidy array for the sake of method dispatch. It seemed like a better option than having to check the ploidy of every genotype call. But perhaps we should review how necessary this attribute is now that the Zarr spec is more concrete?

timothymillar commented 4 months ago

I guess one issue with our default behavior is that, in the presence of . calls, a round-trip doesn't result in an identical VCF. But this is because we are correcting the input VCF. This fits with the principle of flexible input and strict output.

we could have partial genotype calls, but complete GQ values

Could you elaborate on this? I don't see how GQ values can be affected by the interpretation of partial calls.

jeromekelleher commented 4 months ago

That's very helpful @timothymillar, thanks. I'm in agreement with you on the correctness point, I think the only question is whether we should be baking in corrections to fit people's expectations into the specification, or we should make the features you discuss part of upstream software like sgkit? So, not something that we do as part of the actual conversion process, but something that we offer as a dataset update operation.

I think it would be cleaner and simpler if we just let the spec describe the data, and leave issues of interpretation to higher level tools. Currently we just say

A value of -1 indicates missing, and -2 indicates fill in mixed-ploidy datasets.

without defining what a "mixed ploidy dataset" means, and what the interpretation of "." vs "./." should be.

FWIW, it seems that cyvcf2 automatically does the same thing as the correct mixed ploidy interpretation, so I'm not sure the behaviour will inconvenience users much in practise.

jeromekelleher commented 4 months ago

Note that vcf2zarr detects ploidy automatically as the maximum dimension of a genotype call. Generally it's much better at just converting the data as it is without expecting so much user knowledge and intervention, because of the two-pass approach.

timothymillar commented 4 months ago

I have no complaint about defaulting to a strict interpretation of the VCF spec (it's more convenient for me!). We just need to think about where it might be an issue in sgkit methods. But I suspect that it'll mostly go unnoticed as a [-1,-2] call will generally have the same result as a [-1,-1] call. This may allow us to deprecate the mixed_ploidy attribute of the call_genotype array, which would be nice.

If it does become a common issue, we could add a force_fixed_ploidy argument to treat all samples as the maximum ploidy of the dataset. This may be more convenient and efficient than manually changing it in sgkit/xarray.

without defining what a "mixed ploidy dataset" means, and what the interpretation of "." vs "./." should be.

I agree that the current wording isn't ideal. It sounds like mixed ploidy datasets are treated separately in the spec, which they shouldn't be. I would suggest simplifying to "A value of -1 indicates missing, and -2 indicates fill value" with a footnote noting that VCF and VCF-Zarr allow for variable ploidy levels.

cyvcf2 automatically does the same thing as the correct mixed ploidy interpretation

That was us! https://github.com/brentp/cyvcf2/pull/171. Pysam also differentiates correctly. 0/1 results in (0, 1) and 0/1/./. results in (0, 1, None, None).

Note that vcf2zarr detects ploidy automatically as the maximum dimension of a genotype call

Nice, that will simplify things for me!

jeromekelleher commented 4 months ago

That was us! brentp/cyvcf2#171. Pysam also differentiates correctly. 0/1 results in (0, 1) and 0/1/./. results in (0, 1, None, None).

Haha, I was wondering how it all worked so smoothly without me thinking about it! Nice job @timothymillar!

OK, I'm going to mark this as a wontfix, and open an issue on the spec repo to track updating the wording around padding.