sgkit-dev / vcf-zarr-spec

VCF Zarr Specification
Apache License 2.0

Add top-level attribute to denote default phased/unphased status #2

Open jeromekelleher opened 2 years ago

jeromekelleher commented 2 years ago

Currently we assume genotypes are unphased if the phased marker isn't present. However, I'd imagine it's pretty common for all genotypes in a dataset to be either phased or unphased, so requiring the extra storage in the all-phased case seems excessive. Also, we don't want to have to scan the whole array just to find out whether the data is all phased.

So, how about we have a top-level field which tells us the default phased-ness?

tomwhite commented 2 years ago

This would be useful to add.

Should specifying this attribute be required to indicate the default phased-ness, or would it still be permitted to have an array full of true values (phased) or false values (unphased)? In terms of implementation, when converting a VCF file we don't in general know in advance whether it is phased, so we'd have to generate the phased array and then throw it away if all entries turned out to be true or all false.
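To make the conversion step concrete, here is a minimal sketch of the "generate, then throw away" approach. It is only an illustration: the function name `summarise_phasing` and its return convention are hypothetical and not part of the spec or of any existing converter, and the per-variant rows of booleans stand in for whatever array the converter actually builds.

```python
def summarise_phasing(rows):
    """Collapse a phased mask to a single default when it is uniform.

    `rows` is an iterable of per-variant sequences of booleans
    (True = phased). Returns (default, mask):
      (True, None)  if every call is phased
      (False, None) if every call is unphased, or there are no calls
      (None, rows)  if the data is mixed and the full mask must be kept
    Hypothetical helper for illustration only.
    """
    # The full mask has to be materialised first, because uniformity is
    # only known once every call has been seen.
    rows = [[bool(v) for v in row] for row in rows]
    flat = [v for row in rows for v in row]
    if flat and all(flat):
        return True, None
    if not any(flat):
        # Also covers the empty case, matching the current assumption
        # that genotypes are unphased when nothing says otherwise.
        return False, None
    return None, rows


all_phased = summarise_phasing([[True, True], [True, True]])
all_unphased = summarise_phasing([[False, False], [False, False]])
mixed = summarise_phasing([[True, False], [True, True]])
```

Note that the whole mask is still built in memory before it can be discarded, which is the cost being pointed out here.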

Would it be an error to specify both the attribute and the array?

jeromekelleher commented 2 years ago

Hmm, it is tricky all right. I guess in retrospect the actual amount of storage required for an array of all 0s or all 1s is going to be pretty small, so perhaps it's not worth worrying about this. If we start summarising this at the file level, then why not summarise a bunch of other things too?
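A rough way to check the storage claim is to compress a constant buffer and compare sizes. This is a sketch under stated assumptions: `zlib` from the Python standard library stands in for whichever compressor the Zarr store is configured with, one byte per genotype call is assumed, and the array size of one million calls is arbitrary.

```python
import zlib

n_calls = 1_000_000  # arbitrary size, chosen for illustration

# One byte per call: 0 = unphased, 1 = phased (assumed encoding).
buffers = {
    "all unphased": bytes(n_calls),
    "all phased": b"\x01" * n_calls,
}

# Map each label to (raw size, compressed size) in bytes.
sizes = {
    label: (len(raw), len(zlib.compress(raw)))
    for label, raw in buffers.items()
}

for label, (raw_size, compressed_size) in sizes.items():
    print(f"{label}: {raw_size} bytes raw, {compressed_size} bytes compressed")
```

The exact numbers depend on the compressor and chunking, but a constant array compresses to a small fraction of its raw size.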