sgkit-dev / vcf-zarr-spec

VCF Zarr Specification
Apache License 2.0
11 stars 2 forks source link

Clarify status of negative integers in spec #12

Open jeromekelleher opened 7 months ago

jeromekelleher commented 7 months ago

We use -1 to denote missing data and -2 to denote dimension padding for integer fields. This means that we cannot use these as normal values in the VCF.

Clarify what users and implementations are supposed to do in the event negative integer data.

jeromekelleher commented 7 months ago

A reasonable thing to do may be to say "we don't support negative intetgers use a Float instead". That would probably lead to fewer errors, as I think the assumption that missing data is < 0 is baked in pretty well to sgkit now (and it's a useful assumption!)

tomwhite commented 7 months ago

We've used the float values from BCF for missing and fill, so perhaps we should consider what it would take to align with the BCF encoding for missing and fill values for integers?

These are dependent on the width of the integer, so for 8-bit ints missing is 0x80, whereas for 16-bit ints it is 0x8000, but they still satisfy < 0 (since they are signed ints).

jeromekelleher commented 7 months ago

I think the approach we have take is is less likely to result in user-error, because distinguishing between missing and pad values requires knowing the type, and to me that would be a pernicious source of bugs in code that assume (e.g.) that genotypes are always 8 bits. Using < 0 as a blanket missingness mask seems wasteful, and would also lead to subtle bugs where missingness and padding are mixed up (e.g., when calculating the denominator in some stats).

That's how I would argue the point for the paper anyway.

timothymillar commented 7 months ago

I'm a bit torn about this but I think Jerome is correct regarding user error. Maybe we can eventually move to a custom vcfint dtype (see NEP42). That may allow for a user-safe implementation of the BCF sentinel values and a safe migration path. Depending on compatibility with Zarr.

jeromekelleher commented 7 months ago

I think we need to think more widely beyond Numpy and python here as well - zarr is cross language, so we want code that can be compiled with these sentinels and be correct. In this context fixed values across integer types is much safer.