samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

how to encode missing fields with number>1 #419

Open yfarjoun opened 5 years ago

yfarjoun commented 5 years ago

This can happen in either INFO or FORMAT when an array is missing. For example, a missing PL in the FORMAT field at a diploid, biallelic site can be written as either . or .,.,. Which one is correct? Which one is valid? The text is somewhat vague, and the example provided only covers the case of Number=1.

pd3 commented 5 years ago

Both can be used. Single . for brevity or .,.,. to express ploidy.

lbergelson commented 5 years ago

It seems like . is probably usually preferable for non-GT missing fields.

Does htslib have reasonable support for dealing with partially missing arrays?

A related issue: currently htsjdk doesn't handle non-genotype fields with partially missing values well, if at all.

At the moment, things like AD = 5,.,10 are treated the same as ., which seems wrong. One issue with representing these sorts of things correctly is that Java primitive arrays don't support null values.

Does anyone know how partially missing arrays are handled in htslib?

pd3 commented 5 years ago

Partially missing arrays are fully handled in htslib. They were added later, and at that point the Java implementation made the pragmatic decision to treat partially missing arrays as fully missing. (Which I understand, because it can be quite a pain sometimes.)
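
To make the distinction in this exchange concrete, here is a minimal Python sketch of one way to keep partially missing arrays separate from fully missing ones. It is not how htslib or htsjdk implement this; the names `parse_values` and `missingness` are hypothetical, and it assumes an integer-valued field whose elements are comma-separated, with `None` standing in for a missing element.

```python
from typing import List, Optional

MISSING = "."  # the VCF missing-value token


def parse_values(raw: str) -> List[Optional[int]]:
    """Split a comma-separated integer field, mapping '.' to None.

    Hypothetical helper: per-element None preserves which positions
    are missing instead of collapsing the whole array.
    """
    return [None if tok == MISSING else int(tok) for tok in raw.split(",")]


def missingness(values: List[Optional[int]]) -> str:
    """Classify a parsed array as 'full', 'partial', or 'none' missing."""
    missing = [v is None for v in values]
    if all(missing):
        return "full"
    if any(missing):
        return "partial"
    return "none"


ad = parse_values("5,.,10")
print(ad, missingness(ad))
print(missingness(parse_values(".")))
```

Using an optional element type avoids the limitation mentioned above for primitive arrays, at the cost of boxing each value; a sentinel value per element would be the usual alternative.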

d-cameron commented 6 months ago

Specs-as-written, both are fine.

Section 1.6.2

If a field contains a list of missing values, it can be represented either as a single MISSING value ('.') or as a list of missing values (e.g. '.,.,.' if the field was Number=3).
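
Read that way, a parser can treat the two spellings as equivalent. The following is a rough Python sketch under that assumption; the helper names `is_fully_missing`, `compact` and `expand` are made up for illustration and do not correspond to any htslib or htsjdk function.

```python
MISSING = "."  # the VCF missing-value token


def is_fully_missing(raw: str) -> bool:
    """True if every comma-separated element of the field is '.'."""
    return all(tok == MISSING for tok in raw.split(","))


def compact(raw: str) -> str:
    """Collapse a fully missing list such as '.,.,.' to a single '.'."""
    return MISSING if is_fully_missing(raw) else raw


def expand(raw: str, number: int) -> str:
    """Spell out a fully missing field as `number` missing values.

    `number` is the expected element count, supplied by the caller
    (for example 3 for PL at a diploid, biallelic site).
    """
    return ",".join([MISSING] * number) if is_fully_missing(raw) else raw


# Both spellings are recognised as the same fully missing field.
print(is_fully_missing("."), is_fully_missing(".,.,."))
print(compact(".,.,."), expand(".", 3))
```

Partially missing values such as 5,.,10 pass through both helpers unchanged, so only the fully missing case is normalised.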