samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Should we enable SAMPLE annotation in VCF? #314

Closed by yfarjoun 6 years ago

yfarjoun commented 6 years ago

Currently, in order to provide information about the samples in your VCF (phenotype, gender, pedigree, cohort, sequencing protocol, etc.), you need to resort to an external file (PED or FAM, for example, or roll-your-own). This seems like an oversight that could be addressed in the VCF spec.

We could add header lines that include per-sample information, for example:

##SAMPLE=<ID=GENDER,Type=Integer,Description="A flag which is 1 for male 2 for female and 9 for unknown">1<TAB>2<TAB>2<TAB>1<TAB>1
##SAMPLE=<ID=HEIGHT,Type=Float,Description="the height of the individual, in meters">1.71<TAB>2.01<TAB>1.64<TAB>1.57<TAB>1.85
##SAMPLE=<ID=Cohort,Type=String,Description="The cohort in which the sample was recruited">1000Genomes<TAB>HapMap<TAB>Diabetes<TAB>All of Us<TAB>UKBB
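
To make the proposed layout concrete, here is a rough Python sketch of how a tool might read one such line. It assumes the format exactly as written above (one structured `##SAMPLE=<...>` block followed by tab-separated values in sample-column order) and that the values themselves never contain `>`. The function name `parse_sample_annotation` and the sample names `S1`..`S5` are made up for illustration; nothing here comes from the VCF spec or from any existing library.

```python
import re

# Casts for the three Type values used in the proposal above.
CASTS = {"Integer": int, "Float": float, "String": str}

# key=value pairs, where the value is either a quoted string or runs to the next comma.
FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_sample_annotation(line, sample_names):
    """Parse one proposed-style ##SAMPLE line (hypothetical format, not in the spec)."""
    prefix = "##SAMPLE=<"
    if not line.startswith(prefix):
        raise ValueError("not a ##SAMPLE line")
    # Split at the last '>': metadata on the left, per-sample values on the right.
    meta_text, _, value_text = line[len(prefix):].rstrip("\n").rpartition(">")
    meta = {key: value.strip('"') for key, value in FIELD.findall(meta_text)}
    cast = CASTS[meta["Type"]]
    values = [cast(v) for v in value_text.split("\t")]
    if len(values) != len(sample_names):
        raise ValueError("number of values does not match number of samples")
    return meta, dict(zip(sample_names, values))


samples = ["S1", "S2", "S3", "S4", "S5"]  # hypothetical sample column names
line = ('##SAMPLE=<ID=HEIGHT,Type=Float,Description="the height of the '
        'individual, in meters">1.71\t2.01\t1.64\t1.57\t1.85')
meta, per_sample = parse_sample_annotation(line, samples)
```

The length check is the main point: because the values are positional, a reader has to reject any line whose value count differs from the number of sample columns.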

Putting the sample-level information into the VCF would enable tools that change the sample list (merging VCFs or selecting samples) to modify the sample-level information at the same time, which would be safer than doing it in two separate steps (modifying the VCF(s) and then modifying the metadata files accordingly).
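
As a sketch of what "at the same time" could look like, the snippet below subsets the sample list and every per-sample annotation with the same set of indices, so the two cannot drift apart. The in-memory representation (a list of sample names plus a dict of value lists in the same order) and the name `select_samples` are assumptions for illustration only.

```python
def select_samples(sample_names, annotations, keep):
    """Subset samples and their positional annotations in a single step.

    sample_names: list of sample column names, in VCF column order.
    annotations:  dict mapping attribute ID -> list of values aligned with sample_names.
    keep:         set of sample names to retain (original order is preserved).
    """
    indices = [i for i, name in enumerate(sample_names) if name in keep]
    new_names = [sample_names[i] for i in indices]
    new_annotations = {
        attr: [values[i] for i in indices] for attr, values in annotations.items()
    }
    return new_names, new_annotations


# Hypothetical sample names; values taken from the example header lines above.
names = ["S1", "S2", "S3", "S4", "S5"]
annotations = {
    "GENDER": [1, 2, 2, 1, 1],
    "HEIGHT": [1.71, 2.01, 1.64, 1.57, 1.85],
}
kept_names, kept_annotations = select_samples(names, annotations, {"S2", "S5"})
```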

I'm not married to the format I proposed above, but I wanted to give a definite proposal to start the discussion...

Any opinions?

jmarshall commented 6 years ago

This seems pretty similar to the ##META and ##SAMPLE lines defined by example in VCFv4.3 §1.4.8 Sample field format… maybe what's needed is some more detail in that section!

yfarjoun commented 6 years ago

Ah, right... while it's not in my preferred orientation (I'd prefer long lines to many lines...), it will do.
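
For what it's worth, the two orientations carry the same information, and going from one to the other is just a transpose. Below is a rough sketch that turns the one-line-per-attribute layout proposed above into one `##SAMPLE=<ID=...,key=value,...>` line per sample, which is the general shape of the example in the VCFv4.3 spec. The function name, the sample names, and the quoting rule (quote any value containing a space, comma, or double quote) are my assumptions, not something taken from the spec.

```python
def to_per_sample_lines(sample_names, annotations):
    """Transpose attribute-major annotations into one ##SAMPLE line per sample."""

    def fmt(value):
        text = str(value)
        # Assumed quoting rule: quote values that would otherwise break key=value parsing.
        if any(ch in text for ch in ' ,"'):
            return '"' + text.replace('"', '\\"') + '"'
        return text

    lines = []
    for i, name in enumerate(sample_names):
        fields = ["ID=" + fmt(name)]
        fields += [attr + "=" + fmt(values[i]) for attr, values in annotations.items()]
        lines.append("##SAMPLE=<" + ",".join(fields) + ">")
    return lines


# Hypothetical sample names; values taken from the proposal earlier in the thread.
header_lines = to_per_sample_lines(
    ["S1", "S2"],
    {"GENDER": [1, 2], "Cohort": ["HapMap", "All of Us"]},
)
```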