Feature Request: structured data in VCF:INFO (XML, JSON ?)

samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats

http://samtools.github.io/hts-specs/

Feature Request: structured data in VCF:INFO (XML, JSON ?) #75

Closed lindenb closed 1 year ago

lindenb commented 9 years ago

(cross posted on http://sourceforge.net/p/vcftools/mailman/message/33615341/ )

I wish I could insert structured data in a VCF, like the data produced by VEP+REST, EVS+SOAP, etc. (see below). Would it be possible to store XML||JSON in addition to the already existing types: Integer, Float, Flag, Character, and String? Trying to store the data in a simple string makes no sense to me:

CSQ=G|ENSG00000223972|ENST00000456328|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|1413|||||||1|DDX11L1|HGNC|37102|||,G|ENSG00000227232|ENST00000541675|Transcript|downstream_gene_variant|||||||198|-1|WASH7P|HGNC|38034|||,G|653635|NR_024540.1|Transcript|downstream_gene_variant|||||||197|-1|WASH7P||38034|||,G|ENSG00000223972|ENST00000515242|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|1406|||||||1|DDX11L1|HGNC|37102|||,G|ENSESTG00000013896|ENSESTT00000034761|Transcript|downstream_gene_variant|||||||242|-1||||||,G|100287102|NR_046018.2|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|1408|||||||1|DDX11L1||37102|||,G|ENSG00000227232|ENST00000538476|Transcript|downstream_gene_variant|||||||246|-1|WASH7P|HGNC|38034|||,G|ENSG00000223972|ENST00000518655|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|1239|||||||1|DDX11L

One problem that would arise: how can we store documents that contain spaces?

Pierre

VEP+ XML:

<?xml version="1.0"?>
<opt>
  <data id="rs148327885" allele_string="C/T" assembly_name="GRCh37" end="878331" input="1 878331 rs148327885 C T . . ." most_severe_consequence="missense_variant" seq_region_name="1" start="878331" strand="1">
    <colocated_variants id="rs148327885" aa_allele="T" aa_maf="0.001453" allele_string="C/T" amr_allele="T" amr_maf="0.02" asn_allele="T" asn_maf="0.0035" ea_allele="T" ea_maf="0.010049" end="878331" eur_allele="T" eur_maf="0.01" minor_allele="T" minor_allele_freq="0.0087" seq_region_name="1" somatic="0" start="878331" strand="1"/>
    <transcript_consequences biotype="protein_coding" distance="3660" gene_id="ENSG00000187634" gene_symbol="SAMD11" gene_symbol_source="HGNC" hgnc_id="28706" protein_id="ENSP00000411579" strand="1" transcript_id="ENST00000420190" variant_allele="T">
      <consequence_terms>downstream_gene_variant</consequence_terms>
    </transcript_consequences>
     (...)
  </data>
</opt>

or JSON:

[
    {
        "allele_string": "C/T", 
        "assembly_name": "GRCh37", 
        "colocated_variants": [
            {
                "aa_allele": "T", 
                "aa_maf": 0.0014530000000000001, 
                "allele_string": "C/T", 
                "amr_allele": "T", 
                "amr_maf": 0.02, 
                "asn_allele": "T", 
                "asn_maf": 0.0035000000000000001, 
                "ea_allele": "T", 
                "ea_maf": 0.010049000000000001, 
                "end": 878331, 
                "eur_allele": "T", 
                "eur_maf": 0.01, 
                "id": "rs148327885", 
                "minor_allele": "T", 
                "minor_allele_freq": 0.0086999999999999994, 
                "seq_region_name": "1", 
                "somatic": 0, 
                "start": 878331, 
                "strand": 1
            }
        ], 
        "end": 878331, 
        "id": "rs148327885", 
        "input": "1 878331 rs148327885 C T . . .", 
        "most_severe_consequence": "missense_variant", 
        "seq_region_name": "1", 
        "start": 878331, 
        "strand": 1, 
        "transcript_consequences": [
            {
                "biotype": "protein_coding", 
                "consequence_terms": [
                    "downstream_gene_variant"
                ], 
                "distance": 3660, 
                "gene_id": "ENSG00000187634", 
                "gene_symbol": "SAMD11", 
                "gene_symbol_source": "HGNC", 
                "hgnc_id": 28706, 
                "protein_id": "ENSP00000411579", 
                "strand": 1, 
                "transcript_id": "ENST00000420190", 
                "variant_allele": "T"
            },

or EVS output :

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns3:local xmlns:ns2="http://webservice.evs.gs.washington.edu/" xmlns:ns3="uri">
<chromosome>1</chromosome>
<start>120457968</start>
<stop>120457969</stop>
<strand>+</strand>
<snpList>
<positionString>1:120457968</positionString>
<chrPosition>120457968</chrPosition>
<alleles>T/C</alleles>
<uaAlleleCounts>1/2701</uaAlleleCounts>
<aaAlleleCounts>0/2176</aaAlleleCounts>
<totalAlleleCounts>1/4877</totalAlleleCounts>
<uaAlleleAndCount>T=1/C=2701</uaAlleleAndCount>
<aaAlleleAndCount>T=0/C=2176</aaAlleleAndCount>
<totalAlleleAndCount>T=1/C=4877</totalAlleleAndCount>
<uaMAF>0.037</uaMAF>
<aaMAF>0.0</aaMAF>
<totalMAF>0.0205</totalMAF>
<avgSampleReadDepth>198</avgSampleReadDepth>
<geneList>NOTCH2</geneList>
<snpFunction>
<chromosome>1</chromosome>
<position>120457968</position>
<conservationScore>1.0</conservationScore>
<conservationScoreGERP>5.5</conservationScoreGERP>
<snpFxnList>
<mrnaAccession>NM_024408</mrnaAccession>
<fxnClassGVS>missense</fxnClassGVS>
<aminoAcids>MET,ILE</aminoAcids>
<proteinPos>2459/2472</proteinPos>
<cdnaPos>7377</cdnaPos>
<pphPrediction>unknown</pphPrediction>
<granthamScore>10</granthamScore>
</snpFxnList>
<refAllele>C</refAllele>
<ancestralAllele>C</ancestralAllele>
<firstRsId>0</firstRsId>
<secondRsId>0</secondRsId>
<filters>PASS</filters>
<clinicalLink>unknown</clinicalLink>
</snpFunction>
<conservationScore>1.0</conservationScore>
<conservationScoreGERP>5.5</conservationScoreGERP>
<refAllele>C</refAllele>
<altAlleles>T</altAlleles>
<ancestralAllele>C</ancestralAllele>
<chromosome>1</chromosome>
<hasAtLeastOneAccession>true</hasAtLeastOneAccession>
<rsIds>none</rsIds>
<filters>PASS</filters>
<clinicalLink>unknown</clinicalLink>
</snpList>
</ns3:local>

LeeTL1220 commented 9 years ago

This seems like it would be pretty arduous to support for tool developers. Can there be a larger discussion?

pd3 commented 9 years ago

@LeeTL1220 Agreed, and this is the right place (and time?) to discuss it.

There have also been a couple of emails exchanged about this on the vcftools-spec mailing list, where Pierre proposed a more compact variant:

 CSQ=(genes((name "my gene1") (chromStart 9)) ((name "my gene2")  (chromStart 10)))
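To make the compact notation concrete, here is a minimal sketch of how such a parenthesised value could be read into nested lists. Nobody in the thread specified a grammar, so the tokenisation rules (double-quoted strings without escapes, whitespace-separated atoms) and the function name `parse_sexpr` are assumptions made for illustration only.

```python
import re

# Tokens: "(", ")", a double-quoted string (no escape handling), or a bare atom.
_TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse_sexpr(text):
    """Parse one parenthesised expression into nested Python lists of strings."""
    tokens = _TOKEN.findall(text)
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(parse())
            if pos >= len(tokens):
                raise ValueError("missing closing parenthesis")
            pos += 1  # consume the ")"
            return items
        if token == ")":
            raise ValueError("unexpected closing parenthesis")
        if token.startswith('"'):
            return token[1:-1]  # strip the surrounding quotes
        return token  # atoms stay strings; no numeric conversion is attempted

    result = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return result


value = '(genes((name "my gene1") (chromStart 9)) ((name "my gene2") (chromStart 10)))'
parsed = parse_sexpr(value)
```

Note that the quoted strings contain spaces, so this notation would still run into the whitespace question raised in the original post.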

tfenne commented 9 years ago

I would really like to see support in the VCF spec for encoding/storing JSON in the INFO field. I think JSON has the right combination of terseness, no specific requirements for whitespace (unlike, e.g., YAML), and complexity (or lack thereof).

I don't think that this need become a burden to tool/API developers. The INFO field already supports storing arbitrarily complex strings that have their own pseudo-encoding, but there's no expectation that tool/API developers write code that parses all of these custom formats. In fact I think supporting a single well-defined mechanism for storing structured data in VCF would lessen the burden on tool developers, as eventually everyone would converge on using that format instead of home-grown versions.

VEP, SNPEff and other functional annotators make the need for this very clear. I've spent a decent amount of time trying to think about creating either a standard or best practices for storing basic functional annotation data in VCF, and each time arrive at the conclusion that without a better way to store structured information it's just not worth it.

What would be the barriers to storing JSON in the INFO field? The only major one that jumps out to me is the possible need to have spaces in JSON (to support quoted strings included in JSON).

lindenb commented 9 years ago

I've suggested before: a simple(?) way to specify the existence of structured data (JSON or XML) (I'm not sure that creating yet another new format would be a good idea) would be to encode the document using base64 ( http://en.wikipedia.org/wiki/Base64 ) and to declare it in the ##INFO header.

something like ContentType="text/xml"

##INFO=<ID=EFF,ContentType="text/xml",Number=.,Type=String,Description="Predicted effects for this variant...">
(...)
1   878331  .   C   A,T .   .   EFF=PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9InllcyI/Pgo8bnMzOmxvY2FsIHhtbG5zOm5zMj0iaHR0cDovL3dlYnNlcnZpY2UuZXZzLmdzLndhc2hpbmd0b24uZWR1LyIgeG1sbnM6bnMzPSJ1cmkiPgogIDxjaHJvbW9zb21lPjE8L2Nocm9tb3NvbWU+CiAgPHN0YXJ0PjEyMDQ1Nzk2ODwvc3RhcnQ+CiAgPHN0b3A+MTIwNDU3OTY5PC9zdG9wPgogIDxzdHJhbmQ+Kzwvc3RyYW5kPgogIDxzbnBMaXN0PgogICAgPHBvc2l0aW9uU3RyaW5nPjE6MTIwNDU3OTY4PC9wb3NpdGlvblN0cmluZz4KICAgIDxjaHJQb3NpdGlvbj4xMjA0NTc5Njg8L2NoclBvc2l0aW9uPgogICAgPGFsbGVsZXM+VC9DPC9hbGxlbGVzPgogICAgPHVhQWxsZWxlQ291bnRzPjEvMjcwMTwvdWFBbGxlbGVDb3VudHM+CiAgICA8YWFBbGxlbGVDb3VudHM+MC8yMTc2PC9hYUFsbGVsZUNvdW50cz4KICAgIDx0b3RhbEFsbGVsZUNvdW50cz4xLzQ4Nzc8L3RvdGFsQWxsZWxlQ291bnRzPgogICAgPHVhQWxsZWxlQW5kQ291bnQ+VD0xL0M9MjcwMTwvdWFBbGxlbGVBbmRDb3VudD4KICAgIDxhYUFsbGVsZUFuZENvdW50PlQ9MC9DPTIxNzY8L2FhQWxsZWxlQW5kQ291bnQ+CiAgICA8dG90YWxBbGxlbGVBbmRDb3VudD5UPTEvQz00ODc3PC90b3RhbEFsbGVsZUFuZENvdW50PgogICAgPHVhTUFGPjAuMDM3PC91YU1BRj4KICAgIDxhYU1BRj4wLjA8L2FhTUFGPgogICAgPHRvdGFsTUFGPjAuMDIwNTwvdG90YWxNQUY+CiAgICA8YXZnU2FtcGxlUmVhZERlcHRoPjE5ODwvYXZnU2FtcGxlUmVhZERlcHRoPgogICAgPGdlbmVMaXN0Pk5PVENIMjwvZ2VuZUxpc3Q+CiAgICA8c25wRnVuY3Rpb24+CiAgICAgIDxjaHJvbW9zb21lPjE8L2Nocm9tb3NvbWU+CiAgICAgIDxwb3NpdGlvbj4xMjA0NTc5Njg8L3Bvc2l0aW9uPgogICAgICA8Y29uc2VydmF0aW9uU2NvcmU+MS4wPC9jb25zZXJ2YXRpb25TY29yZT4KICAgICAgPGNvbnNlcnZhdGlvblNjb3JlR0VSUD41LjU8L2NvbnNlcnZhdGlvblNjb3JlR0VSUD4KICAgICAgPHNucEZ4bkxpc3Q+CiAgICAgICAgPG1ybmFBY2Nlc3Npb24+Tk1fMDI0NDA4PC9tcm5hQWNjZXNzaW9uPgogICAgICAgIDxmeG5DbGFzc0dWUz5taXNzZW5zZTwvZnhuQ2xhc3NHVlM+CiAgICAgICAgPGFtaW5vQWNpZHM+TUVULElMRTwvYW1pbm9BY2lkcz4KICAgICAgICA8cHJvdGVpblBvcz4yNDU5LzI0NzI8L3Byb3RlaW5Qb3M+CiAgICAgICAgPGNkbmFQb3M+NzM3NzwvY2RuYVBvcz4KICAgICAgICA8cHBoUHJlZGljdGlvbj51bmtub3duPC9wcGhQcmVkaWN0aW9uPgogICAgICAgIDxncmFudGhhbVNjb3JlPjEwPC9ncmFudGhhbVNjb3JlPgogICAgICA8L3NucEZ4bkxpc3Q+CiAgICAgIDxyZWZBbGxlbGU+QzwvcmVmQWxsZWxlPgogICAgICA8YW5jZXN0cmFsQWxsZWxlPkM8L2FuY2VzdHJhbEFsbGVsZT4KICAgICAgPGZpcnN0UnNJZD4wPC9maXJzdFJzSWQ+CiAgICAgIDxz
ZWNvbmRSc0lkPjA8L3NlY29uZFJzSWQ+CiAgICAgIDxmaWx0ZXJzPlBBU1M8L2ZpbHRlcnM+CiAgICAgIDxjbGluaWNhbExpbms+dW5rbm93bjwvY2xpbmljYWxMaW5rPgogICAgPC9zbnBGdW5jdGlvbj4KICAgIDxjb25zZXJ2YXRpb25TY29yZT4xLjA8L2NvbnNlcnZhdGlvblNjb3JlPgogICAgPGNvbnNlcnZhdGlvblNjb3JlR0VSUD41LjU8L2NvbnNlcnZhdGlvblNjb3JlR0VSUD4KICAgIDxyZWZBbGxlbGU+QzwvcmVmQWxsZWxlPgogICAgPGFsdEFsbGVsZXM+VDwvYWx0QWxsZWxlcz4KICAgIDxhbmNlc3RyYWxBbGxlbGU+QzwvYW5jZXN0cmFsQWxsZWxlPgogICAgPGNocm9tb3NvbWU+MTwvY2hyb21vc29tZT4KICAgIDxoYXNBdExlYXN0T25lQWNjZXNzaW9uPnRydWU8L2hhc0F0TGVhc3RPbmVBY2Nlc3Npb24+CiAgICA8cnNJZHM+bm9uZTwvcnNJZHM+CiAgICA8ZmlsdGVycz5QQVNTPC9maWx0ZXJzPgogICAgPGNsaW5pY2FsTGluaz51bmtub3duPC9jbGluaWNhbExpbms+CiAgPC9zbnBMaXN0Pgo8L25zMzpsb2NhbD4K
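A minimal sketch of the base64 round trip described above, using only the Python standard library. The function names are hypothetical. One detail is an assumption layered on top of the proposal: standard base64 may end with `=` padding, and since `=` already separates key and value in INFO, this sketch strips the padding on write and restores it on read.

```python
import base64


def xml_to_info_value(xml_text):
    """Encode an XML document as a base64 string for use as an INFO value."""
    encoded = base64.b64encode(xml_text.encode("utf-8")).decode("ascii")
    # Assumption: drop '=' padding so the value cannot clash with key=value parsing.
    return encoded.rstrip("=")


def info_value_to_xml(value):
    """Restore the padding removed above and decode back to the XML text."""
    padded = value + "=" * (-len(value) % 4)
    return base64.b64decode(padded).decode("utf-8")


document = '<?xml version="1.0"?>\n<opt>\n  <consequence_terms>downstream_gene_variant</consequence_terms>\n</opt>\n'
info_field = "EFF=" + xml_to_info_value(document)
```

The encoded value contains no whitespace or semicolons, which is what makes it safe to embed, but it is also what makes it unreadable without a decoding step.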

d-cameron commented 9 years ago

There was discussion back in the middle of last year on the VCFtools-spec mailing list regarding allowing INFO/FORMAT fields to represent strings with reserved characters. A number of escape/encoding schemes were proposed, but it looks like none of the proposed schemes made it into the v4.3 draft. If the string field were able to encode/decode reserved characters (e.g. percent encoding of all VCF special characters), then representing JSON/XML in VCF would be trivial.

pd3 commented 9 years ago

@d-cameron The draft does say that characters with special meaning (such as ';' in INFO, ':' in FORMAT, and '%' in both) can be encoded using URL encoding.
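As a rough sketch of what percent-encoding a JSON value for INFO could look like: the set of escaped characters below (`%`, `:`, `;`, `=`, `,`, CR, LF, TAB) is my reading of the VCF encoding rules, and escaping the space as well is an extra assumption made to stay on the safe side. The function names and the sample annotation (values taken from the VEP example earlier in the thread) are illustrative only.

```python
import json
from urllib.parse import unquote

# Characters escaped in this sketch; the trailing space is an added assumption.
_RESERVED = "%:;=,\r\n\t "
_ESCAPES = {ch: "%{:02X}".format(ord(ch)) for ch in _RESERVED}


def encode_info_value(text):
    """Percent-encode only the characters that are special inside a VCF field."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def decode_info_value(value):
    """Reverse the encoding; unquote() decodes every %XX sequence."""
    return unquote(value)


annotation = {
    "gene_symbol": "SAMD11",
    "consequence_terms": ["downstream_gene_variant"],
    "distance": 3660,
}
# Compact separators avoid emitting spaces that would then need escaping.
encoded = encode_info_value(json.dumps(annotation, separators=(",", ":")))
decoded = json.loads(decode_info_value(encoded))
```

Unlike base64, the result keeps words such as `downstream_gene_variant` intact, so a plain-text search still works on the encoded record.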

willmclaren commented 9 years ago

Hi all, coming to this a little late. I'm the lead developer on VEP, so just a note from my perspective. I also contributed to the discussion on the mailing list last year, but a few things have changed since then.

We decided to force VEP results into VCF as a delimited string, avoiding reserved VCF characters. In places where there would be clashes, we have converted characters in different ways that I'm not totally happy with: " " becomes "_" (lossy), "=" becomes "%3D" (URL-encoded, confusing because the conversion isn't applied to other characters), and "," becomes "&".
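The lossiness mentioned above can be shown in a few lines. This is not VEP's code; `vep_style_escape` is a hypothetical helper that only mimics the three substitutions listed in the comment.

```python
def vep_style_escape(text):
    """Apply the three substitutions described above (illustrative only)."""
    return text.replace("=", "%3D").replace(",", "&").replace(" ", "_")


# Two different inputs collapse to the same output, so the space cannot be recovered.
with_space = vep_style_escape("my gene")
with_underscore = vep_style_escape("my_gene")
```

Because both inputs end up as the same string, no decoder can tell which one was originally written.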

@pcingola (snpEff author) and I discussed this and we now have a near exactly matching format specification in both snpEff and VEP; we differ slightly on field order and the default INFO key used.

Regarding inserting structured data; yes, in theory this would be nice and would probably not require too much pain to code on the data writing side. My concerns are:

spaces : we can continue converting to underscore, but this is lossy as there are other places where underscore actually means underscore
base64 : I don't like this. It makes it splurgy and ungreppable. At least with the current delimited string you can quickly grep for e.g. missense
splurge : JSON encoding necessarily includes each hash key each and every row, often more than once depending on how many objects might occur in a given array. In our current system the order of fields is defined in a header once, nicely saving space (while making it illegible to human readers :-)). And the less said about XML splurge the better!
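For comparison, here is a sketch of reading the header-defined delimited layout: field names are declared once in the header and each record carries only values. The `Format: ` marker and the five field names are assumptions chosen for illustration (the CSQ string at the top of the thread has more columns than shown here); the values are copied from the first two entries of that string.

```python
def parse_csq_format(description):
    """Extract the ordered field names that follow 'Format: ' in a Description."""
    return description.split("Format: ", 1)[1].split("|")


def parse_csq(value, fields):
    """Split a CSQ value into one dict per comma-separated entry."""
    records = []
    for entry in value.split(","):
        records.append(dict(zip(fields, entry.split("|"))))
    return records


# Hypothetical, shortened header description.
description = "Consequence annotations. Format: Allele|Gene|Feature|Feature_type|Consequence"
csq = (
    "G|ENSG00000223972|ENST00000456328|Transcript|"
    "non_coding_transcript_exon_variant&non_coding_transcript_variant,"
    "G|ENSG00000227232|ENST00000541675|Transcript|downstream_gene_variant"
)
fields = parse_csq_format(description)
records = parse_csq(csq, fields)
# Multiple consequence terms are joined with '&' inside a single column.
first_terms = records[0]["Consequence"].split("&")
```

The trade-off is visible here: the record stays compact and greppable, but a reader needs the header to know what each column means.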

In any other case I'd suggest sidecar files, but I guess one of the advantages of VCF is its easy portability as a single file.

pcingola commented 9 years ago

-1 (actually -99999999999999999)

I think this is a terrible idea, akin to supporting structured data types inside a FASTA header.

Parsing a VCF file, which can now roughly be done with a few lines of code, would require loading JSON libraries and possibly decoding base64 data. It would make it a pain to create the simplest script. This additional complexity might not be a big deal for people with years of experience, but it is a non-trivial entrance barrier for newcomers.

Similarly to other popular genomic formats (such as FASTA, GTF, etc.), VCF is "human readable". This makes VCF files relatively easy to debug. Adding JSON encoded within the INFO field will complicate everything to no avail and make debugging almost impossible without using specialized tools. As an example, take a look at the proposed 'EFF' field posted by @lindenb and tell me if you could debug anything just by looking at it. It would be a pain, to say the least.

As Will just explained, for the case of variant annotations, there is already a standard we are using ( http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf ), so there is no need for this proposal.

VCF INFO fields are already complicated enough. Losing human readability is a major step backwards and (in my opinion) defeats one of the main VCF virtues.

If you want structured data types, I think that using GA4GH's data models would be a much better option.

lh3 commented 9 years ago

With VEP, SnpEff and potentially Annovar (I see Kai Wang in that PDF) on the same page, there doesn't seem to be a great need to officially advocate JSON in VCF. Users can always choose JSON if they prefer. The new spec allows spaces and special characters to be encoded in VCF.

droazen commented 9 years ago

I have to agree with @pcingola. -1 from me as well.

kaichop commented 9 years ago

Petr asked me to post a comment here (I am the developer of ANNOVAR). I think VCF should ideally remain a readable text format that can also accommodate simple operations by machine when sorted and indexed. JSON/XML and base64 binary information will destroy this purpose.

pd3 commented 9 years ago

@lindenb I think you scared everybody with the examples (including me!!) :-) However, I should say that I like the general idea of having a single way of constructing structured fields because that would allow general purpose tools (such as bcftools) to support querying and filtering on these fields. Right now such programs either have to support every possible encoding users come up with, or treat them as strings and allow filtering using substring matching or regular expressions only.

As there is currently no support for the idea, I am closing the issue as resolved. Thanks to everyone who contributed to the discussion!

lbergelson commented 3 years ago

I think we should revisit this. It's come up repeatedly that we need a clear mechanism to store transcript dependent functional annotations, and all the existing schemes are horrible. The longer we've gone without it, the better some sort of encoded json seems. We already specify a scheme for % encoding of reserved characters. Would that be sufficient to insert a json blob into an info field?

lbergelson commented 3 years ago

@yfarjoun @d-cameron @lindenb I've reopened this based on recent conversations. Since we now have a built in scheme (as of 4.3) to encode reserved characters it seems like this should be very doable now.

d-cameron commented 3 years ago

We already specify a scheme for % encoding of reserved characters. Would that be sufficient to insert a json blob into an info field?

Yes. Whether it gets nicely decoded into JSON or remains a percent-encoded mess very much depends on which VCF parsing library you are using. The biggest risk I see is usage of JSON/XML when we don't have widespread parsing library support for percent-encoding, as that means we're going to see a mix of unencoded, correctly encoded, and double-encoded data floating about.
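The double-encoding hazard is easy to reproduce. The sketch below uses `urllib.parse.quote` and `unquote` as stand-ins for whatever a VCF library would do; `quote` with `safe=""` escapes more characters than VCF strictly requires, which does not matter for the point being made.

```python
from urllib.parse import quote, unquote

raw = '{"gene":"SAMD11"}'

# A writer that encodes correctly, once.
encoded_once = quote(raw, safe="")

# A second tool that encodes an already-encoded value: every '%' becomes '%25'.
encoded_twice = quote(encoded_once, safe="")

# A reader that decodes once only recovers the original from the first case.
from_once = unquote(encoded_once)
from_twice = unquote(encoded_twice)
```

A reader cannot safely "decode until nothing changes" either, because a legitimately encoded literal `%` in the data would then be decoded one time too many.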

It sounds like you're reopening this as a request for transcript-dependent functional annotations. Better to reclose this issue, mark it as resolved by VCFv4.3 percent encoding, and open a new issue to discuss the actual content of the functional annotations.

d-cameron commented 3 years ago

Or turn this issue into a discussion on whether we want to add first-class support for JSON or XML through the addition of relevant INFO Type fields. If we're adding XML then we'd also want to add support for defining the schema in the INFO/FORMAT header (percent-encoded XSD).

tskir commented 3 years ago

I don't believe the question is whether representing this in VCF is technically possible, but whether we should do it. I read the entire discussion and have to side with the opinion expressed by @pcingola and supported by @LeeTL1220 @willmclaren @droazen @lh3 @kaichop. I strongly believe that including JSON/XML/base64 blobs into VCF is not a good idea. I also think there are a couple of alternative solutions to discuss.

1. Sidecar JSON with complex accompanying information

@willmclaren was the first one to mention this. Although he raised concerns about portability, I don't think having two files instead of one is a problem at all. @pcingola also mentioned GA4GH models, which in this case I suppose would be the Variant Annotation standard. We can look into its current state, see if it supports the things people need, and propose modifications if necessary. Alternatively it's possible to have XML or even TSV as sidecar file formats, but I think it's better to stick with JSON given its widespread use in GA4GH standards.

2. Limited flexibility, human readable structured key-value data

I think the current specification for variant annotations in VCF, mentioned by @pcingola, is reasonable. However, I also agree with @pd3 that it would be nice to be able to query for those fields using e.g. bcftools. So maybe we could think about ways to represent limited structured key-value data inside annotations to make it possible. This notation, if we decide to go this way, would have to prioritise readability over flexibility. Definitely not a general schema, no arbitrary nesting depth, nothing which would require significant effort to parse or read. Perhaps something along the lines of a compact notation suggested by @lindenb: CSQ=(genes((name "my gene1") (chromStart 9)) ((name "my gene2") (chromStart 10))).

But ultimately, I think VCF in its current form is neither suitable nor intended for representing arbitrarily complex annotation data. I view its role as an intermediate/downstream format that makes a compromise between being human readable and machine readable.

d-cameron commented 3 years ago

Honestly, I don't think gene or transcript level annotations of any sort belong in VCF. VCF has no concept of genes or transcripts.

A set of record-level annotations of the functional consequences of the mutations in each record doesn't actually give you the overall consequence of that set of mutations. The simple example is two frameshift indels that cancel each other out, with the net result of a one or two amino acid change. A more complex example is a functional gene fusion that involves multiple breakpoints.

The reported functional impact of a MNV should not depend on whether it was reported as two phased SNVs in adjacent positions or in a single MNV record - they both result in the same sequence.
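The frameshift example can be put in numbers. This sketch assumes both indels sit on the same haplotype within the same coding exon and are given as simple REF/ALT allele pairs; the alleles shown are made up for illustration.

```python
def net_frame_shift(variants):
    """Return the combined length change of (ref, alt) pairs, modulo the codon size."""
    return sum(len(alt) - len(ref) for ref, alt in variants) % 3


deletion = ("CA", "C")   # removes one base
insertion = ("G", "GT")  # adds one base

alone_del = net_frame_shift([deletion])
alone_ins = net_frame_shift([insertion])
together = net_frame_shift([deletion, insertion])
```

Each record annotated on its own has a non-zero shift and would be called a frameshift, while the pair together restores the reading frame, which is the information a per-record annotation cannot carry.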

jkbonfield commented 3 years ago

At the GA4GH discussion someone raised the thought about linkage to external files.

That sounds like a far better solution. The ability to be able to specify related files somewhere (header?) along with a per record vocabulary for foreign keys into those files. Eg it could just be an accession number, gene name, transcript name, whatever. It avoids the whole mess of a format within a format, but does directly address the issue of linking variants to genes. It's also a really small change compared to encoding JSON blobs etc. (Just say no to XML horror!)

The counter argument of course is how are we going to use such a link? It's no good knowing this variant is linked to the BRCA2 gene if we cannot index and search VCF by these links. If we can't do that, how do we pull out variants covering that gene? By chr/pos of course. In which case we've essentially already got the link because we have knowledge somewhere else on which gene/transcript covers which loci. I know it's more complex when it comes to structural variants of course, and it'd be nice to look at a single record and see all the myriad of transcripts affected, but again is that best done via identifier or purely by chr/pos?
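A sketch of the foreign-key idea, to show how little the VCF side would need to carry. Everything here is hypothetical: the INFO key `TXID`, the layout of the external file (represented as a dict keyed by transcript ID), and the helper names. The sample values come from the VEP example earlier in the thread.

```python
def info_to_dict(info):
    """Split an INFO column into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def linked_annotations(vcf_line, external, key="TXID"):
    """Look up the external records referenced by a (hypothetical) foreign-key INFO field."""
    info = info_to_dict(vcf_line.rstrip("\n").split("\t")[7])
    ids = info.get(key, "")
    if ids is True or not ids:
        return []
    return [external[i] for i in ids.split(",") if i in external]


# Stand-in for the linked file, keyed by transcript ID.
external = {"ENST00000420190": {"gene_symbol": "SAMD11", "biotype": "protein_coding"}}

linked = "1\t878331\trs148327885\tC\tT\t.\t.\tTXID=ENST00000420190"
unlinked = "1\t878331\trs148327885\tC\tT\t.\t.\t."
```

The lookup only works while the VCF and the external file stay in sync, so any tool that filters or rewrites one without the other breaks the link.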

tfenne commented 3 years ago

Sidecar files sound like a nightmare to me. I would take the current state of affairs (i.e. different tool authors wedging their own flavor of lightly-structured data into INFO fields) over sidecar files. Unless the vast majority of VCF tools out there support the sidecar we're going to end up with mismatches where one file is filtered/updated and the other isn't, and it's going to be a huge mess. At least with having data stuffed into INFO fields, when you filter or transform a VCF that information gets carried along.

I also would politely suggest that inventing a new form for structured data embedding in VCF is a bad idea. The format suggested by @tskir looks to have the same functionality as JSON, but with a different format (@tskir are you a LISP programmer by choice?). If something is to be adopted into the standard, I would suggest that JSON makes the most sense because there are robust JSON parsers available for every language out there, it's a well understood format and it meets the needs.

My 2c is that the choice should really be between a) doing nothing and b) adopting some formal support for INFO (and per-sample) fields with a Type=JSON.

d-cameron commented 3 years ago

Not that anyone except me actually uses it, but VCF already refers to an external file in the ##assembly header*. Technically ##contig as well but that's an input file so is unproblematic.

*Even then, I'm not spec compliant since I use an external bam as that allows me to report the supporting assembly contig alignments. The specs explicitly say fasta.

d-cameron commented 1 year ago

Closing since VCF field encoding allows this to be done but we have no intention to add HTTP-style Content-Type metainformation.