labsquare / cutevariant

A standalone and free application to explore genetics variations from VCF file
https://cutevariant.labsquare.org/
GNU General Public License v3.0
102 stars 21 forks source link

Types of fields #182

Closed ysard closed 3 years ago

ysard commented 3 years ago

In sample_has_variant:

Can you review these types so I will fix them in parsers ?

ysard commented 3 years ago

After reading the VCF 4.2 spec, it appears that we take only the type from INFO headers:

INFO=<ID=ID,Number=number,Type=type,

But ! The Number field gives the number of expected elements of Type type.

So, from simple examples, to REAL pain in the ass:

##FORMAT=<ID=AF,Number=1,Type=Float
=> 1 float value
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
=> 1 integer value
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
=> single number (for a string ><) but in fact its 1 value per genotype; separated by | or /
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
=> 1 value for each possible allele
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes
=> 1 value per alternate allele

When we put data in AD field of INTEGER type we put in fact string "64,0" (comma separated integers) in this field. And the best part is that SQLite accepts it thanks to column affinity and auto typing: https://www.sqlite.org/datatype3.html

select ad, typeof(ad), gt, typeof(gt), gq, typeof(gq), foxog, typeof(foxog), qss, typeof(qss) from sample_has_variant;

ad  typeof(ad)  gt  typeof(gt)  gq  typeof(gq)  foxog   typeof(foxog)   qss typeof(qss)
"64,0"  "text"  "0" "integer"   ""  "text"  ""  "null"  "2342,0"    "text"
"137,1" "text"  "0" "integer"   ""  "text"  "0.0"   "real"  "4973,7"    "text"

Moreover few of the types we give to the RDBMS are correct (if it is none): Accepted types are in UPPER case (ours not) and are the following ones: TEXT NUMERIC INTEGER REAL BLOB

We give: int, str, float, bool Rules of determination of Column Affinity (see the doc) may not work at all with such types.

Summary: Wrong types may be a secondary problem for now but what can we do with composite fields cited above ? Querying them should be VERY inefficient... (Since they are all not indexed TEXT fields by default).

I remove this issue from Mendel, this is currently a more general question than a bug.

dridk commented 3 years ago

Yes, gt must be an integer.. About composite value, I though about creating many fields :

SpliceAI = 3,2,5,6   

become :

SpiceAI_1 = 3 
SplceAI_2 = 2 
SpliceAI_3 = 5
SpliceAI_4 = 6

Or, better solution, we can create a custom type : https://docs.python.org/2.5/lib/node347.html

 SpliceAI = 3,2,5,6   ==>  tuple(3,2,5,6)  

FROM VQL :

SELECT spliceAI[0]  FROM variants 

I propose to test the second solution !

dridk commented 3 years ago

todo :

VcfReader.get_fields should returns :

{
"name": "spliceai"
"type": "tuple"
}

VcfReader.get_variants should returns :

{
"chr": "chr1",
"spliceai": (1,4,5,43)
}

From columns selector :

From VQL :

SELECT spliceai[0]  FROM variants WHERE spliceai[0] == 1 ! <===== NOT SURE IT IS POSSIBLE TO DO THE WHERE CLAUSE

dridk commented 3 years ago

This is an example of spliceAI output :

https://github.com/Illumina/SpliceAI/blob/master/examples/output.vcf

As you can see.. here Number = "." ... and value are separated by "|"

##INFO=<ID=SpliceAI,Number=.,Type=String,Description="SpliceAIv1.3.1 variant annotation. These include delta scores (DS) and delta positions (DP) for acceptor gain (AG), acceptor loss (AL), donor gain (DG), and donor loss (DL). Format: ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   25000   .   A   C,G,T   .   .   .
2   152389953   .   T   A,C,G   .   .   SpliceAI=A|NEB|0.01|0.00|0.00|0.74|43|3|-26|3,C|NEB|0.04|0.00|0.00|0.71|43|3|-26|3,G|NEB|0.03|0.00|0.00|0.75|43|3|-26|3
2   179415988   .   C   CA  .   .   SpliceAI=CA|TTN|0.07|1.00|0.00|0.00|-7|-1|35|-29
2   179446218   .   ATACT   A   .   .   SpliceAI=A|TTN|0.00|0.00|0.02|0.91|-7|34|-11|8
2   179446218   .   ATACT   AT,ATA  .   .   SpliceAI=AT|TTN|.|.|.|.|.|.|.|.,ATA|TTN|.|.|.|.|.|.|.|.
2   179642185   .   G   A   .   .   SpliceAI=A|TTN|0.00|0.00|0.64|0.55|2|38|2|-38
19  38958362    .   C   T   .   .   SpliceAI=T|RYR1|0.00|0.00|0.91|0.08|-28|-46|-2|-31
21  47406854    .   CCA C   .   .   SpliceAI=C|COL6A1|0.04|0.98|0.00|0.00|-38|4|38|4
21  47406856    .   A   AT  .   .   SpliceAI=AT|COL6A1|0.03|0.99|0.00|0.00|-40|2|36|2
X   129274636   .   A   C,G,T   .   .   SpliceAI=C|AIFM1|0.00|0.18|0.00|0.00|-28|-44|-44|45,G|AIFM1|0.00|0.17|0.00|0.00|-8|-44|-44|45,T|AIFM1|0.00|0.19|0.00|0.00|-2|-44|-44|45
ysard commented 3 years ago

spliceai[0]

Beware with this kind of "function" it is already a pain in the ass to handle in VQL with tuples, custom field types in filter editor, etc. It should be already fixed/removed for WORDSETS.

I didn't find SpliceAI field in VCF norm, I suppose it's a kind of annotation field from another technology/soft. So it must be put in annotations table. And as you see, there are custom fields not currently expected in annotations table.

Given that we have an annotation parser class for SNPEff and VEP, we should have one for SpliceAI, and thus put these data in a dedicated SQL table.

Lot of work to support this...

dridk commented 3 years ago

For multi-value, I propose to create one sql column per value with a specific name and abstract the query from VQL :

VCF: Field =3,2,3

SQL: Field0 = 3 Field1 = 2 Field__2 = 3

VQL: Field[0] will be replaced by Field__0 internally

this method makes possible the following statements:

Select Field[0] from variant where Field[1] > 4