Closed ysard closed 3 years ago
After reading the VCF 4.2 spec, it appears that we take only the type from INFO headers:
INFO=<ID=ID,Number=number,Type=type,
But ! The Number field gives the number of expected elements of Type type.
So, from simple examples, to REAL pain in the ass:
##FORMAT=<ID=AF,Number=1,Type=Float
=> 1 float value
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
=> 1 integer value
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
=> single number (for a string ><) but in fact its 1 value per genotype; separated by | or /
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
=> 1 value for each possible allele
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes
=> 1 value per alternate allele
When we put data in AD field of INTEGER type we put in fact string "64,0"
(comma separated integers) in this field.
And the best part is that SQLite accepts it thanks to column affinity and auto typing: https://www.sqlite.org/datatype3.html
select ad, typeof(ad), gt, typeof(gt), gq, typeof(gq), foxog, typeof(foxog), qss, typeof(qss) from sample_has_variant;
ad typeof(ad) gt typeof(gt) gq typeof(gq) foxog typeof(foxog) qss typeof(qss)
"64,0" "text" "0" "integer" "" "text" "" "null" "2342,0" "text"
"137,1" "text" "0" "integer" "" "text" "0.0" "real" "4973,7" "text"
Moreover few of the types we give to the RDBMS are correct (if it is none): Accepted types are in UPPER case (ours not) and are the following ones: TEXT NUMERIC INTEGER REAL BLOB
We give: int, str, float, bool
Rules of determination of Column Affinity (see the doc) may not work at all with such types.
Summary: Wrong types may be a secondary problem for now but what can we do with composite fields cited above ? Querying them should be VERY inefficient... (Since they are all not indexed TEXT fields by default).
I remove this issue from Mendel, this is currently a more general question than a bug.
Yes, gt must be an integer.. About composite value, I though about creating many fields :
SpliceAI = 3,2,5,6
become :
SpiceAI_1 = 3
SplceAI_2 = 2
SpliceAI_3 = 5
SpliceAI_4 = 6
Or, better solution, we can create a custom type : https://docs.python.org/2.5/lib/node347.html
SpliceAI = 3,2,5,6 ==> tuple(3,2,5,6)
FROM VQL :
SELECT spliceAI[0] FROM variants
I propose to test the second solution !
VcfReader.get_fields should returns :
{
"name": "spliceai"
"type": "tuple"
}
VcfReader.get_variants should returns :
{
"chr": "chr1",
"spliceai": (1,4,5,43)
}
From columns selector :
From VQL :
SELECT spliceai[0] FROM variants WHERE spliceai[0] == 1 ! <===== NOT SURE IT IS POSSIBLE TO DO THE WHERE CLAUSE
This is an example of spliceAI output :
https://github.com/Illumina/SpliceAI/blob/master/examples/output.vcf
As you can see.. here Number = "." ... and value are separated by "|"
##INFO=<ID=SpliceAI,Number=.,Type=String,Description="SpliceAIv1.3.1 variant annotation. These include delta scores (DS) and delta positions (DP) for acceptor gain (AG), acceptor loss (AL), donor gain (DG), and donor loss (DL). Format: ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 25000 . A C,G,T . . .
2 152389953 . T A,C,G . . SpliceAI=A|NEB|0.01|0.00|0.00|0.74|43|3|-26|3,C|NEB|0.04|0.00|0.00|0.71|43|3|-26|3,G|NEB|0.03|0.00|0.00|0.75|43|3|-26|3
2 179415988 . C CA . . SpliceAI=CA|TTN|0.07|1.00|0.00|0.00|-7|-1|35|-29
2 179446218 . ATACT A . . SpliceAI=A|TTN|0.00|0.00|0.02|0.91|-7|34|-11|8
2 179446218 . ATACT AT,ATA . . SpliceAI=AT|TTN|.|.|.|.|.|.|.|.,ATA|TTN|.|.|.|.|.|.|.|.
2 179642185 . G A . . SpliceAI=A|TTN|0.00|0.00|0.64|0.55|2|38|2|-38
19 38958362 . C T . . SpliceAI=T|RYR1|0.00|0.00|0.91|0.08|-28|-46|-2|-31
21 47406854 . CCA C . . SpliceAI=C|COL6A1|0.04|0.98|0.00|0.00|-38|4|38|4
21 47406856 . A AT . . SpliceAI=AT|COL6A1|0.03|0.99|0.00|0.00|-40|2|36|2
X 129274636 . A C,G,T . . SpliceAI=C|AIFM1|0.00|0.18|0.00|0.00|-28|-44|-44|45,G|AIFM1|0.00|0.17|0.00|0.00|-8|-44|-44|45,T|AIFM1|0.00|0.19|0.00|0.00|-2|-44|-44|45
spliceai[0]
Beware with this kind of "function" it is already a pain in the ass to handle in VQL with tuples, custom field types in filter editor, etc. It should be already fixed/removed for WORDSETS.
I didn't find SpliceAI field in VCF norm, I suppose it's a kind of annotation field from another technology/soft. So it must be put in annotations table. And as you see, there are custom fields not currently expected in annotations table.
Given that we have an annotation parser class for SNPEff and VEP, we should have one for SpliceAI, and thus put these data in a dedicated SQL table.
Lot of work to support this...
For multi-value, I propose to create one sql column per value with a specific name and abstract the query from VQL :
VCF: Field =3,2,3
SQL: Field0 = 3 Field1 = 2 Field__2 = 3
VQL: Field[0] will be replaced by Field__0 internally
this method makes possible the following statements:
Select Field[0] from variant where Field[1] > 4
In
sample_has_variant
:Can you review these types so I will fix them in parsers ?