Closed dahaigui closed 11 months ago
Maybe the Scaffold is too short?
That's the most likely cause - sorry for the trouble!
I've been usually first filtering the input vcf by removing the positions within window_size//2 of the scaffold borders. This are just a handful in my experience since I've been using chromosome-level assemblies.
However, if you really want to score those positions, you would have to modify the script. One possibility is to pad with [PAD] token, but this is not usually used during training (at least in the official script) so it might confuse the model. Haven't tried though. Another approach is to modify the script so that in scaffold edge cases it does not take a symmetrical window around a position, but a skewed one so it doesn't go overboard.
I just got the result file 20S0008_clean.snp_filtered_B9008.splited.vcf.parquet
, how can I view its contents?
In Python you can do:
import pandas as pd
scores = pd.read_parquet("20S0008_clean.snp_filtered_B9008.splited.vcf.parquet")
Got it! Thank you very much!
Is this result file sorted by the position in the vcf file I provided? I have annotated my vcf file using snpEFF and got some stop_gained variants, but these variants have bigger scores in gpn's variant effect prediction results, while some intergenic_region variants have smaller scores, as shown below.
Stop_gained, the result of snpEFF is A01 2901768 . G A 211.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.874;DP=14;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=15.12;ReadPosRankSum=0.125;SOR=1.179;ANN=A|stop_gained|HIGH|A01p05970.1_BraCCA|A01p05970.1_BraCCA|transcript|A01p05970.1_BraCCA.m1|protein_coding|2/4|c.697C>T|p.Gln233*|697/1122|697/1122|233/373||,A|upstream_gene_variant|MODIFIER|A01p05980.1_BraCCA|A01p05980.1_BraCCA|transcript|A01p05980.1_BraCCA.m1|protein_coding||c.-1868G>A|||||1868|,A|upstream_gene_variant|MODIFIER|A01p05990.1_BraCCA|A01p05990.1_BraCCA|transcript|A01p05990.1_BraCCA.m1|protein_coding||c.-3842G>A|||||3842|,A|downstream_gene_variant|MODIFIER|A01p05960.1_BraCCA|A01p05960.1_BraCCA|transcript|A01p05960.1_BraCCA.m1|protein_coding||c.*945G>A|||||945|;LOF=(A01p05970.1_BraCCA|A01p05970.1_BraCCA|1|1.00);NMD=(A01p05970.1_BraCCA|A01p05970.1_BraCCA|1|1.00) GT:AD:DP:GQ:PL 0/1:6,8:14:99:219,0,136
, while result of gpn vep score is 0.44310555.
Intergenic_region, the result of snpEFF is A01 12314027 . A G 57.64 MQ40;QD2;SOR3 AC=1;AF=0.500;AN=2;BaseQRankSum=0.317;DP=161;ExcessHet=3.0103;FS=32.638;MLEAC=1;MLEAF=0.500;MQ=28.85;MQRankSum=-1.815e+00;QD=1.34;ReadPosRankSum=2.57;SOR=3.912;ANN=G|intergenic_region|MODIFIER|A01p22380.1_BraCCA-A01p22390.1_BraCCA|A01p22380.1_BraCCA-A01p22390.1_BraCCA|intergenic_region|A01p22380.1_BraCCA-A01p22390.1_BraCCA|||n.12314027A>G|||||| GT:AD:DP:GQ:PL 0/1:36,7:66:65:65,0,1202
, while result of gpn vep score is -4.8882227.
Is this a problem with the way it's sorted, or is there something else going on?
The result file should have the same order as the input file.
Indeed it's surprising the stop gained has a higher score than intergenic region. Is this a general pattern? What about missense, for example?
I'd be happy to try help more either here or in private.
The missense is A01 395168 . C T 125.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-5.300e-02;DP=19;ExcessHet=3.0103;FS=4.561;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=6.98;ReadPosRankSum=-8.800e-02;SOR=1.170;ANN=T|missense_variant|MODERATE|A01p00810.1_BraCCA|A01p00810.1_BraCCA|transcript|A01p00810.1_BraCCA.m1|protein_coding|15/15|c.1990G>A|p.Val664Ile|1990/2082|1990/2082|664/693||WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS,T|downstream_gene_variant|MODIFIER|A01p00800.1_BraCCA|A01p00800.1_BraCCA|transcript|A01p00800.1_BraCCA.m1|protein_coding||c.*4357C>T|||||4357|,T|downstream_gene_variant|MODIFIER|A01p00820.1_BraCCA|A01p00820.1_BraCCA|transcript|A01p00820.1_BraCCA.m1|protein_coding||c.*3853G>A|||||3853| GT:AD:DP:GQ:PL 0/1:13,5:19:99:133,0,441
,the score of the position is 2.5295594
A good way to assess whether this is a general pattern might be to gather at least n=30 variants from each type and plot their mean scores (or their distribution).
I just calculated the mean scores of different variant types, the result is
Thanks. Doesn't seem like this model has learned anything. I could provide some guidance with the following information:
I'd happy to take this into a private conversation. Email me at dahaiguilalala@163.com!
When I running Variant effect prediction, I had the following problem.
How should I address this problem?