songlab-cal / gpn

Genomic Pre-trained Network
https://doi.org/10.1073/pnas.2311219120
MIT License
186 stars 24 forks source link

IndexError: tuple index out of range #16

Closed dahaigui closed 11 months ago

dahaigui commented 11 months ago

When I running Variant effect prediction, I had the following problem.

image

How should I address this problem?

dahaigui commented 11 months ago

Maybe the Scaffold is too short?

gonzalobenegas commented 11 months ago

That's the most likely cause - sorry for the trouble!

I've been usually first filtering the input vcf by removing the positions within window_size//2 of the scaffold borders. This are just a handful in my experience since I've been using chromosome-level assemblies.

However, if you really want to score those positions, you would have to modify the script. One possibility is to pad with [PAD] token, but this is not usually used during training (at least in the official script) so it might confuse the model. Haven't tried though. Another approach is to modify the script so that in scaffold edge cases it does not take a symmetrical window around a position, but a skewed one so it doesn't go overboard.

dahaigui commented 11 months ago

I just got the result file 20S0008_clean.snp_filtered_B9008.splited.vcf.parquet, how can I view its contents?

gonzalobenegas commented 11 months ago

In Python you can do:

import pandas as pd

scores = pd.read_parquet("20S0008_clean.snp_filtered_B9008.splited.vcf.parquet")
dahaigui commented 11 months ago

Got it! Thank you very much!

dahaigui commented 11 months ago

Is this result file sorted by the position in the vcf file I provided? I have annotated my vcf file using snpEFF and got some stop_gained variants, but these variants have bigger scores in gpn's variant effect prediction results, while some intergenic_region variants have smaller scores, as shown below.

Stop_gained, the result of snpEFF is A01 2901768 . G A 211.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.874;DP=14;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=15.12;ReadPosRankSum=0.125;SOR=1.179;ANN=A|stop_gained|HIGH|A01p05970.1_BraCCA|A01p05970.1_BraCCA|transcript|A01p05970.1_BraCCA.m1|protein_coding|2/4|c.697C>T|p.Gln233*|697/1122|697/1122|233/373||,A|upstream_gene_variant|MODIFIER|A01p05980.1_BraCCA|A01p05980.1_BraCCA|transcript|A01p05980.1_BraCCA.m1|protein_coding||c.-1868G>A|||||1868|,A|upstream_gene_variant|MODIFIER|A01p05990.1_BraCCA|A01p05990.1_BraCCA|transcript|A01p05990.1_BraCCA.m1|protein_coding||c.-3842G>A|||||3842|,A|downstream_gene_variant|MODIFIER|A01p05960.1_BraCCA|A01p05960.1_BraCCA|transcript|A01p05960.1_BraCCA.m1|protein_coding||c.*945G>A|||||945|;LOF=(A01p05970.1_BraCCA|A01p05970.1_BraCCA|1|1.00);NMD=(A01p05970.1_BraCCA|A01p05970.1_BraCCA|1|1.00) GT:AD:DP:GQ:PL 0/1:6,8:14:99:219,0,136, while result of gpn vep score is 0.44310555.

Intergenic_region, the result of snpEFF is A01 12314027 . A G 57.64 MQ40;QD2;SOR3 AC=1;AF=0.500;AN=2;BaseQRankSum=0.317;DP=161;ExcessHet=3.0103;FS=32.638;MLEAC=1;MLEAF=0.500;MQ=28.85;MQRankSum=-1.815e+00;QD=1.34;ReadPosRankSum=2.57;SOR=3.912;ANN=G|intergenic_region|MODIFIER|A01p22380.1_BraCCA-A01p22390.1_BraCCA|A01p22380.1_BraCCA-A01p22390.1_BraCCA|intergenic_region|A01p22380.1_BraCCA-A01p22390.1_BraCCA|||n.12314027A>G|||||| GT:AD:DP:GQ:PL 0/1:36,7:66:65:65,0,1202, while result of gpn vep score is -4.8882227.

Is this a problem with the way it's sorted, or is there something else going on?

gonzalobenegas commented 11 months ago

The result file should have the same order as the input file.

Indeed it's surprising the stop gained has a higher score than intergenic region. Is this a general pattern? What about missense, for example?

I'd be happy to try help more either here or in private.

dahaigui commented 11 months ago

The missense is A01 395168 . C T 125.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-5.300e-02;DP=19;ExcessHet=3.0103;FS=4.561;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=6.98;ReadPosRankSum=-8.800e-02;SOR=1.170;ANN=T|missense_variant|MODERATE|A01p00810.1_BraCCA|A01p00810.1_BraCCA|transcript|A01p00810.1_BraCCA.m1|protein_coding|15/15|c.1990G>A|p.Val664Ile|1990/2082|1990/2082|664/693||WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS,T|downstream_gene_variant|MODIFIER|A01p00800.1_BraCCA|A01p00800.1_BraCCA|transcript|A01p00800.1_BraCCA.m1|protein_coding||c.*4357C>T|||||4357|,T|downstream_gene_variant|MODIFIER|A01p00820.1_BraCCA|A01p00820.1_BraCCA|transcript|A01p00820.1_BraCCA.m1|protein_coding||c.*3853G>A|||||3853| GT:AD:DP:GQ:PL 0/1:13,5:19:99:133,0,441 ,the score of the position is 2.5295594

gonzalobenegas commented 11 months ago

A good way to assess whether this is a general pattern might be to gather at least n=30 variants from each type and plot their mean scores (or their distribution).

dahaigui commented 11 months ago

I just calculated the mean scores of different variant types, the result is image

gonzalobenegas commented 11 months ago

Thanks. Doesn't seem like this model has learned anything. I could provide some guidance with the following information:

dahaigui commented 11 months ago

I'd happy to take this into a private conversation. Email me at dahaiguilalala@163.com!