tkzeng / Pangolin

Pangolin is a deep-learning method for predicting splice site strengths.
GNU General Public License v3.0

Problems predicting positive scores #3

Closed: pablo-baeza closed this issue 2 years ago

pablo-baeza commented 2 years ago

Hi Tony,

Apologies for bothering you again with this. I reran the latest version of Pangolin on an exon I had already analyzed a while back and, interestingly, I get very different results.

These are the results I obtained from running Pangolin with default parameters in early October:

##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=Pangolin,Number=.,Type=String,Description="Pangolin splice scores. Format: gene|pos:score_change|pos:score_change|...">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
10  90770511    A2C A   C   .   .   Pangolin=ENSG00000026103.22_1|-2:0.0|-1:-0.12
10  90770511    A2G A   G   .   .   Pangolin=ENSG00000026103.22_1|-3:0.0|-1:-0.14
10  90770511    A2T A   T   .   .   Pangolin=ENSG00000026103.22_1|-1:0.18|-7:-0.0
10  90770512    T3A T   A   .   .   Pangolin=ENSG00000026103.22_1|-8:0.0|-2:-0.05
10  90770512    T3C T   C   .   .   Pangolin=ENSG00000026103.22_1|-8:0.0|-2:-0.12
10  90770512    T3G T   G   .   .   Pangolin=ENSG00000026103.22_1|1:0.0|-2:-0.11
10  90770513    C4A C   A   .   .   Pangolin=ENSG00000026103.22_1|4:0.0|-3:-0.15
10  90770513    C4G C   G   .   .   Pangolin=ENSG00000026103.22_1|4:0.0|-3:-0.1
10  90770513    C4T C   T   .   .   Pangolin=ENSG00000026103.22_1|-50:-0.0|-3:-0.18
10  90770514    C5A C   A   .   .   Pangolin=ENSG00000026103.22_1|-4:0.08|3:-0.0
10  90770514    C5G C   G   .   .   Pangolin=ENSG00000026103.22_1|-4:0.05|3:-0.0
10  90770514    C5T C   T   .   .   Pangolin=ENSG00000026103.22_1|-18:0.0|-4:-0.26
10  90770515    A6C A   C   .   .   Pangolin=ENSG00000026103.22_1|-5:0.02|-5:-0.01
10  90770515    A6G A   G   .   .   Pangolin=ENSG00000026103.22_1|-5:0.18|2:-0.0
10  90770515    A6T A   T   .   .   Pangolin=ENSG00000026103.22_1|13:0.0|-5:-0.06
10  90770516    G7A G   A   .   .   Pangolin=ENSG00000026103.22_1|2:0.0|-6:-0.15
10  90770516    G7C G   C   .   .   Pangolin=ENSG00000026103.22_1|-8:0.0|-6:-0.06
10  90770516    G7T G   T   .   .   Pangolin=ENSG00000026103.22_1|17:0.0|-6:-0.19
10  90770517    A8C A   C   .   .   Pangolin=ENSG00000026103.22_1|-13:0.0|-7:-0.04
10  90770517    A8G A   G   .   .   Pangolin=ENSG00000026103.22_1|0:0.0|-7:-0.19
10  90770517    A8T A   T   .   .   Pangolin=ENSG00000026103.22_1|44:0.0|-7:-0.09
10  90770518    T9A T   A   .   .   Pangolin=ENSG00000026103.22_1|-8:0.02|-8:-0.0
10  90770518    T9C T   C   .   .   Pangolin=ENSG00000026103.22_1|-8:0.06|-1:-0.0
10  90770518    T9G T   G   .   .   Pangolin=ENSG00000026103.22_1|-8:0.03|-8:-0.01
10  90770519    C10A    C   A   .   .   Pangolin=ENSG00000026103.22_1|-11:0.0|-9:-0.07
10  90770519    C10G    C   G   .   .   Pangolin=ENSG00000026103.22_1|-9:0.06|-15:-0.0
10  90770519    C10T    C   T   .   .   Pangolin=ENSG00000026103.22_1|-11:0.0|-9:-0.07
10  90770520    T11A    T   A   .   .   Pangolin=ENSG00000026103.22_1|-10:0.23|41:-0.0
10  90770520    T11C    T   C   .   .   Pangolin=ENSG00000026103.22_1|-10:0.28|41:-0.0
10  90770520    T11G    T   G   .   .   Pangolin=ENSG00000026103.22_1|-10:0.4|41:-0.0

This is what I obtain after running pangolin on the same exon today:

##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=Pangolin,Number=.,Type=String,Description="Pangolin splice scores. Format: gene|pos:score_change|pos:score_change|...">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr10   90770511    A2C A   C   .   .   Pangolin=ENSG00000026103.23_1|22:0.0|-1:-0.14|Warnings:
chr10   90770511    A2G A   G   .   .   Pangolin=ENSG00000026103.23_1|22:0.0|-1:-0.1|Warnings:
chr10   90770511    A2T A   T   .   .   Pangolin=ENSG00000026103.23_1|6:0.01|-50:0.0|Warnings:
chr10   90770512    T3A T   A   .   .   Pangolin=ENSG00000026103.23_1|6:0.0|-2:-0.05|Warnings:
chr10   90770512    T3C T   C   .   .   Pangolin=ENSG00000026103.23_1|21:0.0|-2:-0.12|Warnings:
chr10   90770512    T3G T   G   .   .   Pangolin=ENSG00000026103.23_1|1:0.0|-2:-0.12|Warnings:
chr10   90770513    C4A C   A   .   .   Pangolin=ENSG00000026103.23_1|4:0.0|-3:-0.17|Warnings:
chr10   90770513    C4G C   G   .   .   Pangolin=ENSG00000026103.23_1|11:0.0|-3:-0.09|Warnings:
chr10   90770513    C4T C   T   .   .   Pangolin=ENSG00000026103.23_1|-16:0.0|-3:-0.21|Warnings:
chr10   90770514    C5A C   A   .   .   Pangolin=ENSG00000026103.23_1|4:0.0|-50:0.0|Warnings:
chr10   90770514    C5G C   G   .   .   Pangolin=ENSG00000026103.23_1|10:0.0|-50:0.0|Warnings:
chr10   90770514    C5T C   T   .   .   Pangolin=ENSG00000026103.23_1|-37:0.0|-4:-0.27|Warnings:
chr10   90770515    A6C A   C   .   .   Pangolin=ENSG00000026103.23_1|18:0.0|-50:0.0|Warnings:
chr10   90770515    A6G A   G   .   .   Pangolin=ENSG00000026103.23_1|-48:0.0|-50:0.0|Warnings:
chr10   90770515    A6T A   T   .   .   Pangolin=ENSG00000026103.23_1|18:0.0|-5:-0.0|Warnings:
chr10   90770516    G7A G   A   .   .   Pangolin=ENSG00000026103.23_1|45:0.0|-6:-0.17|Warnings:
chr10   90770516    G7C G   C   .   .   Pangolin=ENSG00000026103.23_1|17:0.0|-6:-0.05|Warnings:
chr10   90770516    G7T G   T   .   .   Pangolin=ENSG00000026103.23_1|17:0.0|-6:-0.21|Warnings:
chr10   90770517    A8C A   C   .   .   Pangolin=ENSG00000026103.23_1|7:0.0|-7:-0.02|Warnings:
chr10   90770517    A8G A   G   .   .   Pangolin=ENSG00000026103.23_1|27:0.0|-7:-0.17|Warnings:
chr10   90770517    A8T A   T   .   .   Pangolin=ENSG00000026103.23_1|7:0.0|-7:-0.09|Warnings:
chr10   90770518    T9A T   A   .   .   Pangolin=ENSG00000026103.23_1|43:0.0|-49:0.0|Warnings:
chr10   90770518    T9C T   C   .   .   Pangolin=ENSG00000026103.23_1|9:0.0|-50:0.0|Warnings:
chr10   90770518    T9G T   G   .   .   Pangolin=ENSG00000026103.23_1|0:0.0|-8:-0.02|Warnings:
chr10   90770519    C10A    C   A   .   .   Pangolin=ENSG00000026103.23_1|6:0.0|-9:-0.08|Warnings:
chr10   90770519    C10G    C   G   .   .   Pangolin=ENSG00000026103.23_1|3:0.0|-50:0.0|Warnings:
chr10   90770519    C10T    C   T   .   .   Pangolin=ENSG00000026103.23_1|5:0.0|-9:-0.08|Warnings:
chr10   90770520    T11A    T   A   .   .   Pangolin=ENSG00000026103.23_1|3:0.0|-49:0.0|Warnings:
chr10   90770520    T11C    T   C   .   .   Pangolin=ENSG00000026103.23_1|2:0.0|-50:0.0|Warnings:
chr10   90770520    T11G    T   G   .   .   Pangolin=ENSG00000026103.23_1|2:0.01|-49:0.0|Warnings:

Essentially, mutations that used to have a positive predicted effect (e.g. T11G) no longer do. This is troublesome because I have experimental mutagenesis data for this particular exon, and the 'old' Pangolin was amazing at predicting the effects of mutations in my dataset, including those with a positive score. However, positive scores now seem to default to 0 in most cases.
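
For anyone wanting to check this kind of change systematically, here is a minimal sketch that parses the Pangolin INFO value and flags variants whose positive score disappeared between two runs. It assumes a single gene per record, and that the first pos:score pair is the largest increase and the second the largest decrease. That ordering is an assumption based on how the outputs above look, not something the VCF header states. The names `parse_pangolin`, `lost_gains`, and the 0.05 threshold are hypothetical and only for illustration.

```python
from typing import Dict, List, NamedTuple


class PangolinScore(NamedTuple):
    gene: str
    gain_pos: int
    gain: float
    loss_pos: int
    loss: float


def parse_pangolin(info: str) -> PangolinScore:
    """Parse 'Pangolin=gene|pos:score|pos:score[|Warnings:...]'.

    Assumption: one gene per record; first pair = increase, second = decrease.
    """
    # Keep only the Pangolin value, in case other INFO keys follow after ';'.
    value = info.split("Pangolin=", 1)[-1].split(";", 1)[0]
    fields = value.split("|")
    gain_pos, gain = fields[1].split(":")
    loss_pos, loss = fields[2].split(":")
    return PangolinScore(fields[0], int(gain_pos), float(gain), int(loss_pos), float(loss))


def lost_gains(old: Dict[str, str], new: Dict[str, str], threshold: float = 0.05) -> List[str]:
    """Return variant IDs whose gain was >= threshold before and is below it now."""
    flagged = []
    for variant_id, old_info in old.items():
        if variant_id not in new:
            continue
        before = parse_pangolin(old_info)
        after = parse_pangolin(new[variant_id])
        if before.gain >= threshold and after.gain < threshold:
            flagged.append(variant_id)
    return flagged


# Three records copied from the two outputs above (October run vs. today's run).
old_run = {
    "A2C": "Pangolin=ENSG00000026103.22_1|-2:0.0|-1:-0.12",
    "A2T": "Pangolin=ENSG00000026103.22_1|-1:0.18|-7:-0.0",
    "T11G": "Pangolin=ENSG00000026103.22_1|-10:0.4|41:-0.0",
}
new_run = {
    "A2C": "Pangolin=ENSG00000026103.23_1|22:0.0|-1:-0.14|Warnings:",
    "A2T": "Pangolin=ENSG00000026103.23_1|6:0.01|-50:0.0|Warnings:",
    "T11G": "Pangolin=ENSG00000026103.23_1|2:0.01|-49:0.0|Warnings:",
}

print(lost_gains(old_run, new_run))
```

Keying the comparison on the ID column (A2C, T11G, ...) sidesteps the `10` vs `chr10` difference in the CHROM column between the two files.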

Do you know if something changed between October and now that could have affected this?

Thanks again for all the troubleshooting!

tkzeng commented 2 years ago

I changed the default value of one parameter, -m, from False to True. Here are the results with -m False.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr10   90770511        A2C     A       C       .       .       Pangolin=ENSG00000026103.22_1|22:0.0|-1:-0.14|Warnings:
chr10   90770511        A2G     A       G       .       .       Pangolin=ENSG00000026103.22_1|22:0.0|-1:-0.1|Warnings:
chr10   90770511        A2T     A       T       .       .       Pangolin=ENSG00000026103.22_1|-1:0.19|-7:-0.0|Warnings:
chr10   90770512        T3A     T       A       .       .       Pangolin=ENSG00000026103.22_1|6:0.0|-2:-0.05|Warnings:
chr10   90770512        T3C     T       C       .       .       Pangolin=ENSG00000026103.22_1|21:0.0|-2:-0.12|Warnings:
chr10   90770512        T3G     T       G       .       .       Pangolin=ENSG00000026103.22_1|1:0.0|-2:-0.12|Warnings:
chr10   90770513        C4A     C       A       .       .       Pangolin=ENSG00000026103.22_1|4:0.0|-3:-0.17|Warnings:
chr10   90770513        C4G     C       G       .       .       Pangolin=ENSG00000026103.22_1|11:0.0|-3:-0.09|Warnings:
chr10   90770513        C4T     C       T       .       .       Pangolin=ENSG00000026103.22_1|-16:0.0|-3:-0.21|Warnings:
chr10   90770514        C5A     C       A       .       .       Pangolin=ENSG00000026103.22_1|-4:0.09|3:-0.0|Warnings:
chr10   90770514        C5G     C       G       .       .       Pangolin=ENSG00000026103.22_1|-4:0.06|3:-0.0|Warnings:
chr10   90770514        C5T     C       T       .       .       Pangolin=ENSG00000026103.22_1|-37:0.0|-4:-0.27|Warnings:
chr10   90770515        A6C     A       C       .       .       Pangolin=ENSG00000026103.22_1|-5:0.04|2:-0.0|Warnings:
chr10   90770515        A6G     A       G       .       .       Pangolin=ENSG00000026103.22_1|-5:0.19|2:-0.0|Warnings:
chr10   90770515        A6T     A       T       .       .       Pangolin=ENSG00000026103.22_1|-5:0.04|9:-0.0|Warnings:
chr10   90770516        G7A     G       A       .       .       Pangolin=ENSG00000026103.22_1|45:0.0|-6:-0.17|Warnings:
chr10   90770516        G7C     G       C       .       .       Pangolin=ENSG00000026103.22_1|17:0.0|-6:-0.05|Warnings:
chr10   90770516        G7T     G       T       .       .       Pangolin=ENSG00000026103.22_1|17:0.0|-6:-0.21|Warnings:
chr10   90770517        A8C     A       C       .       .       Pangolin=ENSG00000026103.22_1|-7:0.02|-7:-0.02|Warnings:
chr10   90770517        A8G     A       G       .       .       Pangolin=ENSG00000026103.22_1|27:0.0|-7:-0.17|Warnings:
chr10   90770517        A8T     A       T       .       .       Pangolin=ENSG00000026103.22_1|7:0.0|-7:-0.09|Warnings:
chr10   90770518        T9A     T       A       .       .       Pangolin=ENSG00000026103.22_1|-8:0.02|6:-0.0|Warnings:
chr10   90770518        T9C     T       C       .       .       Pangolin=ENSG00000026103.22_1|-8:0.07|43:-0.0|Warnings:
chr10   90770518        T9G     T       G       .       .       Pangolin=ENSG00000026103.22_1|-8:0.03|-8:-0.02|Warnings:
chr10   90770519        C10A    C       A       .       .       Pangolin=ENSG00000026103.22_1|6:0.0|-9:-0.08|Warnings:
chr10   90770519        C10G    C       G       .       .       Pangolin=ENSG00000026103.22_1|-9:0.06|5:-0.0|Warnings:
chr10   90770519        C10T    C       T       .       .       Pangolin=ENSG00000026103.22_1|5:0.0|-9:-0.08|Warnings:
chr10   90770520        T11A    T       A       .       .       Pangolin=ENSG00000026103.22_1|-10:0.25|36:-0.0|Warnings:
chr10   90770520        T11C    T       C       .       .       Pangolin=ENSG00000026103.22_1|-10:0.3|14:-0.0|Warnings:
chr10   90770520        T11G    T       G       .       .       Pangolin=ENSG00000026103.22_1|-10:0.42|4:-0.01|Warnings:

I hope this matches what you expect. I have updated Pangolin's models as well, so the predictions may still differ a bit.
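
To rerun with the old behaviour from a script, the call could be wrapped roughly as below. This is a sketch under assumptions: the positional-argument order (variant file, reference FASTA, annotation database, output prefix) follows my reading of the Pangolin README, and every file name is a hypothetical placeholder. Only the `-m False` flag comes from this thread.

```python
import subprocess
from typing import List


def build_pangolin_command(
    variants: str,
    reference: str,
    annotation_db: str,
    out_prefix: str,
    mask: bool = False,
) -> List[str]:
    """Build the argument list for a Pangolin run with an explicit -m value.

    Assumption: positional order is variants, reference, annotation, output prefix.
    """
    return ["pangolin", variants, reference, annotation_db, out_prefix, "-m", str(mask)]


def run_pangolin(command: List[str]) -> None:
    """Run the command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


# Hypothetical file names; replace with your own inputs before calling run_pangolin.
command = build_pangolin_command(
    "exon_variants.vcf",
    "GRCh37.genome.fa.gz",
    "gencode.annotation.db",
    "exon_variants_unmasked",
    mask=False,
)
print(" ".join(command))
```

Passing `-m` explicitly, rather than relying on the default, keeps results comparable if the default changes again.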

pablo-baeza commented 2 years ago

I see, that makes sense. Thanks a lot!