tkzeng / Pangolin

Pangolin is a deep-learning method for predicting splice site strengths.
GNU General Public License v3.0
64 stars 32 forks source link

WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match? #2

Closed pablo-baeza closed 2 years ago

pablo-baeza commented 2 years ago

Hello,

I used to be able to use the Google collab notebook just fine, but with the latest updates from a couple of weeks ago, it doesn't seem to work anymore (even when I use the new db file gencode.v38lift37.annotation.Ensembl_canonical.db). I would appreciate it if you could help me figure this one out!

The Brca example dataset doesn't return any predictions at all. This is what the output VCF file looks like:

##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=Pangolin,Number=.,Type=String,Description="Pangolin splice scores. Format: gene|pos:score_change|pos:score_change|...">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr17   41276135    .   TTAA    GTCT    .   .   Pangolin=
chr17   41276135    .   T   G   .   .   Pangolin=
chr17   41276135    .   T   C   .   .   Pangolin=
chr17   41276135    .   T   A   .   .   Pangolin=
chr17   41276134    .   T   G   .   .   Pangolin=
chr17   41276134    .   T   C   .   .   Pangolin=
chr17   41276134    .   T   A   .   .   Pangolin=
chr17   41276133    .   C   T   .   .   Pangolin=
chr17   41276133    .   C   G   .   .   Pangolin=
chr17   41276133    .   C   A   .   .   Pangolin=
chr17   41276132    .   A   T   .   .   Pangolin=
chr17   41276132    .   A   G   .   .   Pangolin=
chr17   41276132    .   A   C   .   .   Pangolin=
chr17   41276131    .   A   T   .   .   Pangolin=
chr17   41276131    .   A   G   .   .   Pangolin=
chr17   41276131    .   A   C   .   .   Pangolin=
chr17   41276130    .   G   T   .   .   Pangolin=
chr17   41276130    .   G   C   .   .   Pangolin=
chr17   41276130    .   G   A   .   .   Pangolin=
chr17   41276129    .   T   G   .   .   Pangolin=
chr17   41276129    .   T   C   .   .   Pangolin=
chr17   41276129    .   T   A   .   .   Pangolin=
chr17   41276128    .   A   T   .   .   Pangolin=
chr17   41276128    .   A   C   .   .   Pangolin=
chr17   41276127    .   A   T   .   .   Pangolin=
chr17   41276127    .   A   G   .   .   Pangolin=
chr17   41276127    .   A   C   .   .   Pangolin=
chr17   41276126    .   C   T   .   .   Pangolin=
chr17   41276126    .   C   G   .   .   Pangolin=
chr17   41276126    .   C   A   .   .   Pangolin=
chr17   41276125    .   C   T   .   .   Pangolin=
chr17   41276125    .   C   G   .   .   Pangolin=
chr17   41276125    .   C   A   .   .   Pangolin=
chr17   41276124    .   T   G   .   .   Pangolin=
chr17   41276124    .   T   C   .   .   Pangolin=
chr17   41276124    .   T   A   .   .   Pangolin=

The other problem I am having is that Pangolin is complaining about the genome positions I am actually interested in (not just the example dataset) not being in a gene body. My input VCF file is the following:

##fileformat=VCFv4.2
##fileDate=20191004
##reference=GRCh37/hg19
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr12   52082542    G1A G   A   .   .   .
chr12   52082542    G1T G   T   .   .   .
chr12   52082542    G1C G   C   .   .   .
chr12   52082543    T2A T   A   .   .   .
chr12   52082543    T2G T   G   .   .   .
chr12   52082543    T2C T   C   .   .   .
chr12   52082576    T35A    T   A   .   .   .
chr12   52082576    T35G    T   G   .   .   .
chr12   52082576    T35C    T   C   .   .   .

These are point mutations in the SCN8A gene. These mutations fall inside the gene according to the annotations, but Pangolin returns the following error:

Using GPU
[Line 30] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 31] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 32] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 33] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 34] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 35] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 36] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 37] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
[Line 38] WARNING, skipping variant: Variant not contained in a gene body. Do GTF/FASTA chromosome names match?
tkzeng commented 2 years ago

Thanks for letting me know, the first bug should be fixed now. Sorry for the inconvenience!

It looks like for SCN8A, there are no Ensembl_canonical tagged transcripts, so there were actually no transcripts in the annotation file. I will update the annotation files soon to include a primary transcript when there are none tagged by Ensembl--I think this should fix the issue.

pablo-baeza commented 2 years ago

thanks a lot @tkzeng! Your model is awesome so I really appreciate you working on this.

After your bug no.1 fix, I could run pangolin on my test dataset, although I had to build a new annotation database using --filter None. This is the code I used, in case it is useful for you:

!pip install pyvcf gffutils biopython pandas pyfastx
!git clone https://github.com/tkzeng/Pangolin.git
%cd Pangolin
!pip install .
%cd /content

!wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz
!wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh37_mapping/gencode.v39lift37.annotation.gtf.gz

! python Pangolin/scripts/create_db.py --filter None gencode.v39lift37.annotation.gtf.gz

!pangolin 002b_test.vcf GRCh37.primary_assembly.genome.fa.gz gencode.v39lift37.annotation.db test.pangolin

The output file looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr12   52082542    G1A G   A   .   .   Pangolin=ENSG00000196876.9|26:0.02|0:-0.48
chr12   52082542    G1T G   T   .   .   Pangolin=ENSG00000196876.9|38:0.01|0:-0.73
chr12   52082542    G1C G   C   .   .   Pangolin=ENSG00000196876.9|-38:0.01|0:-0.66
chr12   52082543    T2A T   A   .   .   Pangolin=ENSG00000196876.9|-39:0.01|-1:-0.26
chr12   52082543    T2G T   G   .   .   Pangolin=ENSG00000196876.9|37:0.01|-1:-0.31
chr12   52082543    T2C T   C   .   .   Pangolin=ENSG00000196876.9|37:0.01|-1:-0.33
chr12   52082576    T35A    T   A   .   .   Pangolin=ENSG00000196876.9|-33:0.0|-50:0.0
chr12   52082576    T35G    T   G   .   .   Pangolin=ENSG00000196876.9|-8:0.0|-50:0.0
chr12   52082576    T35C    T   C   .   .   Pangolin=ENSG00000196876.9|-8:0.01|-50:0.0
tkzeng commented 2 years ago

Hi Pablo,

I have updated the GRCh37 annotation file so that at least all protein coding genes have a canonical transcript. I used the parameter --filter Ensembl_canonical,appris_principal,appris_candidate,appris_candidate_longest. I've found that only keeping the most relevant transcripts improves performance for pathogenicity / loss of function prediction, so I would recommend you try the updated annotation files or apply some filtering yourself if you are using -m True (which is the default). If you are using -m False, I have updated the code so that how you filter things makes no difference--the gene just has to exist somewhere in the annotation.

Let me know if you run into any more problems!

pablo-baeza commented 2 years ago

Awesome, thanks a million!