Closed by pablo-baeza 2 years ago
Thanks for letting me know, the first bug should be fixed now. Sorry for the inconvenience!
It looks like for SCN8A, there are no Ensembl_canonical tagged transcripts, so there were actually no transcripts for that gene in the annotation file. I will update the annotation files soon to include a primary transcript when none are tagged by Ensembl. I think this should fix the issue.
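For anyone who wants to check this on their own annotation, here is a minimal sketch that lists genes whose transcripts carry no given tag in a GTF. It is not part of Pangolin: the function name `genes_without_tag` and the two sample records (`GENE_A`, `GENE_B`) are made up for illustration, and it assumes GENCODE-style attributes of the form `gene_id "..."; tag "...";`.

```python
import re
from collections import defaultdict


def genes_without_tag(gtf_lines, tag="Ensembl_canonical"):
    """Return sorted gene_ids that have transcript records but none carrying `tag`."""
    has_tag = defaultdict(bool)  # gene_id -> True once a tagged transcript is seen
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip GTF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript-level records matter here
        attrs = fields[8]
        gene = re.search(r'gene_id "([^"]+)"', attrs)
        if gene is None:
            continue
        tags = set(re.findall(r'tag "([^"]+)"', attrs))
        has_tag[gene.group(1)] |= tag in tags
    return sorted(g for g, tagged in has_tag.items() if not tagged)


# Hypothetical records, not taken from a real annotation file.
sample = [
    '##description: toy example',
    'chr1\tX\ttranscript\t1\t100\t.\t+\t.\tgene_id "GENE_A"; transcript_id "T1"; tag "basic"; tag "Ensembl_canonical";',
    'chr1\tX\ttranscript\t1\t90\t.\t+\t.\tgene_id "GENE_A"; transcript_id "T2"; tag "basic";',
    'chr2\tX\ttranscript\t5\t50\t.\t-\t.\tgene_id "GENE_B"; transcript_id "T3"; tag "basic";',
]
print(genes_without_tag(sample))
```

To run it on a real file, the lines could be read with `gzip.open(path, "rt")` and passed in directly, since the function only iterates over them.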
thanks a lot @tkzeng! Your model is awesome so I really appreciate you working on this.
After your fix for the first bug, I could run Pangolin on my test dataset, although I had to build a new annotation database using --filter None. This is the code I used, in case it is useful for you:
!pip install pyvcf gffutils biopython pandas pyfastx
!git clone https://github.com/tkzeng/Pangolin.git
%cd Pangolin
!pip install .
%cd /content
!wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz
!wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh37_mapping/gencode.v39lift37.annotation.gtf.gz
!python Pangolin/scripts/create_db.py --filter None gencode.v39lift37.annotation.gtf.gz
!pangolin 002b_test.vcf GRCh37.primary_assembly.genome.fa.gz gencode.v39lift37.annotation.db test.pangolin
The output file looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO
chr12 52082542 G1A G A . . Pangolin=ENSG00000196876.9|26:0.02|0:-0.48
chr12 52082542 G1T G T . . Pangolin=ENSG00000196876.9|38:0.01|0:-0.73
chr12 52082542 G1C G C . . Pangolin=ENSG00000196876.9|-38:0.01|0:-0.66
chr12 52082543 T2A T A . . Pangolin=ENSG00000196876.9|-39:0.01|-1:-0.26
chr12 52082543 T2G T G . . Pangolin=ENSG00000196876.9|37:0.01|-1:-0.31
chr12 52082543 T2C T C . . Pangolin=ENSG00000196876.9|37:0.01|-1:-0.33
chr12 52082576 T35A T A . . Pangolin=ENSG00000196876.9|-33:0.0|-50:0.0
chr12 52082576 T35G T G . . Pangolin=ENSG00000196876.9|-8:0.0|-50:0.0
chr12 52082576 T35C T C . . Pangolin=ENSG00000196876.9|-8:0.01|-50:0.0
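In case it helps with downstream filtering, here is a rough sketch for pulling the numbers out of the INFO column. It assumes the field looks like `Pangolin=gene|offset:score|offset:score`, as in the rows above, and that the first pair is the largest increase and the second the largest decrease in splice score. That reading is my interpretation of the signs in this output, so please check it against the Pangolin README. The function name `parse_pangolin_info` is made up, and any extra `|`-separated parts after the first two pairs are ignored.

```python
def parse_pangolin_info(info):
    """Split 'Pangolin=gene|offset:score|offset:score' into a small dict."""
    prefix = "Pangolin="
    if not info.startswith(prefix):
        raise ValueError(f"not a Pangolin INFO field: {info!r}")
    parts = info[len(prefix):].split("|")
    if len(parts) < 3:
        raise ValueError(f"expected gene plus two offset:score pairs: {info!r}")

    def offset_score(text):
        # Each pair is '<offset relative to the variant>:<score change>'.
        offset, score = text.split(":")
        return int(offset), float(score)

    return {
        "gene": parts[0],
        "increase": offset_score(parts[1]),  # assumed: largest increase
        "decrease": offset_score(parts[2]),  # assumed: largest decrease
    }


row = parse_pangolin_info("Pangolin=ENSG00000196876.9|26:0.02|0:-0.48")
print(row["gene"], row["increase"], row["decrease"])
```

Parsing into tuples rather than keeping the raw string makes it easy to sort variants by the size of the predicted decrease, for example.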
Hi Pablo,
I have updated the GRCh37 annotation file so that at least all protein coding genes have a canonical transcript. I used the parameter --filter Ensembl_canonical,appris_principal,appris_candidate,appris_candidate_longest. I've found that only keeping the most relevant transcripts improves performance for pathogenicity / loss of function prediction, so I would recommend you try the updated annotation files or apply some filtering yourself if you are using -m True (which is the default). If you are using -m False, I have updated the code so that how you filter things makes no difference; the gene just has to exist somewhere in the annotation.
Let me know if you run into any more problems!
Awesome, thanks a million!
Hello,
I used to be able to use the Google Colab notebook just fine, but since the latest updates from a couple of weeks ago, it doesn't seem to work anymore (even when I use the new db file gencode.v38lift37.annotation.Ensembl_canonical.db). I would appreciate it if you could help me figure this one out!
The BRCA example dataset doesn't return any predictions at all. This is what the output VCF file looks like:
The other problem I am having is that Pangolin is complaining about the genome positions I am actually interested in (not just the example dataset) not being in a gene body. My input VCF file is the following:
These are point mutations in the SCN8A gene. These mutations fall inside the gene according to the annotations, but Pangolin returns the following error: