pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

SIFT4G Annotator Standalone annotates only "NA" in VCF file #72

Closed mtejura closed 1 year ago

mtejura commented 1 year ago

Hi,

I'm trying to use the SIFT4G annotator for my vcf file ( in the format requested). Once I feed it through the annotator and try to parse the results file, I only see "NA" in all the SIFT annotations. Any thoughts or suggestions would help!

Thanks!

pauline-ng commented 1 year ago

Can you paste some lines of your VCF file so I can troubleshoot? (The header and then first 5 variants will be sufficient.)

mtejura commented 1 year ago

This is how my vcf file looks.

Thank you so much for helping troubleshoot!

CHROM POS ID REF ALT QUAL FILTER INFO
19 38433834 . G C Uncertain significance . .
19 38433853 . C A Uncertain significance . .
19 38433867 . T G Likely pathogenic . .
19 38440748 . G A Uncertain significance . .
19 38440775 . A C Uncertain significance . .

pauline-ng commented 1 year ago
mtejura commented 1 year ago

I just tried adding that header, but the annotator still does the same thing.

mtejura commented 1 year ago

^ oh and i should mention my file is tab-delimited.
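A quick format check can rule out the input file as the cause before looking at the database. The sketch below is a minimal illustration, not part of SIFT4G: the function name `check_vcf_lines` is hypothetical, and it only tests the points raised in this thread (a `##fileformat` first line, a `#CHROM` column header, and at least 8 tab-separated fields per variant line, as the VCF specification requires). Whether the annotator itself enforces each of these is an assumption.

```python
def check_vcf_lines(lines):
    """Return a list of human-readable problems found in VCF text lines.

    Hypothetical helper for illustration; it is not part of SIFT4G.
    """
    problems = []
    # Keep original 1-based line numbers, but ignore blank lines.
    rows = [(n, ln.rstrip("\r\n")) for n, ln in enumerate(lines, start=1) if ln.strip()]

    # The VCF specification expects the first line to declare the format version.
    if not rows or not rows[0][1].startswith("##fileformat=VCF"):
        problems.append("first line is not a ##fileformat=VCFv4.x header")

    # The column header line must start with '#CHROM'.
    if not any(text.startswith("#CHROM") for _, text in rows):
        problems.append("no #CHROM column header line")

    # Every variant line needs the 8 fixed, tab-separated columns.
    for n, text in rows:
        if text.startswith("#"):
            continue
        fields = text.split("\t")
        if len(fields) < 8:
            problems.append(
                f"line {n}: expected at least 8 tab-separated fields, got {len(fields)}"
            )
    return problems
```

To check a file on disk, pass the open file handle, for example `check_vcf_lines(open("ClinVar_RYR1_tab_4.vcf"))`. An empty list means none of these particular problems were found.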

pauline-ng commented 1 year ago

Is this human, and if so, which build? Can you send me your vcf file, and I will take a look.

mtejura commented 1 year ago

Yes, it's human and the build is GRCh38! The github comment won't support the vcf file, where can I send it to you? Thank you!

pauline-ng commented 1 year ago

Can you put it in dropbox and send me a link?

mtejura commented 1 year ago

https://www.dropbox.com/s/p8xb6yzmbxhnhi1/ClinVar_RYR1_tab_4.vcf?dl=0

Here you go!

pauline-ng commented 1 year ago

I was able to annotate your file just fine (I added the header line ##fileformat=VCFv4.1)

My command was:

java -jar SIFT4G_Annotator.jar -c -i ClinVar_RYR1_tab_4.vcf -d Databases/ -r res -t

Databases/ contained chr19 from GRCh38.83: 19.gz and 19.regions

mtejura commented 1 year ago

I ran the same file (with the header) using the command line and the GUI, but all I get is "NA" for the SIFT score and significance. Could it be possible that these variants just aren't represented? When you run the file, do you get annotations for all the variants (including the score?). Also when I click on the hyperlink for your GRCh38.83 it says "Not Found". I've been using GRCh38.78. Is there a difference?

pauline-ng commented 1 year ago

The variants are represented -- all of them had predictions.

It sounds like your database isn't loaded. Make sure to download the GRCh38 database and extract the .zip file so that the database directory contains the .gz and .regions files. Instructions are in step 1a.

Here are the full weblinks:
https://sift.bii.a-star.edu.sg/sift4g/public//Homo_sapiens/GRCh38.78/
https://sift.bii.a-star.edu.sg/sift4g/public//Homo_sapiens/GRCh38.83.chr/
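One way to confirm the database is extracted as described is to look for the per-chromosome files directly. This is a minimal sketch under stated assumptions: the helper `missing_db_files` is hypothetical, and the file naming (`19.gz`, `19.regions`) is taken from the example quoted earlier in this thread. Other database builds may name their files differently.

```python
from pathlib import Path


def missing_db_files(db_dir, chromosomes):
    """List expected <chrom>.gz / <chrom>.regions files absent from db_dir.

    Hypothetical helper; the naming pattern is assumed from the
    '19.gz 19.regions' example in this thread.
    """
    db = Path(db_dir)
    missing = []
    for chrom in chromosomes:
        for suffix in (".gz", ".regions"):
            name = f"{chrom}{suffix}"
            if not (db / name).is_file():
                missing.append(name)
    return missing
```

For the file discussed here, `missing_db_files("Databases", ["19"])` should return an empty list if the directory passed to `-d` is set up the same way as in the command above.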

mtejura commented 1 year ago

I might have got it working. However, there are very few annotations for the RYR1 gene. I've attached my output file in this dropbox link. Is this what your output looks like as well? https://www.dropbox.com/home/SIFT%20annotator%20vcf%20file

pauline-ng commented 1 year ago

Hi,

I'm unable to access your dropbox link. When I look at my results in more detail, I see that the missense substitutions are labeled, but there are no SIFT predictions. If you're just studying the RYR1 protein or a few proteins, please try submitting the protein sequence to SIFT Sequence.

That tool uses the original SIFT algorithm (not SIFT 4G).

Thanks, Pauline

mtejura commented 1 year ago

Hi,

Maybe this link will work.

https://www.dropbox.com/scl/fi/c8bjhilur6clzuw5pw5y5/ClinVar_RYR1_tab_4_SIFTannotations.xls?dl=0&rlkey=3007l0asr974zcwbr9mrs7448

I'm actually trying to annotate a few thousand proteins, so I would have to use the standalone.

pauline-ng commented 1 year ago

I'm getting the same output as you, so maybe it's unique to the RYR1 protein. What if you annotate the other proteins, is it the same result or do you get prediction scores?

mtejura commented 1 year ago

I tried passing through a 'whole genome' file of variants (basically a tab separated vcf file from the clinvar database) and none of these variants have a SIFT prediction either. It looks pretty much like the RYR1 protein. I feel like the format of the file is correct because when I run the commands, I can see how many variants were annotated and how many were not, but again, the end result is no prediction.
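To put a number on "none of these variants have a SIFT prediction", the tab-delimited results file can be tallied. The sketch below is illustrative only: the function names are hypothetical, and the column name `SIFT_SCORE` is an assumption about the annotator's output header, so pass a different `score_column` if your file uses another name.

```python
import csv
from collections import Counter


def summarize_predictions(rows, score_column="SIFT_SCORE"):
    """Count rows with and without a score.

    rows: iterable of dicts, one per output line.
    score_column: assumed column name; adjust to match your results file.
    """
    counts = Counter()
    for row in rows:
        value = (row.get(score_column) or "").strip()
        # Treat an empty cell or the literal 'NA' as a missing prediction.
        counts["missing" if value in ("", "NA") else "scored"] += 1
    return counts


def summarize_file(path, score_column="SIFT_SCORE"):
    """Read a tab-delimited results file and summarize its score column."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return summarize_predictions(reader, score_column)
```

If `scored` is zero for a genome-wide input, that points toward a database or setup issue and away from a gene-specific gap.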

pauline-ng commented 1 year ago

I checked the GRCh38 database manually and it does contain SIFT predictions, but RYR1 does not have any. Assuming these are missense variants, some of your proteins should have predictions. The stats file shows that ~95% of proteins have predictions. There's nothing more I can do at this point, sorry.

mtejura commented 1 year ago

Okay, thank you for helping, I appreciate it. I will try and figure out a solution on my end. Should I come across one, I will post it here.