pcingola / SnpEff

SnpEff / mat_protein features? #252

Closed bbimber closed 3 years ago

bbimber commented 3 years ago

Hello - I am using SnpEff to annotate SARS-CoV-2 data (NC_045512). As you might know, SARS-CoV-2 has several large ORFs, which are processed into discrete peptides. These features are annotated in the GTF from NCBI as "mat_protein". See here: https://www.ncbi.nlm.nih.gov/nuccore/1798174254. It would be helpful if I could coax SnpEff to report the SNP effects based on these processed proteins, instead of just the primary ORF.

From what I can see in SnpEff's code, it is designed to parse 'mat_protein' features; however, when I process my data, the resulting ANN annotations all refer to ORF1ab. Is there a way to either change how I build a database from the GTF to account for these, or change how I execute SnpEff so that it reports on mat_protein features?

If not, I assume I could alter my GTF to remove ORF1ab and instead include only the mat_protein features, re-labeled as exon?

Thanks for any help.

pcingola commented 3 years ago

I thought that mat_proteins were processed the same as a "transcript" and "product" would be used as transcript ID. I was assuming this should be enough to clearly distinguish between "mat_proteins". Can you send me a simple example (VCF input and output lines) to better understand and debug the issue?

I highly recommend that you don't use GTF files, since SnpEff can process GenBank files directly (and GTF files don't have a standard way to represent ribosomal slippage events, which are present in the SARS-CoV-2 genome).
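
For readers following along, the GenBank route can be sketched roughly as below. This is a minimal sketch, not the maintainer's exact procedure. It assumes you run it from the SnpEff install directory and have already saved the GenBank record locally as `genes.gbk` (a hypothetical local file name). The free-text genome label written to `snpEff.config` is also just a placeholder. The script only runs the build when both files are present; otherwise it prints the command it would have run.

```shell
#!/bin/sh
# Sketch: build a SnpEff database from a GenBank record instead of a GTF.
# Assumptions (not from the thread): current directory is the SnpEff
# install directory, and the GenBank record was saved as genes.gbk.
GENOME="NC_045512.2"
GBK="genes.gbk"

build_cmd="java -jar snpEff.jar build -genbank -v $GENOME"

if [ -f snpEff.jar ] && [ -f "$GBK" ]; then
    # SnpEff looks for the GenBank file at data/<genome>/genes.gbk
    mkdir -p "data/$GENOME"
    cp "$GBK" "data/$GENOME/genes.gbk"
    # Register the genome unless snpEff.config already lists it
    # (the label after the colon is a free-text placeholder)
    grep -q "^$GENOME\.genome" snpEff.config 2>/dev/null ||
        echo "$GENOME.genome : SARS-CoV-2" >> snpEff.config
    $build_cmd
else
    echo "snpEff.jar or $GBK not found; would run: $build_cmd"
fi
```

The guard on `snpEff.config` matters because recent SnpEff releases may already ship an entry for this genome, in which case appending a second one is unnecessary.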

bbimber commented 3 years ago

Thanks for the fast reply - I wasn't aware GenBank parsing was an option.

I also see I might have mistyped: "mat_peptide" is the feature name. I'm re-running SnpEff using this, but will definitely check out the GenBank route.
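
One way to check whether a re-run actually reports per-peptide features is to tally the Feature_ID values in the ANN field of the annotated VCF. The sketch below assumes the standard ANN layout, in which Feature_ID is the 7th `|`-separated subfield of each comma-separated entry. The function name `ann_feature_ids` and the file name in the usage note are hypothetical.

```shell
#!/bin/sh
# Tally Feature_ID values from the ANN field of a SnpEff-annotated VCF
# read on stdin. Prints "<count><TAB><feature id>", sorted by feature id.
ann_feature_ids() {
    awk -F'\t' '
        /^#/ { next }                       # skip VCF header lines
        {
            # INFO is column 8; its key=value pairs are ";"-separated
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) {
                if (info[i] !~ /^ANN=/) continue
                # Drop the "ANN=" prefix, then split into per-feature entries
                m = split(substr(info[i], 5), entries, ",")
                for (j = 1; j <= m; j++) {
                    split(entries[j], f, "[|]")
                    count[f[7]]++           # 7th subfield is Feature_ID
                }
            }
        }
        END { for (id in count) print count[id] "\t" id }
    ' | sort -k2
}
```

Usage would look like `ann_feature_ids < annotated.vcf`. If every line of the output names the same ORF1ab transcript, the mature peptides were not picked up as separate features.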

pcingola commented 3 years ago

Also, all the SARS databases (including SARS-CoV-2) are available for you to download in SnpEff v5.0, so there is no need to build them yourself.

pcingola commented 3 years ago

$ java -jar snpEff.jar download -v NC_045512.2
00:00:00    SnpEff version SnpEff 5.0 (build 2020-08-09 21:23), by Pablo Cingolani
00:00:00    Command: 'download'
00:00:00    Reading configuration file 'snpEff.config'. Genome: 'NC_045512.2'
00:00:00    Reading config file: /Users/kqrw311/workspace/SnpEff/snpEff.config
00:00:00    done
00:00:00    Downloading database for 'NC_045512.2'
00:00:00    Connecting to https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_NC_045512.2.zip
00:00:01    Local file name: '/var/folders/s9/y0bgs3l55rj_jkkkxr2drz4157r1dz/T//snpEff_v5_0_NC_045512.2.zip'

00:00:01    Donwload finished. Total 20946 bytes.
00:00:01    Extracting file 'data/NC_045512.2/snpEffectPredictor.bin'
00:00:01    Unzip: OK
00:00:01    Deleted local file '/var/folders/s9/y0bgs3l55rj_jkkkxr2drz4157r1dz/T//snpEff_v5_0_NC_045512.2.zip'
00:00:01    Done
00:00:01    Logging