pcingola / SnpEff


Is it possible to print DB creation warnings individually? #210

Closed apredeus closed 4 years ago

apredeus commented 6 years ago

Hello,

I am using SnpEff for custom comparison of our in-house bacterial strains, and it's working very well so far. Making DBs is easy and smooth, and virtually all the info I need is generated in minutes. However, there are some warnings when I create the database, which I would imagine mean problematic assembly or annotation. I would like to look at them in more detail. Is it possible to print something more than a summary in the log? And also, what are "length errors", "stop codon warnings", etc? I'm talking about this type of warning summary:

Protein coding transcripts : 4823
Length errors : 43 ( 0.89% )
STOP codons in CDS errors : 38 ( 0.79% )
START codon errors : 24 ( 0.50% )
STOP codon warnings : 16 ( 0.33% )
UTR sequences : 0 ( 0.00% )
Total Errors : 64 ( 1.33% )

Thank you in advance!

meixilin commented 4 years ago

Hi,

did you figure out a solution to this? Thank you so much!

apredeus commented 4 years ago

Hi. Not really - but I experimented with different formats of the annotation, and figured out a way to reduce the warnings to a minimum. Most of them were due to the presence of pseudogenes (stop codon in the middle of a feature) or ncRNAs that were not annotated as such. Once you fix those, there are virtually no warnings.
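For readers who want to locate such features before rebuilding, here is a minimal Python sketch that scans GFF3 lines and lists features flagged as pseudogenes. It is not part of SnpEff. The function names (`parse_attributes`, `find_pseudogenes`) are hypothetical, and the markers it checks (`pseudogene` feature type, `pseudo=true`, `gene_biotype=pseudogene`) are assumptions modeled on NCBI-style GFF3 files; other annotation pipelines may use different tags.

```python
def parse_attributes(field):
    """Parse a GFF3 column-9 string such as 'ID=gene1;pseudo=true' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_pseudogenes(gff_lines):
    """Return (seqid, start, end, strand, type, ID) for features that look like pseudogenes.

    Assumption: pseudogenes are marked NCBI-style, via the feature type or
    the 'pseudo' / 'gene_biotype' attributes. Adjust for your own annotation.
    """
    hits = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, attr_field = cols
        attrs = parse_attributes(attr_field)
        is_pseudo = (
            ftype == "pseudogene"
            or attrs.get("pseudo") == "true"
            or attrs.get("gene_biotype") == "pseudogene"
        )
        if is_pseudo:
            hits.append((seqid, int(start), int(end), strand, ftype, attrs.get("ID", "")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python find_pseudogenes.py genes.gff
    with open(sys.argv[1]) as handle:
        for hit in find_pseudogenes(handle):
            print("\t".join(str(x) for x in hit))
```

The listed features can then be removed or re-typed in the GFF by hand or with a follow-up script before building the database again.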

meixilin commented 4 years ago

Hi,

thanks a lot for your prompt reply! Sorry if this question is too naive. Is there some way to pull out the genes with warnings from the snpEff process, or do we have to search for pseudogenes/ncRNAs in the GFF file, remove them from the GFF file and build the database again? Would you happen to have some scripts to share? Thank you so much!

Best, M

apredeus commented 4 years ago

Sorry, I don't have anything specific - usually these are makeshift commands I use. You can use bedtools to extract gene sequences in nucleotide form, and then convert them to predicted proteins using EMBOSS transeq (both bedtools and EMBOSS can be easily installed using bioconda).

After this, just look for genes with stop codons in the wrong place, without a stop at the end, etc.
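The checks described above can also be sketched directly in Python, without the transeq step. This is an illustration, not SnpEff's own logic: it assumes the CDS nucleotide sequences have already been extracted to a FASTA file in the coding orientation (for example with `bedtools getfasta -s`), and the names `check_cds`, `translate`, `read_fasta` and the `START_CODONS` set are hypothetical. The codon table is the standard one, whose amino acid assignments are shared with the bacterial table 11.

```python
from itertools import product

# Standard genetic code in TCAG order; bacterial table 11 uses the same assignments.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Assumption: the most common bacterial start codons; extend if your organism needs it.
START_CODONS = {"ATG", "GTG", "TTG"}


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. containing N) become 'X'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def check_cds(seq):
    """Return a list of human-readable problems found in one CDS sequence."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")
    if seq[:3] not in START_CODONS:
        problems.append("unexpected start codon")
    protein = translate(seq)
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    if not protein.endswith("*"):
        problems.append("no terminal stop codon")
    return problems


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_cds.py cds.fasta
    for name, sequence in read_fasta(sys.argv[1]):
        for problem in check_cds(sequence):
            print(f"{name}\t{problem}")
```

These categories only loosely mirror the ones in the SnpEff summary, so the counts will not necessarily match, but the flagged genes are a reasonable starting list for inspecting the annotation.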

Hope this helps!

meixilin commented 4 years ago

got it! thanks a lot! have a good day! 😄

pcingola commented 4 years ago

Closing old issues.