pcingola / SnpEff

Error: Cannot find first coding exon for transcript when building database #230

Closed tshalev closed 3 years ago

tshalev commented 5 years ago

Hello,

I am trying to build a database for a tree species (western redcedar, Thuja plicata). I have a draft genome and some annotations in GFF3 format. When I try to build the database, I get the following error:

Adjusting transcripts: Adjusting genes: Adjusting chromosomes lengths: Ranking exons: .................................................................................................... 10000 .................................................................................................... 20000 .................................................................................................... 30000 .................................................................................................... 40000 .................................................................................................... 50000 ............................................................ Create UTRs from CDS (if needed): Correcting exons based on frame information. ....java.lang.RuntimeException: Error: Cannot find first coding exon for transcript: 29184128:-672-2175, strand: -, id:PAC4GC:47054313, bioType:protein_coding, Protein 5'UTR : 29184128 2067-2175 UTR_5_PRIME 'PAC4GC:47054313.five_prime_UTR.1' Exons: 29184128:-672--546 'PAC4GC:47054313.exon.2', rank: 3, frame: 2, sequence: cttctaccctgaatctgatgagcttgctgtgggaaaatacagtcccaacaagctggaacagtggtacagatccctgtgactttcactgggatggggtgaactgcacaaatggccgcataacgtcact 29184128:-200--7 'PAC4GC:47054313.exon.1', rank: 2, frame: ., sequence: tactagtgtaaccctcataatttgcaggctcttctttttcttcaattttagccactattactgtttgaactcttaacttattttggcatgacataagttcaaatagaatatgaggactagatgttttggtgggttatgcttgatttttcttttcatggcttccctcttctttggagtcacaaacagcgatgatg 29184128:37-112 'PAC4GC:47054313.exon.3', rank: 1, frame: 1, sequence: aaaattatcaagcgtggggcttaagggagctctctcaaataaaattggttctctgacagcacttcatactctgtaa CDS : ctttttcttcaattttagccactattactgtttgaactcttaacttattttggcatgacataagttcaaatagaatatgaggactagatgttttggtgggttatgcttgatttttcttttcatggcttccctcttctttggagtcacaaacagcgatgatgcttctaccctgaatctgatgagcttgctgtgggaaaatacagtcccaacaagctggaacagtggtacagatccctgtgactttcactgggatggggtgaactgcacaaatggccgcataacgtcact Protein : 
LFLQFPLLLFELLTYFGMTVQIEYEDMFWWVMLDFSFHGFPLLWSHKQRCFYPESDELAVGKYSPNKLEQWYRSL*LSLGWGELHKWPHNVT

at org.snpeff.interval.Transcript.getFirstCodingExon(Transcript.java:1136)
at org.snpeff.interval.Transcript.frameCorrectionFirstCodingExon(Transcript.java:909)
at org.snpeff.interval.Transcript.frameCorrection(Transcript.java:878)
at org.snpeff.snpEffect.factory.SnpEffPredictorFactory.frameCorrection(SnpEffPredictorFactory.java:596)
at org.snpeff.snpEffect.factory.SnpEffPredictorFactory.finishUp(SnpEffPredictorFactory.java:545)
at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryGff.create(SnpEffPredictorFactoryGff.java:348)
at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
at org.snpeff.SnpEff.run(SnpEff.java:1183)
at org.snpeff.SnpEff.main(SnpEff.java:162)

java.lang.RuntimeException: Error reading file '/mnt/e/tal/Documents/UBC/GSAT/PhD/WRC/GS/wrc/snps/S_lines/filtering_for_pop_gen/new_analysis/snpEff/./data/tpli_3.1/genes.gff' java.lang.RuntimeException: Error: Cannot find first coding exon for transcript: 29184128:-672-2175, strand: -, id:PAC4GC:47054313, bioType:protein_coding, Protein 5'UTR : 29184128 2067-2175 UTR_5_PRIME 'PAC4GC:47054313.five_prime_UTR.1' Exons: 29184128:-672--546 'PAC4GC:47054313.exon.2', rank: 3, frame: 2, sequence: cttctaccctgaatctgatgagcttgctgtgggaaaatacagtcccaacaagctggaacagtggtacagatccctgtgactttcactgggatggggtgaactgcacaaatggccgcataacgtcact 29184128:-200--7 'PAC4GC:47054313.exon.1', rank: 2, frame: ., sequence: tactagtgtaaccctcataatttgcaggctcttctttttcttcaattttagccactattactgtttgaactcttaacttattttggcatgacataagttcaaatagaatatgaggactagatgttttggtgggttatgcttgatttttcttttcatggcttccctcttctttggagtcacaaacagcgatgatg 29184128:37-112 'PAC4GC:47054313.exon.3', rank: 1, frame: 1, sequence: aaaattatcaagcgtggggcttaagggagctctctcaaataaaattggttctctgacagcacttcatactctgtaa CDS : ctttttcttcaattttagccactattactgtttgaactcttaacttattttggcatgacataagttcaaatagaatatgaggactagatgttttggtgggttatgcttgatttttcttttcatggcttccctcttctttggagtcacaaacagcgatgatgcttctaccctgaatctgatgagcttgctgtgggaaaatacagtcccaacaagctggaacagtggtacagatccctgtgactttcactgggatggggtgaactgcacaaatggccgcataacgtcact Protein : LFLQFPLLLFELLTYFGMTVQIEYEDMFWWVMLDFSFHGFPLLWSHKQRCFYPESDELAVGKYSPNKLEQWYRSL*LSLGWGELHKWPHNVT

at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryGff.create(SnpEffPredictorFactoryGff.java:353)
at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
at org.snpeff.SnpEff.run(SnpEff.java:1183)
at org.snpeff.SnpEff.main(SnpEff.java:162)

00:22:17 Logging 00:22:18 Checking for updates...

When I try deleting the offending sequence from the gff file it just finds an issue with another one. For reference, the gff file looks like this on this sequence:

##gff-version 3
##annot-version v3.1
##species Thuja plicata
29184128 JGI_gene mRNA 38 2176 . - . ID=PAC4GC:47054313;Name=Thpliv31003279m;longest=1;Parent=Thpliv31003279m.g
29184128 JGI_gene exon 1983 2176 . - . ID=PAC4GC:47054313.exon.1;Parent=PAC4GC:47054313
29184128 JGI_gene CDS 1983 2067 . - 0 ID=PAC4GC:47054313.CDS.1;Parent=PAC4GC:47054313
29184128 JGI_gene five_prime_UTR 2068 2176 . - . ID=PAC4GC:47054313.five_prime_UTR.1;Parent=PAC4GC:47054313
29184128 JGI_gene exon 1511 1637 . - . ID=PAC4GC:47054313.exon.2;Parent=PAC4GC:47054313
29184128 JGI_gene CDS 1511 1637 . - 2 ID=PAC4GC:47054313.CDS.2;Parent=PAC4GC:47054313
29184128 JGI_gene exon 38 113 . - . ID=PAC4GC:47054313.exon.3;Parent=PAC4GC:47054313
29184128 JGI_gene CDS 38 113 . - 1 ID=PAC4GC:47054313.CDS.3;Parent=PAC4GC:47054313

Sorry if this is kind of messy, I couldn't figure out how to make the table look better here.
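Since the build fails during the "Correcting exons based on frame information" step, one thing that may be worth ruling out is an inconsistent phase column on the CDS records. The following is a minimal diagnostic sketch, not SnpEff's own frame-correction logic: it assumes a tab-separated GFF3 file with `key=value` attributes, and the function names `parse_gff3` and `check_cds_phases` are made up for illustration. For the three CDS records in the excerpt the phases come out consistent, so this check alone would not explain the error for this transcript.

```python
from collections import defaultdict

# CDS records copied from the GFF3 excerpt in this post.
# Real GFF3 files are tab-separated, so the spaces are converted to tabs here.
SAMPLE = [rec.replace(" ", "\t") for rec in [
    "29184128 JGI_gene CDS 1983 2067 . - 0 ID=PAC4GC:47054313.CDS.1;Parent=PAC4GC:47054313",
    "29184128 JGI_gene CDS 1511 1637 . - 2 ID=PAC4GC:47054313.CDS.2;Parent=PAC4GC:47054313",
    "29184128 JGI_gene CDS 38 113 . - 1 ID=PAC4GC:47054313.CDS.3;Parent=PAC4GC:47054313",
]]


def parse_gff3(lines):
    """Yield (type, start, end, strand, phase, attrs) for each feature line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, pragmas and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        yield cols[2], int(cols[3]), int(cols[4]), cols[6], cols[7], attrs


def check_cds_phases(lines):
    """Return (parent, cds_id, expected_phase, found_phase) for each mismatch."""
    cds_by_parent = defaultdict(list)
    for ftype, start, end, strand, phase, attrs in parse_gff3(lines):
        if ftype == "CDS":
            cds_by_parent[attrs.get("Parent")].append(
                (start, end, strand, phase, attrs.get("ID")))

    problems = []
    for parent, segments in cds_by_parent.items():
        # Walk the segments in transcription order (descending on '-' strand).
        minus = segments[0][2] == "-"
        segments.sort(key=lambda seg: seg[0], reverse=minus)
        expected = None  # the first segment's phase is taken as given
        for start, end, _strand, phase, cds_id in segments:
            found = int(phase) if phase.isdigit() else None
            if expected is not None and found != expected:
                problems.append((parent, cds_id, expected, found))
            if found is None:
                expected = None  # cannot predict the next phase
                continue
            length = end - start + 1
            # GFF3 phase: bases to skip to reach the next complete codon.
            expected = (3 - (length - found) % 3) % 3
    return problems


if __name__ == "__main__":
    print(check_cds_phases(SAMPLE))  # [] for these three records
```

To run it on a whole annotation, pass an open file handle instead of `SAMPLE`, for example `check_cds_phases(open("genes.gff"))`.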

VenithaB commented 4 years ago

Hi! I'm getting the same error!

Adjusting transcripts: Adjusting genes: Adjusting chromosomes lengths: Ranking exons: .................................................................................................... 10000 .................................................................................................... 20000 .................................................................................................... 30000 ............................................ Create UTRs from CDS (if needed): Correcting exons based on frame information. java.lang.RuntimeException: Error: Cannot find first coding exon for transcript: NIGP01000374:-3367-38263, strand: -, id:AAEL023102-RA 5'UTR : NIGP01000374 38195-38263 UTR_5_PRIME 'UTR5_NIGP01000374_38196_38264' Exons: NIGP01000374:-3367--3191 'EXON_NIGP01000374_38088_38264', rank: 2, frame: .,sequence: tcgcctacaatgctcaactagaaacaattactctaaggcgaaatccatctcacgttccaacctacgaaaatgcaattgaatggcacggtaacgatggctgcctcatctgaaccacccgagcctccacctcgcaatccggacaagatcaatgcatcactcaagcagctagccgaatcg

NIGP01000374:11027-11653 'EXON_NIGP01000374_11028_11654', rank: 1, frame: 0, sequence: aaaacccgttcgctggatacggccaccgataagacaaccgctccggccaccggtgcccgaccattccggcctatcctgtcgctggacaatgcaaagccattaacgaagccattcgaatcatctggaacgcccacgtcggcaccagcctcgtcgtttgccaacagtaacagtaacaacaataacaatggcagcagtcacaacagcagcatggaatcgaattcgaccagcacaaccgggggtccaaactcgggcaccggaaccagtggaagcagcatcagtagttccggtggaggcggaggtggtgacaatggccctgctgctgctgctgctgaactggtgagaggtggttcctcaggtagcggagtaagtccaccgggtgaaggcggtggaatagctggtcaaattggtaacaaattgaactccggtcaacagcagatctcgcccacgcagagtgaaaagagcagcacaggtgggagcaaggagcagtccggtgataattcgggcggcgataacctgttcaagaacggtgtgacagatctaggtgagtcgatagtattgttggtttatttggtaacatgtggaggtggagaattccgtatgaatatgattcatttttcatgatcgtaa

3'UTR : NIGP01000374 11027-11032 UTR_3_PRIME 'UTR3_NIGP01000374_11028_11033'

at org.snpeff.interval.Transcript.getFirstCodingExon(Transcript.java:1136)
at org.snpeff.interval.Transcript.frameCorrectionFirstCodingExon(Transcript.java:909)
at org.snpeff.interval.Transcript.frameCorrection(Transcript.java:878)
at org.snpeff.snpEffect.factory.SnpEffPredictorFactory.frameCorrection(SnpEffPredictorFactory.java:596)
at org.snpeff.snpEffect.factory.SnpEffPredictorFactory.finishUp(SnpEffPredictorFactory.java:545)
at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryGff.create(SnpEffPredictorFactoryGff.java:348)
at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
at org.snpeff.SnpEff.run(SnpEff.java:1183)
at org.snpeff.SnpEff.main(SnpEff.java:162)

java.lang.RuntimeException: Error reading file '/home/group_AM/Venitha/installations/snpEff_latest_core/snpEff/./data/AaegL5/genes.gtf'

tshalev commented 4 years ago

My solution was to not use SnpEff and use Variant Effect Predictor instead.

jiabowang commented 4 years ago

Hi there, I have solved this issue. This error means that some genes are present in the GTF file but not in the FASTA file, so we just have to remove those genes from the GTF file. For example: sed -i "/ENSBGRT00000033763/d" genes.gtf

That worked for my data: the bin file is now in my dataset folder.
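One way to look for such mismatches before deleting anything is to compare the sequence names used by the annotation with the headers in the FASTA file. This is only a sketch under one reading of the comment above, namely that the sequence name in column 1 of the GTF/GFF has no matching FASTA record; the function names are hypothetical and nothing here is part of SnpEff.

```python
def fasta_ids(lines):
    """Collect sequence IDs: the first word after '>' on each header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def annotation_seqids(lines):
    """Collect the sequence names found in column 1 of a GTF/GFF file."""
    ids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, pragmas and comments
        ids.add(line.split("\t", 1)[0])
    return ids


def missing_from_fasta(annotation_lines, fasta_lines):
    """Return sorted sequence names that are annotated but absent from the FASTA."""
    return sorted(annotation_seqids(annotation_lines) - fasta_ids(fasta_lines))


if __name__ == "__main__":
    # Tiny in-memory example with made-up names, standing in for real files.
    gtf = [
        "chrA\tsrc\texon\t1\t10\t.\t+\t.\tgene_id \"g1\";",
        "chrB\tsrc\texon\t5\t20\t.\t-\t.\tgene_id \"g2\";",
    ]
    fasta = [">chrA some description", "ACGTACGTAC"]
    print(missing_from_fasta(gtf, fasta))  # ['chrB']
```

With real data the two arguments would be open file handles for the annotation and the genome FASTA. Any names this reports are candidates for the kind of removal shown with sed above, although renaming the sequences so that both files agree may be the better fix.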

pcingola commented 3 years ago

Closing old issues.

fanhuan commented 7 hours ago

I ran into a similar problem, and it was because the 5' UTR came after the start codon in one gene. FYI.
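A quick way to scan an annotation for this situation is to check, per transcript and with the strand taken into account, that every 5' UTR lies entirely upstream of the CDS. The sketch below assumes tab-separated GFF3 with `key=value` attributes (GTF attributes would need a different parser), treats the feature type names in `FIVE_PRIME_TYPES` as an assumption about how the file labels its UTRs, and uses a hypothetical function name.

```python
from collections import defaultdict

# Assumed spellings of the 5' UTR feature type; adjust to match the file.
FIVE_PRIME_TYPES = {"five_prime_UTR", "five_prime_utr", "5UTR"}


def misplaced_five_prime_utrs(lines):
    """Return (parent, utr_id) for each 5' UTR that is not upstream of the CDS."""
    cds = defaultdict(list)    # parent -> [(start, end), ...]
    utrs = defaultdict(list)   # parent -> [(start, end, utr_id), ...]
    strand_of = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip pragmas, comments and malformed lines
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        start, end = int(cols[3]), int(cols[4])
        if cols[2] == "CDS":
            cds[parent].append((start, end))
            strand_of[parent] = cols[6]
        elif cols[2] in FIVE_PRIME_TYPES:
            utrs[parent].append((start, end, attrs.get("ID")))

    flagged = []
    for parent, utr_list in utrs.items():
        if parent not in cds:
            continue  # no CDS to compare against
        cds_min = min(start for start, _ in cds[parent])
        cds_max = max(end for _, end in cds[parent])
        for start, end, utr_id in utr_list:
            if strand_of[parent] == "-":
                upstream = start > cds_max  # 5' end has the higher coordinates
            else:
                upstream = end < cds_min
            if not upstream:
                flagged.append((parent, utr_id))
    return flagged


if __name__ == "__main__":
    # Made-up plus-strand transcript whose 5' UTR sits inside the CDS span.
    demo = [
        "chrA\tsrc\tCDS\t100\t200\t.\t+\t0\tID=c1;Parent=t1",
        "chrA\tsrc\tfive_prime_UTR\t150\t180\t.\t+\t.\tID=u1;Parent=t1",
    ]
    print(misplaced_five_prime_utrs(demo))  # [('t1', 'u1')]
```

For the Thuja plicata transcript posted at the top of this thread (minus strand, 5' UTR at 2068-2176, CDS ending at 2067) this check reports nothing, so it covers only the specific cause described in this comment.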