harta55 / EnTAP

Eukaryotic Non-Model Transcriptome Annotation Pipeline - Latest Release v1.4.0 - Revamped final graphics coming soon!
https://entap.readthedocs.io/en/latest/
GNU General Public License v3.0

add step to strip end of TransDecoder protein sequences of * (stop codon) #22

Closed dy-lin closed 4 years ago

dy-lin commented 4 years ago

I am getting the error below in the InterProScan part of the pipeline because the TransDecoder peptides terminate with * to represent a stop codon. This occurs in EnTAP v0.10.3. I assume it persists in v0.10.4, since that release's tag description is "Fixed an issue where expression analysis transcriptome generation would sometimes fail (error message presented to user as 'frame selection')", and this issue is unrelated to the alignment/BAM/expression filtering step.

30/07/2020 18:12:01:993 Welcome to InterProScan-5.30-69.0
30/07/2020 18:12:16:741 Running InterProScan v5 in STANDALONE mode... on Linux
30/07/2020 18:12:38:433 Loading file /projects/amp/peptaid/hymenoptera/omonticola/venom/annotation_20200728_110451/transcriptomes//rnabloom_final.fasta
30/07/2020 18:12:38:480 Running the following analyses:
[PANTHER-12.0,Pfam-31.0]
Available matches will be retrieved from the pre-calculated match lookup service.

Matches for any sequences that are not represented in the lookup service will be calculated locally.
2020-07-30 18:12:38,505 [amqEmbeddedWorkerJmsContainer-8] [uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:204] ERROR - Execution thrown when attempting to executeInTransaction the StepExecution.  All database activity rolled back.
java.lang.IllegalArgumentException: You have submitted a protein sequence which contains an asterix (*). This may be from an ORF prediction program. '*' is not a valid IUPAC amino acid character and amino acid sequences which go through our pipeline should not contain it. Please strip out all asterix characters from your sequence and resubmit your search.

Example peptide sequence from /projects/amp/peptaid/hymenoptera/omonticola/venom/annotation_20200728_110451/transcriptomes//rnabloom_final.fasta:

>01.U.100227
SRTLADLASLECFVHVDGIPMGDVVTAKQCLLRRNTALRFPLESLEMSKRSVAVFDARF*

harta55 commented 4 years ago

I will resolve this in a new version. A quick workaround is to remove every * from the .pep file produced by TransDecoder, then re-run EnTAP with the same commands.
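For anyone scripting this workaround, here is a minimal sketch in Python. It is not the fix that went into EnTAP; it simply removes every `*` from sequence lines and leaves FASTA header lines (those starting with `>`) untouched. The function names `strip_stop_codons` and `clean_fasta` are made up for illustration.

```python
from typing import Iterable, Iterator


def strip_stop_codons(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with every '*' removed from sequence lines."""
    for line in lines:
        if line.startswith(">"):
            # Header line: pass through unchanged, even if it contains '*'.
            yield line
        else:
            # Sequence line: drop all asterisks (trailing or internal).
            yield line.replace("*", "")


def clean_fasta(in_path: str, out_path: str) -> None:
    """Write a copy of the FASTA file at in_path with '*' stripped."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(strip_stop_codons(src))
```

Calling `clean_fasta("transcripts.pep", "transcripts.clean.pep")` (both file names are placeholders) would write a cleaned copy that can be passed to EnTAP in place of the original. Note that an internal `*` is also removed, which joins the residues on either side of it; if internal stop codons matter for your data, filter those sequences separately.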

dy-lin commented 4 years ago

I am integrating EnTAP into a larger pipeline so manual removal is less than ideal. Any updates on when the next release (0.10.5) will be?

harta55 commented 4 years ago

I was planning to include it in the next major release (0.11.x) in a few weeks, but I'll put in a quick patch for it today. I'll update this issue when it is resolved.

harta55 commented 4 years ago

Fixed in bbdc88547ce613f3634f5e94d58e637982b4b89a and tagged as version 0.10.5