Closed GoogleCodeExporter closed 8 years ago
Dear Holger,
Thank you for sending the input file. I will try to reproduce the error and
will get back to you with an update. Unfortunately there is no workaround
except removing the causing sequence.
Original comment by Maxim.Sc...@gmail.com
on 19 Feb 2013 at 3:06
I could reproduce the error.
Original comment by Maxim.Sc...@gmail.com
on 20 Feb 2013 at 9:52
Dear Holger,
I've just noticed that your FASTA sequences are invalid as they contain many
translation stop symbols (*). My guess is that your sequences are the result of
an ORF prediction tool, is that correct? Unfortunately InterProScan doesn't
print a warning, which it should do in theory. Thank you again for reporting
this. We will fix that as soon as possible.
And yes, we would suggest pre-filtering your sequences and re-analysing them
with InterProScan 5 afterwards.
Kind Regards,
Maxim
Original comment by Maxim.Sc...@gmail.com
on 20 Feb 2013 at 2:34
Hi Maxim,
I'm aware that we have lots of translation stops. The sequences are contigs
from a de-novo assembled transcriptome (http://trinityrnaseq.sourceforge.net/)
which I've just translated into all 6 reading frames. Thus it's not surprising
to have many translation stops in many of the sequences. However, getting
domain annotation is crucial for us to characterize our contigs. Using just the
longest orf of all 6 frames is wrong in many cases because of InDels caused by
the assembly process.
What's the behavior of interproscan at a translation stop symbol? Will it
continue afterwards or just stop?
Thanks a lot for your nice feedback.
Best,
Holger
Original comment by openhol...@gmail.com
on 20 Feb 2013 at 10:08
Dear Holger,
Thanks for your explanation. Honestly, this is the first time we have come
across such a use case. So it is very interesting from my perspective.
But, nothing should stop you from doing the analysis for all 6 frames. Please
feel free to use the Perl script attached, that will take your FASTA file and
split it into sub sequences. The output is a FASTA file with the same header
lines but the ID has "_p" and a number appended to indicate which sub sequence
it is. We also added an option which allows you to specify a minimum ORF size.
The script is run as follows:
perl split.pl in_file (min_orf_size)
It prints the results to STDOUT.
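For readers without access to the attachment, the splitting step described above can be sketched in Python. This is not the attached split.pl, only a minimal illustration of the same idea. The function names (read_fasta, split_at_stops) are hypothetical, and the numbering scheme is an assumption: every stop-delimited piece is counted, including the ones dropped by the minimum size filter. The original script may number its sub sequences differently.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_at_stops(header, sequence, min_orf_size=1):
    """Split a translated sequence at '*' and yield (new_header, sub_sequence).

    The ID (first word of the header) gets "_p<N>" appended. N counts every
    stop-delimited piece, including pieces that are skipped (assumption).
    """
    seq_id, _, description = header.partition(" ")
    for number, piece in enumerate(sequence.split("*"), start=1):
        # Empty pieces (consecutive stops) are always dropped.
        if len(piece) >= max(min_orf_size, 1):
            yield f"{seq_id}_p{number} {description}".strip(), piece


# Small demonstration on an in-memory FASTA record.
demo = io.StringIO(">contig1_frame1 example\nMKT*AA*GHILK\n")
for head, seq in read_fasta(demo):
    for new_head, piece in split_at_stops(head, seq, min_orf_size=3):
        print(f">{new_head}\n{piece}")
```

With a minimum size of 3, the demonstration keeps the pieces MKT and GHILK and drops AA.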
>What's the behavior of interproscan at a translation stop symbol? Will it
continue afterwards or just stop?
At the moment it fails, as you noticed, but in the new release it will stop
immediately with a warning message that your sequences contain asterisks (*).
As InterProScan 5 relies on many external tools and binaries, we are dependent
on their behaviour. For instance, many of our member databases use HMMER
(http://hmmer.janelia.org/) for their functional analysis. We had a look at
the HMMER output file for one of your sequences and noticed that it predicted
domain alignments which contain asterisks (*), and this doesn't feel right.
Here is the example domain alignment from HMMER:
Alignments for each domain:
== domain 1 score: 39.1 bits; conditional E-value: 4.7e-14
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX RF
2o02A00 40 ernlLsvayknvigarRaswriissieqkekgne...kkvklikeyrekiekeLskicedilelldkhLip 107
+rn+L va i++rR++wr++ i ++ k++ kk +l+k+yr+++e+eLskice++ ell++ L++
1 16 KRNILIVA----INSRRENWRTLRYIDHRRKSQGgh*KKEELVKDYRKELETELSKICEEVKELLNRLLLE 82
67788776....789*************9955553366****************************99986 PP
If you need more help with that, please also consider contacting us via EBI
Support & Feedback (http://www.ebi.ac.uk/support/index.php), which is
definitely more confidential.
Kind Regards,
Maxim
Original comment by Maxim.Sc...@gmail.com
on 21 Feb 2013 at 11:48
Attachments:
Somehow my comment got lost (even though I got a confirmation email). Here it
is again:
Hi Maxim,
thank you for your quick reply.
> At the moment it fails, as you noticed, but in the new release it will stop
immediately with a warning message that your sequences contain asterisks (*).
This will be different compared to interproscan4, which predicts domains in
sequences that contain translation stop symbols. I agree with you about your
example: Domains that span over a stop-symbol are not meaningful.
However, conceptually I always thought IPS would take any protein-sequence as
input and try to annotate domains with respect to a local context irrespective
of down/upstream elements like translation stops. IPS4 seemed to be consistent
with this idea.
I would prefer if ips5 would not stop at * but just continue to predict (as
did IPS4), because what we do is to backtranslate the domain regions into
nucleotide coordinates to annotate our contigs along with Blast HSPs. This
becomes much harder if we split the translated output into reading frames for
running the domain prediction. For sure that's a very specific use-case (which
was easy to do with ips4 as explained above), so I'm a little biased here. :-)
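The back-translation mentioned above can be sketched as follows. This is only an illustration under stated assumptions: all coordinates are 1-based and inclusive, frames are numbered 1, 2, 3 (forward) and -1, -2, -3 (reverse complement), the frame number minus one gives the nucleotide offset at which translation starts, and stop symbols occupy one position each in the translated sequence. The function name protein_to_nucleotide is hypothetical, and other tools may use a different frame convention.

```python
def protein_to_nucleotide(aa_start, aa_end, frame, contig_length):
    """Map a protein region to nucleotide coordinates on the forward strand.

    aa_start, aa_end: 1-based inclusive positions in the translated frame,
                      counted with stop symbols still in place.
    frame:            1, 2, 3 for forward frames; -1, -2, -3 for frames read
                      from the reverse complement (assumed convention).
    contig_length:    length of the contig in nucleotides.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    if not 1 <= aa_start <= aa_end:
        raise ValueError("expected 1 <= aa_start <= aa_end")

    shift = abs(frame) - 1                # nucleotides skipped before codon 1
    start = shift + 3 * (aa_start - 1) + 1
    end = shift + 3 * aa_end
    if end > contig_length:
        raise ValueError("region extends beyond the contig")

    if frame > 0:
        return start, end
    # Reverse frames: flip the coordinates back onto the forward strand.
    return contig_length - end + 1, contig_length - start + 1


print(protein_to_nucleotide(2, 3, 2, 12))
print(protein_to_nucleotide(1, 1, -1, 9))
```

The mapping is only correct if the protein positions refer to the full translated frame. Positions reported for a sequence from which stop symbols were removed, or for a split sub sequence, have to be converted back first.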
Best,
Holger
Original comment by holgerbr...@gmail.com
on 25 Feb 2013 at 1:26
Dear Holger,
Having looked at the InterProScan4 code, it does something very naive and just
strips out all asterisks from the input sequence if it finds them. I imagine
that this is not the behaviour you would have assumed as, i) that concatenates
all the reading frames together into a single peptide and ii) it would mess up
the coordinates of the matches (meaning any mapping back you did to the
nucleotide sequence would be wrong anyway). So first of all, I apologise for
the behaviour in InterProScan 4.
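The coordinate problem in point ii) can be illustrated with a small sketch. This is not the InterProScan 4 code, only a hedged reconstruction of the effect described above, and the function names are hypothetical. It removes the asterisks and then shows where a residue of the stripped sequence sits in the original one.

```python
def strip_stops(sequence):
    """Remove all stop symbols, as described for InterProScan 4."""
    return sequence.replace("*", "")


def stripped_to_original(sequence, stripped_pos):
    """Map a 1-based position in the stripped sequence to the original one."""
    seen = 0
    for index, residue in enumerate(sequence, start=1):
        if residue != "*":
            seen += 1
            if seen == stripped_pos:
                return index
    raise ValueError("position lies beyond the stripped sequence")


original = "MKT*AA*GHILK"
stripped = strip_stops(original)          # "MKTAAGHILK"
reported = stripped.index("G") + 1        # position a match would report
actual = stripped_to_original(original, reported)
print(reported, actual, actual - reported)
```

In this example the residue G is reported at position 6 although it sits at position 8 in the input, so the shift equals the number of upstream asterisks.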
The most important, fundamental thing to remember about InterProScan is that it
is predicting *protein* families and domains. We natively include a 6 frame
translation tool for nucleotide sequences to make people's lives easier but
InterPro's member databases all work on amino acid sequences, and so
InterProScan expects valid amino acid sequences as input. This means separate
fasta headers for each peptide sequence, not a concatenation of peptides under
a single header, separated by asterisks. We can't just leave the asterisks in
the sequence because some of the member databases (e.g. Superfamily) will fail
if you try to search them. So, we have had to make a decision about what types
of sequence inputs to expect and support; this is why it would be very
difficult to support the use case you outline.
Of course, now that InterProScan is an open source project, if you think you
could contribute code to support your use case, you are welcome to do so.
Please let me or my team know if you have any other comments or ideas about
InterProScan.
Best regards
Sarah
Original comment by sarahhun...@gmail.com
on 1 Mar 2013 at 1:16
Hi Sarah,
thank you very much for this enlightening information. So far we've indeed used
IPS making some incorrect assumptions about its processing model. So it's good
to have a clearer picture now.
Now I also understand (and totally agree with) Maxim's idea, that IPS5 should
reject sequences containing translation stops to avoid incorrect predictions.
Thank you very much for your support,
Best regards,
Holger
Original comment by holgerbr...@gmail.com
on 4 Mar 2013 at 10:33
Original comment by Maxim.Sc...@gmail.com
on 4 Mar 2013 at 10:56
Hi Sarah,
sorry for bothering you with this issue again.
As you pointed out in your reply, IPS is stripping out asterisks, which shifts
the downstream domain positions in the output. And depending on the number of
upstream asterisks, the shift can become arbitrarily big.
This did not just break my script initially: the IPS website itself is
reporting incorrect domain positions for proteins containing stop codons, and
IPS integrations into major bioinformatics applications (like Geneious) report
incorrect domain positions as well. None of the major IPS tools seem to be
aware of the issue.
So even if I fixed the problem for me by splitting up the sequences into *-free
protein chunks and restoring positions afterwards, it's likely that other
people get incorrect results because of this well-hidden feature/bug.
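A minimal sketch of the workaround mentioned above, assuming the chunks are produced by splitting at every asterisk and that positions are 1-based. The function names are hypothetical and this is not the script actually used. Each chunk remembers its 0-based offset in the original sequence, so a position reported within a chunk can be restored by simple addition.

```python
def split_with_offsets(sequence):
    """Yield (offset, chunk) for every non-empty '*'-free chunk.

    offset is the 0-based start of the chunk in the original sequence.
    """
    offset = 0
    for chunk in sequence.split("*"):
        if chunk:
            yield offset, chunk
        # Advance past the chunk and the stop symbol that followed it.
        offset += len(chunk) + 1


def restore_position(offset, chunk_pos):
    """Convert a 1-based position within a chunk to the original sequence."""
    return offset + chunk_pos


for offset, chunk in split_with_offsets("MKT*AA*GHILK"):
    # Report where each chunk starts and ends in the original sequence.
    print(chunk, restore_position(offset, 1), restore_position(offset, len(chunk)))
```

Storing the offset alongside each chunk (for example in the FASTA header) keeps the restoration independent of how many asterisks precede the chunk.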
Cheers,
Holger
Original comment by holgerbr...@gmail.com
on 2 May 2013 at 9:07
Original issue reported on code.google.com by
holgerbr...@gmail.com
on 19 Feb 2013 at 9:21
Attachments: