Attempting to create a Gene3D match that has no alignment data

GoogleCodeExporter commented 8 years ago

This is what I ran:
> interproscan.sh -i test.fa -f tsv --goterms --pathways
Welcome to InterProScan 5RC4
Running the following analyses:
[jobProDom-2006.1, jobPanther-7.2, jobGene3d-3.3.0, jobSMART-6.2, 
jobTIGRFAM-12.0, jobPfamA-26.0, jobSuperFamily-1.75, jobPrositePatterns-20.83, 
jobPRINTS-42.0, jobPIRSF-2.82, jobPrositeProfiles-20.83, jobHAMAP-201207.04, 
jobCoils-2.2]
Running InterProScan v5 in STANDALONE mode...

with this Java version:
java version "1.7.0_13"
Java(TM) SE Runtime Environment (build 1.7.0_13-b20)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

After some seconds it crashes with the following stacktrace:

java.lang.IllegalStateException: Attempting to create a Gene3D match that has 
no alignment data.
    at uk.ac.ebi.interpro.scan.io.match.hmmer.hmmer3.Gene3DHmmer3ParserSupport.createMatch(Gene3DHmmer3ParserSupport.java:32)
    at uk.ac.ebi.interpro.scan.io.match.hmmer.hmmer3.Gene3DHmmer3ParserSupport.createMatch(Gene3DHmmer3ParserSupport.java:16)
    at uk.ac.ebi.interpro.scan.io.match.hmmer.hmmer3.AbstractHmmer3ParserSupport.addMatch(AbstractHmmer3ParserSupport.java:85)
    at uk.ac.ebi.interpro.scan.io.match.hmmer.hmmer3.Gene3DHmmer3ParserSupport.addMatch(Gene3DHmmer3ParserSupport.java:16)
    at uk.ac.ebi.interpro.scan.io.match.hmmer.hmmer3.Hmmer3SearchMatchParser.parse(Hmmer3SearchMatchParser.java:183)
    at uk.ac.ebi.interpro.scan.management.model.implementations.ParseStep.execute(ParseStep.java:66)
    at uk.ac.ebi.interpro.scan.jms.activemq.StepExecutionTransactionImpl.executeInTransaction(StepExecutionTransactionImpl.java:84)
    at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:319)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
    at sun.proxy.$Proxy94.executeInTransaction(Unknown Source)
    at uk.ac.ebi.interpro.scan.jms.activemq.WorkerListener.onMessage(WorkerListener.java:102)

Is there any workaround to make InterProScan ignore the error and continue 
with the next sequence? Or can I pre-filter my sequences to work around the 
problem?

Original issue reported on code.google.com by holgerbr...@gmail.com on 19 Feb 2013 at 9:21

Attachments:

test.fa

GoogleCodeExporter commented 8 years ago

Dear Holger,
Thank you for sending the input file. I will try to reproduce the error and 
will get back to you with an update. Unfortunately, there is no workaround 
other than removing the offending sequence.

Original comment by Maxim.Sc...@gmail.com on 19 Feb 2013 at 3:06

Changed state: Started

GoogleCodeExporter commented 8 years ago

I could reproduce the error.

Original comment by Maxim.Sc...@gmail.com on 20 Feb 2013 at 9:52

GoogleCodeExporter commented 8 years ago

Dear Holger,
I've just noticed that your FASTA sequences are invalid, as they contain many 
translation stop symbols (*). My guess is that your sequences are the result of 
an ORF prediction tool; is that correct? Unfortunately, InterProScan doesn't 
print a warning in this case, although it should. Thank you again for reporting 
this. We will fix that as soon as possible.
And yes, we would suggest pre-filtering your sequences and re-analysing them 
with InterProScan 5 afterwards.

Kind Regards,
Maxim

Original comment by Maxim.Sc...@gmail.com on 20 Feb 2013 at 2:34

Changed state: Started
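
As a rough illustration of the pre-filtering suggested above, the sketch below lists the IDs of FASTA records that contain a translation stop symbol. It is not part of InterProScan; the function name `ids_with_stops` is hypothetical, and it assumes that the ID is the first whitespace-delimited token of the header line. Note that it also flags a single trailing `*`, which some translation tools append.

```python
def ids_with_stops(lines):
    """Return the IDs of FASTA records whose sequence contains '*'."""
    flagged, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            fields = line[1:].split()
            current = fields[0] if fields else ""
        elif "*" in line and current is not None and current not in flagged:
            flagged.append(current)
    return flagged


# Made-up example records, not taken from test.fa.
example = [">contig1_frame1", "MKT*AAG", ">contig2_frame1", "MKTAAG"]
print(ids_with_stops(example))
```

To check a real file, the same function can be passed an open file handle, since it only iterates over lines.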

GoogleCodeExporter commented 8 years ago

Hi Maxim,

I'm aware that we have lots of translation stops. The sequences are contigs 
from a de-novo assembled transcriptome (http://trinityrnaseq.sourceforge.net/) 
which I've just translated into all 6 reading frames. Thus it's not surprising 
to have many translation stops in many of the sequences. However, getting 
domain annotation is crucial for us to characterize our contigs. Using just the 
longest orf of all 6 frames is wrong in many cases because of InDels caused by 
the assembly process. 

What's the behavior of interproscan at a translation stop symbol? Will it 
continue afterwards or just stop?

Thanks a lot for your nice feedback.

Best,
Holger

Original comment by openhol...@gmail.com on 20 Feb 2013 at 10:08

GoogleCodeExporter commented 8 years ago

Dear Holger,
Thanks for your explanation. Honestly, this is the first time we have come 
across such a use case, so it is very interesting from my perspective.
But nothing should stop you from doing the analysis for all 6 frames. Please 
feel free to use the attached Perl script, which takes your FASTA file and 
splits it into sub sequences. The output is a FASTA file with the same header 
lines, but the ID has "_p" and a number appended to indicate which sub sequence 
it is. We also added an option that allows you to specify a minimum ORF size.
The script is run as follows:
perl split.pl in_file (min_orf_size)

It prints the results to STDOUT.

>What's the behavior of interproscan at a translation stop symbol? Will it 
continue afterwards or just stop?

At the moment it fails, as you noticed, but in the new release it will stop 
immediately with a warning message that your sequences contain asterisks (*).

As InterProScan 5 relies on many external tools and binaries, we are 
dependent on their behaviour. For instance, many of our member databases use 
HMMER (http://hmmer.janelia.org/) for their functional analysis. We had a look 
at the HMMER output file for one of your sequences and noticed that it predicted 
domain alignments that contain an asterisk (*), which doesn't feel right.

Here is the example domain alignment from HMMER:
Alignments for each domain:
  == domain 1    score: 39.1 bits;  conditional E-value: 4.7e-14
              XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX RF
  2o02A00  40 ernlLsvayknvigarRaswriissieqkekgne...kkvklikeyrekiekeLskicedilelldkhLip 107
              +rn+L va    i++rR++wr++  i ++ k++    kk +l+k+yr+++e+eLskice++ ell++ L++
        1  16 KRNILIVA----INSRRENWRTLRYIDHRRKSQGgh*KKEELVKDYRKELETELSKICEEVKELLNRLLLE 82 
              67788776....789*************9955553366****************************99986 PP

If you need more help with this, please also consider contacting us via EBI 
Support & Feedback (http://www.ebi.ac.uk/support/index.php), which is 
definitely more confidential.

Kind Regards,
Maxim

Original comment by Maxim.Sc...@gmail.com on 21 Feb 2013 at 11:48

Changed state: Started

Attachments:

split.pl
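
For readers who cannot access the attachment, the splitting described in this comment can be sketched roughly as below. This is not the attached split.pl but a minimal Python sketch based only on the description above. The function names, the assumption that sequences are split at `*`, the exact `_p<N>` numbering (counting only the sub sequences that pass the filter) and the reading of the minimum ORF size as a length in residues are all assumptions.

```python
import sys


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_at_stops(header, sequence, min_orf_size=1):
    """Split one translated sequence at '*' and yield (new_header, sub_sequence).

    Assumption: the ID is the first whitespace-delimited token of the header,
    and sub sequences are numbered from 1, counting only those kept.
    """
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    rest = " " + parts[1] if len(parts) > 1 else ""
    number = 0
    for piece in sequence.split("*"):
        if len(piece) >= max(min_orf_size, 1):
            number += 1
            yield f"{seq_id}_p{number}{rest}", piece


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python split_sketch.py in_file [min_orf_size]
    min_size = int(sys.argv[2]) if len(sys.argv) > 2 else 1
    with open(sys.argv[1]) as handle:
        for hdr, seq in read_fasta(handle):
            for new_hdr, sub in split_at_stops(hdr, seq, min_size):
                print(f">{new_hdr}\n{sub}")
```

Like the script described above, the sketch writes its result to STDOUT, so the output can be redirected into a new FASTA file.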

GoogleCodeExporter commented 8 years ago

Somehow my comment got lost (even though I got a confirmation email). Here it is 
again:

Hi Maxim,

thank you for your quick reply.

> At the moment it fails as you noticed, but in the new release it will stop 
immediately with a warning message that your sequences contain asterisks (*).
This will be different compared to interproscan4, which predicts domains in 
sequences that contain translation stop symbols. I agree with you about your 
example: Domains that span over a stop-symbol are not meaningful. 

However, conceptually I always thought IPS would take any protein-sequence as 
input and try to annotate domains with respect to a local context irrespective 
of down/upstream elements like translation stops. IPS4 seemed to be consistent 
with this idea.

I would prefer it if IPS5 did not stop at * but just continued to predict (as 
IPS4 did), because what we do is back-translate the domain regions into 
nucleotide coordinates to annotate our contigs along with BLAST HSPs. This 
becomes much harder if we split the translated output into reading frames for 
running the domain prediction. For sure that's a very specific use case (which 
was easy to do with IPS4, as explained above), so I'm a little biased here. :-)

Best,
Holger

Original comment by holgerbr...@gmail.com on 25 Feb 2013 at 1:26

GoogleCodeExporter commented 8 years ago

Dear Holger,

Having looked at the InterProScan4 code, it does something very naive and just 
strips out all asterisks from the input sequence if it finds them.  I imagine 
that this is not the behaviour you would have assumed as, i) that concatenates 
all the reading frames together into a single peptide and ii) it would mess up 
the coordinates of the matches (meaning any mapping back you did to the 
nucleotide sequence would be wrong anyway).  So first of all, I apologise for 
the behaviour in InterProScan 4.

The most important, fundamental thing to remember about InterProScan is that it 
is  predicting *protein* families and domains.  We natively include a 6 frame 
translation tool for nucleotide sequences to make people's lives easier but 
InterPro's member databases all work on amino acid sequences, and so 
InterProScan expects valid amino acid sequences as input.  This means separate 
fasta headers for each peptide sequence, not a concatenation of peptides under 
a single header, separated by asterisks.  We can't just leave the asterisks in 
the sequence because some of the member databases (e.g. Superfamily) will fail 
if you try to search them.  So, we have had to make a decision about what types 
of sequence inputs to expect and support; this is why it would be very 
difficult to support the use case you outline.

Of course, now that InterProScan is an open source project, if you think you 
could contribute code to support your use case, you are welcome to do so.  

Please let me or my team know if you have any other comments or ideas about 
InterProScan.  

Best regards
Sarah

Original comment by sarahhun...@gmail.com on 1 Mar 2013 at 1:16
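
The coordinate problem described in point ii) of this comment can be illustrated with a small sketch. It assumes, as stated above, that the asterisks are simply removed before searching. The helper name `stripped_to_original` and the example sequence are made up for illustration and are not part of InterProScan.

```python
def stripped_to_original(sequence, position):
    """Map a 1-based position in the asterisk-stripped sequence back to the
    1-based position of the same residue in the original sequence."""
    seen = 0  # residues (non-asterisk characters) counted so far
    for index, char in enumerate(sequence, start=1):
        if char != "*":
            seen += 1
            if seen == position:
                return index
    raise ValueError("position lies beyond the stripped sequence")


original = "MKT*AAG*HIKL"
stripped = original.replace("*", "")  # "MKTAAGHIKL"

# A match reported at residues 7-10 of the stripped sequence ("HIKL")
# actually sits at residues 9-12 of the original: a shift of two
# positions, because two asterisks lie upstream of it.
start, end = 7, 10
print(stripped_to_original(original, start), stripped_to_original(original, end))
```

The shift for any match equals the number of asterisks upstream of it, so it grows with every additional stop symbol in the sequence.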

GoogleCodeExporter commented 8 years ago

Hi Sarah,

thank you very much for this enlightening information. So far we've indeed used 
IPS making some incorrect assumptions about its processing model. So it's good 
to have a clearer picture now. 

Now I also understand (and totally agree with) Maxim's idea that IPS5 should 
reject sequences containing translation stops to avoid incorrect predictions. 

Thank you very much for your support,
Best regards,
Holger

Original comment by holgerbr...@gmail.com on 4 Mar 2013 at 10:33

GoogleCodeExporter commented 8 years ago

Original comment by Maxim.Sc...@gmail.com on 4 Mar 2013 at 10:56

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

Hi Sarah,

sorry for bothering you with this issue again.

As you pointed out in your reply, IPS is stripping out asterisks, which shifts 
the downstream domain positions in the output. And depending on the number of 
upstream asterisks, the shift can become arbitrarily big. 

This did not just break my script initially; the IPS website itself 
reports incorrect domain positions for proteins containing stop codons, and IPS 
integrations into major bioinformatics applications (like Geneious) report 
incorrect domain positions as well. None of the major IPS tools seems to be 
aware of the issue.

So even though I fixed the problem for myself by splitting up the sequences into 
*-free protein chunks and restoring positions afterwards, it's likely that other 
people get incorrect results because of this well-hidden feature/bug.

Cheers,
Holger

Original comment by holgerbr...@gmail.com on 2 May 2013 at 9:07
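
The workaround mentioned in this comment, splitting into asterisk-free chunks and restoring positions afterwards, could look roughly like the sketch below. It is not the author's actual script; the function names, the example sequence and the choice to record each chunk's 0-based offset are assumptions made for illustration.

```python
def chunks_with_offsets(sequence):
    """Return (offset, chunk) pairs for every non-empty asterisk-free chunk.

    offset is the 0-based index of the chunk's first residue in the
    original sequence, asterisks included.
    """
    result, offset = [], 0
    for piece in sequence.split("*"):
        if piece:
            result.append((offset, piece))
        offset += len(piece) + 1  # +1 skips the asterisk that ended the piece
    return result


def restore(offset, start, end):
    """Convert 1-based match coordinates on a chunk to the original sequence."""
    return offset + start, offset + end


translated = "MKT*AAG**HIKL"
chunks = chunks_with_offsets(translated)

# A hypothetical match at residues 2-4 of the last chunk ("IKL"):
offset, chunk = chunks[-1]
print(restore(offset, 2, 4))
```

Because each chunk carries its own offset, the restored coordinates stay correct regardless of how many stop symbols precede the chunk.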

thiagomaframg / interproscan

Attempting to create a Gene3D match that has no alignment data #16