Hi Marc,
It looks like InterProScan 5 is failing to run and/or parse the bjobs command.
1. Which version of InterProScan 5 are you running?
2. What do you get when you run the following commands? Adjust them to suit
your setup; I am interested in the contents of testRunBjobs.log and
testRunLSFBjobs.log:
bjobs -P testRun > testRunBjobs.log
bsub -o testRunLSFBjobs.log bjobs -P testRun
3. When you change grid.name to 'other-cluster' and max.tier.depth to 1 in
your interproscan.properties file, do you still get problems?
grid.name=other-cluster
max.tier.depth=1
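For context, the two settings above live in interproscan.properties; a minimal fragment might look like this (the comments are my interpretation of what the settings do, based on the InterProScan documentation, not taken from this thread):

```properties
# Selects the cluster/submission-command template InterProScan uses
# (e.g. lsf, sge, or the generic 'other-cluster' fallback).
grid.name=other-cluster

# Limits how many tiers of workers may spawn further workers;
# 1 keeps everything in a single tier, which simplifies debugging.
max.tier.depth=1
```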
Regards,
Gift
Original comment by nuka....@gmail.com
on 21 Feb 2014 at 2:15
The change for this is done now.
Original comment by nuka....@gmail.com
on 21 Feb 2014 at 2:53
Original comment by Maxim.Sc...@gmail.com
on 21 Feb 2014 at 3:50
Hi,
thanks for looking into this.
- Version is 5.3.46; also redownloaded to verify that it is indeed the latest
version
- Ran it from within /home this time to make sure that there are no permission
problems of any kind
- Changed interproscan.properties as suggested
- Output when running the test data set attached
- Output from the two suggested commands below:
1) bjobs -P testRun > testRunBjobs.log
No job found in project testRun
-> output file testRunBjobs.log is empty
2) bsub -o testRunLSFBjobs.log bjobs -P testRun
Note: the actual user name was removed by me.
--
Job <14080> is submitted to default queue <normal>.
cat testRunLSFBjobs.log
Sender: LSF System <openlava@bnode-03>
Subject: Job 14081: <bjobs -P testRun> Exited
Job <bjobs -P testRun> was submitted from host <bhead> by user <removed>.
Job was executed on host(s) <bnode-03>, in queue <normal>, as user <removed>.
</home/removed> was used as the home directory.
</home/removed> was used as the working directory.
Started at Fri Feb 21 17:08:20 2014
Results reported at Fri Feb 21 17:08:21 2014
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
bjobs -P testRun
------------------------------------------------------------
Exited with exit code 255.
Resource usage summary:
CPU time : 0.02 sec.
Max Memory : 4 MB
Max Swap : 119 MB
Max Processes : 1
The output (if any) follows:
No job found in project testRun
Original comment by mphoepp...@gmail.com
on 21 Feb 2014 at 4:10
Attachments:
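An aside on the LSF report above: the "Exited with exit code 255" together with "No job found in project testRun" suggests that bjobs exits non-zero when no jobs match, rather than actually crashing. Any wrapper that treats a non-zero bjobs status as fatal would misreport an empty project as a failure. A minimal sketch of that distinction, using a hypothetical stub in place of bjobs (openlava is not assumed to be installed here):

```shell
# Hypothetical stub standing in for openlava's bjobs: when no jobs match,
# it prints "No job found in project <name>" to stderr and exits 255.
bjobs() { echo "No job found in project $2" >&2; return 255; }

status=0
bjobs -P testRun > testRunBjobs.log 2>&1 || status=$?

# Distinguish "empty project" (benign) from a real failure:
if [ "$status" -ne 0 ] && grep -q "No job found" testRunBjobs.log; then
    verdict="empty project, not an error"
else
    verdict="real failure or jobs listed"
fi
echo "$verdict"
```

Note that without the `2>&1` the log file stays empty, matching what was observed above, since the message goes to stderr (assuming the stub mirrors the real bjobs behaviour).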
Hi Marc,
It looks like some cluster property values are being ignored. I forgot to
include one command to run in the previous message.
Try to run the following three commands and let me know what you get (make the
necessary changes where needed):
bsub -P testRun -o testRun.log ./interproscan.sh -i test_proteins.fasta -f tsv -o output.tsv
bjobs -P testRun > testRunBjobs.log
bsub -o testRunLSFBjobs.log bjobs -P testRun
Can you run the 'bjobs' and 'bsub' commands from any machine on the cluster,
or are you restricted to a submission node?
Regards,
Gift
Original comment by nuka....@gmail.com
on 21 Feb 2014 at 5:15
Hi,
I thought there must be something missing. Log files attached and yes, I can
submit/query from any node.
The command above was not in cluster mode though, right? But I guess you are
after something else.
Regarding the cluster property values - any chance of finding out how the
program is trying to query those? The OpenLava installation was just done on
the basis of what I needed for my analyses, so there is a chance that some
optional parameters (as far as the functioning of the queueing system is
concerned) were omitted.
Original comment by mphoepp...@gmail.com
on 21 Feb 2014 at 6:16
Attachments:
After more testing, I have come to a point where the error message looks
somewhat different:
27/02/2014 08:15:58:929 29% completed
2014-02-27 08:16:41,895
[uk.ac.ebi.interpro.scan.jms.master.DistributedBlackBoxMaster:201] WARN -
StepInstance 4 is being re-run following a failure.
2014-02-27 08:16:41,897
[uk.ac.ebi.interpro.scan.jms.master.DistributedBlackBoxMaster:213] WARN -
StepInstance 4 (stepCoilsRunBinary) will be re-run in a high-memory worker.
When looking in the log files for the clusterrunid, the line that stands out
is:
2014-02-27 08:10:02,609
[org.apache.activemq.transport.failover.FailoverTransport:1026] ERROR - Failed
to connect to [tcp://bnode-03:29432] after: 5 attempt(s)
Now, the compute nodes are all behind a firewall, but should have open
communication between them. I can also, for example, download updates on these
nodes, since their traffic is forwarded by a gateway machine.
I also limited LSF submissions to one node, on which I disabled the firewall
completely. Same problem. Is InterProScan trying to establish some sort of TCP
connection to an outside machine?
Original comment by mphoepp...@gmail.com
on 27 Feb 2014 at 7:19
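On the TCP question above: the host and port in the failover log line (tcp://bnode-03:29432) point at the master's embedded ActiveMQ broker, which the workers connect back to over the cluster network, not at an outside machine. A quick way to verify reachability from a compute node is to probe that host and port directly; a sketch using bash's /dev/tcp redirection (the host and port are the ones from the log line above and should be adjusted to your own run; requires bash and the coreutils timeout command):

```shell
# Probe a TCP port; prints "open" or "unreachable".
check_port() {
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "open"
    else
        echo "unreachable"
    fi
}

# Run this from a compute node against the broker host/port from the log:
check_port bnode-03 29432
```

If this prints "unreachable" from the node where the worker runs, a firewall rule or routing issue between the nodes is the likely culprit rather than anything in InterProScan itself.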
Hi,
From the files you sent me, there was no apparent reason why InterProScan
should fail.
I possibly need to understand your setup more to get to why you are having
these problems.
Shall we deal with this interactively? Say we arrange to Skype tomorrow
between 11:00 and 12:00 UK time, or some time next week.
Regards,
Gift
Original comment by nuka....@gmail.com
on 27 Feb 2014 at 3:28
Original comment by Maxim.Sc...@gmail.com
on 15 May 2014 at 9:30
Hello,
I'm trying to set up InterProScan 5.11-51.0 on my SGE cluster. Even when
I try to run interproscan on a cluster with only a single master node (no
slaves) in cluster mode, I encounter the following errors:
21/04/2015 23:57:39:404 Welcome to InterProScan-5.11-51.0
21/04/2015 23:57:52:726 Running InterProScan v5 in DISTRIBUTED_WORKER mode...
2015-04-21 23:58:31,969
[uk.ac.ebi.interpro.scan.jms.activemq.JMSTransportListener:90] WARN - Transport
interrupted for > 10 min
2015-04-21 23:58:31,969
[uk.ac.ebi.interpro.scan.jms.activemq.JMSTransportListener:90] WARN - Transport
interrupted for > 10 min
2015-04-21 23:58:31,969
[uk.ac.ebi.interpro.scan.jms.activemq.JMSTransportListener:90] WARN - Transport
interrupted for > 10 min
2015-04-21 23:58:32,431
[org.apache.activemq.transport.failover.FailoverTransport:1026] ERROR - Failed
to connect to [tcp://master:30353] after: 5 attempt(s)
2015-04-21 23:58:32,433
[org.apache.activemq.transport.failover.FailoverTransport:1026] ERROR - Failed
to connect to [tcp://master:30353] after: 5 attempt(s)
2015-04-21 23:58:32,434
[org.apache.activemq.transport.failover.FailoverTransport:1026] ERROR - Failed
to connect to [tcp://master:30353] after: 5 attempt(s)
2015-04-21 23:58:32,435
[org.apache.activemq.transport.failover.FailoverTransport:1026] ERROR - Failed
to connect to [tcp://master:30353] after: 5 attempt(s)
2015-04-21 23:58:32,435 [org.apache.activemq.pool.PooledSession:122] WARN -
Caught exception trying close() when putting session back into the pool, will
invalidate. javax.jms.IllegalStateException: The Session is closed
Any ideas?
Original comment by brya...@gmail.com
on 22 Apr 2015 at 12:03
Hi,
You are most likely not starting InterProScan 5 correctly. Can you send us the
command line you use?
Gift
Original comment by nuka....@gmail.com
on 28 Apr 2015 at 1:23
Original issue reported on code.google.com by
mphoepp...@gmail.com
on 18 Feb 2014 at 2:16