Hi Marc,
It looks like InterProScan 5 is failing to run and/or parse the bjobs command.
1. Which version of InterProScan 5 are you running?
2. What do you get when you run the following commands? Adjust them to suit
your setup; I am interested in the contents of testRunBjobs.log and
testRunLSFBjobs.log:
bjobs -P testRun > testRunBjobs.log
bsub -o testRunLSFBjobs.log bjobs -P testRun
3. When you change grid.name to 'other-cluster' and max.tier.depth to 1 in
your interproscan.properties file, do you still get problems?
grid.name=other-cluster
max.tier.depth=1
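For context, the two settings above live in interproscan.properties; a minimal fragment might look like this (the comments are my interpretation of what the settings do, based on the InterProScan documentation, not taken from this thread):

```properties
# Selects the cluster/submission-command template InterProScan uses
# (e.g. lsf, sge, or the generic 'other-cluster' fallback).
grid.name=other-cluster

# Limits how many tiers of workers may spawn further workers;
# 1 keeps everything in a single tier, which simplifies debugging.
max.tier.depth=1
```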
Regards,
Gift
Original comment by nuka....@gmail.com
on 21 Feb 2014 at 2:15
The change for this is done now.
Original comment by nuka....@gmail.com
on 21 Feb 2014 at 2:53
Original comment by Maxim.Sc...@gmail.com
on 21 Feb 2014 at 3:50
Hi,
thanks for looking into this.
- Version is 5.3.46; also redownloaded to verify that it is indeed the latest
version
- Ran it from within /home this time to make sure that there are no permission
problems of any kind
- Changed interproscan.properties as suggested
- Output when running the test data set attached
- Output from the two suggested commands below:
1) bjobs -P testRun > testRunBjobs.log
No job found in project testRun
-> output file testRunBjobs.log is empty
2) bsub -o testRunLSFBjobs.log bjobs -P testRun
Note: the actual user name was removed by me.
--
Job <14080> is submitted to default queue <normal>.
cat testRunLSFBjobs.log
Sender: LSF System <openlava@bnode-03>
Subject: Job 14081: <bjobs -P testRun> Exited
Job <bjobs -P testRun> was submitted from host <bhead> by user <removed>.
Job was executed on host(s) <bnode-03>, in queue <normal>, as user <removed>.
</home/removed> was used as the home directory.
</home/removed> was used as the working directory.
Started at Fri Feb 21 17:08:20 2014
Results reported at Fri Feb 21 17:08:21 2014
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
bjobs -P testRun
------------------------------------------------------------
Exited with exit code 255.
Resource usage summary:
CPU time : 0.02 sec.
Max Memory : 4 MB
Max Swap : 119 MB
Max Processes : 1
The output (if any) follows:
No job found in project testRun
Original comment by mphoepp...@gmail.com
on 21 Feb 2014 at 4:10
Attachments:
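An aside on the LSF report above: the "Exited with exit code 255" together with "No job found in project testRun" suggests that bjobs exits non-zero when no jobs match, rather than actually crashing. Any wrapper that treats a non-zero bjobs status as fatal would misreport an empty project as a failure. A minimal sketch of that distinction, using a hypothetical stub in place of bjobs (openlava is not assumed to be installed here):

```shell
# Hypothetical stub standing in for openlava's bjobs: when no jobs match,
# it prints "No job found in project <name>" to stderr and exits 255.
bjobs() { echo "No job found in project $2" >&2; return 255; }

status=0
bjobs -P testRun > testRunBjobs.log 2>&1 || status=$?

# Distinguish "empty project" (benign) from a real failure:
if [ "$status" -ne 0 ] && grep -q "No job found" testRunBjobs.log; then
    verdict="empty project, not an error"
else
    verdict="real failure or jobs listed"
fi
echo "$verdict"
```

Note that without the `2>&1` the log file stays empty, matching what was observed above, since the message goes to stderr (assuming the stub mirrors the real bjobs behaviour).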
Hi Marc,
It looks like some cluster property values are being ignored. I forgot to
include one command to run in the previous message.
Try to run the following three commands and let me know what you get (make the
necessary changes where needed):
bsub -P testRun -o testRun.log ./interproscan.sh -i test_proteins.fasta -f tsv -o output.tsv
bjobs -P testRun > testRunBjobs.log
bsub -o testRunLSFBjobs.log bjobs -P testRun
Can you run the 'bjobs' and 'bsub' commands from any machine on the cluster,
or are you restricted to a submission node?
Regards,
Gift
Original comment by nuka....@gmail.com
on 21 Feb 2014 at 5:15
Hi,
I thought there must be something missing. Log files attached and yes, I can
submit/query from any node.
The command above was not in cluster mode though, right? But I guess you are
after something else.
Regarding the cluster property values - any chance of finding out how the
program is trying to query those? The OpenLava installation was just done on
the basis of what I needed for my analyses, so there is a chance that some
optional parameters (as far as the functioning of the queueing system is
concerned) were omitted.
Original comment by mphoepp...@gmail.com
on 21 Feb 2014 at 6:16
Attachments:
After more testing, I have come to a point where the error message looks
somewhat different:
27/02/2014 08:15:58:929 29% completed
2014-02-27 08:16:41,895
[uk.ac.ebi.interpro.scan.jms.master.DistributedBlackBoxMaster:201] WARN -
StepInstance 4 is being re-run following a failure.
2014-02-27 08:16:41,897
[uk.ac.ebi.interpro.scan.jms.master.DistributedBlackBoxMaster:213] WARN -
StepInstance 4 (stepCoilsRunBinary) will be re-run in a high-memory worker.
When looking in the log files for the clusterrunid, the line that stands out
is:
2014-02-27 08:10:02,609
[org.apache.activemq.transport.failover.FailoverTransport:1026] ERROR - Failed
to connect to [tcp://bnode-03:29432] after: 5 attempt(s)
Now, the compute nodes are all behind a firewall, but should have open
communication between them. I can also, for example, download updates on these
nodes, since their traffic is forwarded by a gateway machine.
I also limited LSF submissions to one node, on which I disabled the firewall
completely. Same problem. Is InterProScan trying to establish some sort of TCP
connection to an outside machine?
Original comment by mphoepp...@gmail.com
on 27 Feb 2014 at 7:19
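On the TCP question above: the host and port in the failover log line (tcp://bnode-03:29432) point at the master's embedded ActiveMQ broker, which the workers connect back to over the cluster network, not at an outside machine. A quick way to verify reachability from a compute node is to probe that host and port directly; a sketch using bash's /dev/tcp redirection (the host and port are the ones from the log line above and should be adjusted to your own run; requires bash and the coreutils timeout command):

```shell
# Probe a TCP port; prints "open" or "unreachable".
check_port() {
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "open"
    else
        echo "unreachable"
    fi
}

# Run this from a compute node against the broker host/port from the log:
check_port bnode-03 29432
```

If this prints "unreachable" from the node where the worker runs, a firewall rule or routing issue between the nodes is the likely culprit rather than anything in InterProScan itself.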
Hi,
From the files you sent me, there was no apparent reason why InterProScan
should fail.
I possibly need to understand your setup more to get to why you are having
these problems.
Shall we deal with this interactively? Say we arrange to Skype tomorrow
between 11:00 and 12:00 UK time, or some time next week.
Regards,
Gift
Original comment by nuka....@gmail.com
on 27 Feb 2014 at 3:28
Original comment by Maxim.Sc...@gmail.com
on 15 May 2014 at 9:30
Hello,
I'm trying to set up InterProScan 5.11-51.0 on my SGE cluster. Even when
I try to run interproscan on a cluster with only a single master node (no
slaves) in cluster mode, I encounter the following errors:
21/04/2015 23:57:39:404 Welcome to InterProScan-5.11-51.0
21/04/2015 23:57:52:726 Running InterProScan v5 in DISTRIBUTED_WORKER mode...
2015-04-21 23:58:31,969
[uk.ac.ebi.interpro.scan.jms.activemq.JMSTransportListener:90] WARN - Transport
interrupted for > 10 min
2015-04-21 23:58:31,969
[uk.ac.ebi.interpro.scan.jms.activemq.JMSTransportListener:90] WARN - Transport
interrupted for > 10 min
2015-04-21 23:58:31,969
[uk.ac.ebi.interpro.scan.jms.activemq.JMSTransportListener:90] WARN - Transport
interrupted for > 10 min
2015-04-21 23:58:32,431
[org.apache.activemq.transport.failover.FailoverTransport:1026] ERROR - Failed
to connect to [tcp://master:30353] after: 5 attempt(s)
2015-04-21 23:58:32,433
[org.apache.activemq.transport.failover.FailoverTransport:1026] ERROR - Failed
to connect to [tcp://master:30353] after: 5 attempt(s)
2015-04-21 23:58:32,434
[org.apache.activemq.transport.failover.FailoverTransport:1026] ERROR - Failed
to connect to [tcp://master:30353] after: 5 attempt(s)
2015-04-21 23:58:32,435
[org.apache.activemq.transport.failover.FailoverTransport:1026] ERROR - Failed
to connect to [tcp://master:30353] after: 5 attempt(s)
2015-04-21 23:58:32,435 [org.apache.activemq.pool.PooledSession:122] WARN -
Caught exception trying close() when putting session back into the pool, will
invalidate. javax.jms.IllegalStateException: The Session is closed
Any ideas?
Original comment by brya...@gmail.com
on 22 Apr 2015 at 12:03
Hi,
You are most likely not starting InterProScan 5 correctly. Can you send us the
command line you use?
Gift
Original comment by nuka....@gmail.com
on 28 Apr 2015 at 1:23
Original issue reported on code.google.com by
mphoepp...@gmail.com
on 18 Feb 2014 at 2:16