From what I can see, InterProScan is not even trying to launch a qsub
command. The SGE logs stay empty...
Original comment by chirofr...@gmail.com
on 27 Nov 2014 at 5:00
Original comment by Maxim.Sc...@gmail.com
on 1 Dec 2014 at 10:45
Hi,
Could you please send us the qsub command adjustment you made? We have a
small SGE cluster on site where we can test such cases.
Have you tried to run the adjusted qsub command outside of InterProScan?
In terms of debugging the issue, could you please try to set the following 2
properties in your properties file:
verbose.log=true
verbose.log.level=4
This will activate the verbose log. A log level of 4 is probably the most
detailed log you can get.
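The two properties above can also be set from the shell. The following is a minimal sketch and not part of InterProScan: `set_prop` is a hypothetical helper that replaces an existing `key=value` line in a properties file, or appends the line if the key is missing. It assumes the value contains no `|` or `&` characters, which holds for `true` and `4`.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with InterProScan): set key=value in a
# Java-style properties file, replacing an existing entry or appending one.
set_prop() {
    file=$1
    key=$2
    value=$3
    # Escape dots so that "verbose.log" does not match "verbose-log".
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^${pattern}=" "$file"; then
        # Rewrite only the line that starts with exactly "key=".
        sed "s|^${pattern}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage, run from the directory that holds interproscan.properties:
#   set_prop interproscan.properties verbose.log true
#   set_prop interproscan.properties verbose.log.level 4
```

Because the match is anchored on `key=`, setting `verbose.log` leaves an existing `verbose.log.level` line untouched.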
Kind Regards,
Maxim
Original comment by Maxim.Sc...@gmail.com
on 2 Dec 2014 at 4:08
Following are the qsub config entries:
grid.master.submit.command=qsub -hard -q main,main2,main3 -l vf=2G,h_vmem=6G -pe make 4 -binding linear:4 -N job_iprscan5t1worker
grid.master.submit.high.memory.command=qsub -hard -q main,main2,main3 -l vf=2G,h_vmem=6G -pe make 4 -binding linear:4 -N job_iprscan5t1worker
#commands a worker uses to start new remote workers
grid.worker.submit.command=qsub -hard -q main,main2,main3 -l -l vf=8G,h_vmem=14G -pe make 4 -binding linear:4 -N job_iprscan5t2worker
grid.worker.submit.high.memory.command=qsub -hard -q main,main2,main3 -l vf=8G,h_vmem=14G -pe make 4 -binding linear:4 -N job_iprscan5t2worker
Also, I gave a default sge_request configuration with these settings:
-w w
-o /data/workspace/$USER/grid-jobs/logs/$JOB_NAME.o$JOB_ID
-e /data/workspace/$USER/grid-jobs/logs/$JOB_NAME.e$JOB_ID
-S /bin/bash
-j n
-cwd
-p -100
-m a
Original comment by chirofr...@gmail.com
on 2 Dec 2014 at 6:23
It seems you are not creating a job script file, but are directly feeding the
full command to qsub. For this to work, I need to add "-b y" to the qsub
command.
Will test this first...
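By default (`-b n`), qsub treats its first non-option argument as a job script, whereas `-b y` tells it to treat the argument as a binary or command line to execute directly. The sketch below only prints such a command line so it can be inspected before submitting. `worker_submit_cmd` is a hypothetical helper name, and the `-cwd -V -b y` flags are the ones used elsewhere in this thread.

```shell
#!/bin/sh
# Build (but do not run) a qsub line for a command that is not a job script.
#   -b y : treat the command as a binary/command line, not a script file
#   -cwd : run the job in the directory it was submitted from
#   -V   : export the submitter's environment to the job
worker_submit_cmd() {
    job_name=$1
    shift
    printf 'qsub -cwd -V -b y -N %s' "$job_name"
    printf ' %s' "$@"
    printf '\n'
}

# Usage (prints the line; paste it into a shell to actually submit):
#   worker_submit_cmd job_iprscan5t1worker java -jar interproscan-5.jar
```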
Original comment by chirofr...@gmail.com
on 2 Dec 2014 at 6:32
The verbose option is very helpful. After doing some testing,
I figured out that it does not work unless I am in the installation path of
InterProScan (otherwise it cannot find the interproscan.properties file and/or
the .jar file).
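One way to work around the working-directory dependency is to start the command from inside the installation directory without changing the caller's own directory. This is a minimal sketch: `run_in_dir` is a hypothetical helper, and the example path is the `--userdir` value that appears later in this thread. Note that relative input paths are then resolved against the installation directory, so absolute paths for input files are safer.

```shell
#!/bin/sh
# Run a command from inside a given directory. The subshell keeps the
# caller's working directory unchanged.
run_in_dir() {
    dir=$1
    shift
    ( cd "$dir" && "$@" )
}

# Usage (path and options as seen elsewhere in this thread):
#   run_in_dir /tools/bioinfo/app/interproscan-5.8 \
#       ./interproscan.sh -i /abs/path/test_proteins.fasta -mode cluster -crid test1
```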
Also I changed my specific options:
* Added: -cwd -V -b y
* Removed: -binding linear:4
Now I see jobs launching, but immediately quitting (no log files are created). I
get mails with this output:
failed before writing exit_status:shepherd exited with exit status 19: before
writing exit_status
Shepherd trace:
12/02/2014 20:22:54 [0:8802]: shepherd called with uid = 0, euid = 0
12/02/2014 20:22:54 [0:8802]: starting up 2011.11p1
12/02/2014 20:22:54 [0:8802]: setpgid(8802, 8802) returned 0
12/02/2014 20:22:54 [0:8802]: do_core_binding: "binding" parameter not found in
config file
12/02/2014 20:22:54 [0:8802]: no prolog script to start
12/02/2014 20:22:54 [0:8802]: /bin/true
12/02/2014 20:22:54 [0:8802]: /bin/true
12/02/2014 20:22:54 [0:8802]: parent: forked "pe_start" with pid 8804
12/02/2014 20:22:54 [0:8804]: child: starting son(pe_start, /bin/true, 0);
12/02/2014 20:22:54 [0:8802]: using signal delivery delay of 120 seconds
12/02/2014 20:22:54 [0:8802]: parent: pe_start-pid: 8804
12/02/2014 20:22:54 [0:8804]: pid=8804 pgrp=8804 sid=8804 old pgrp=8802
getlogin()=<no login set>
12/02/2014 20:22:54 [0:8804]: reading passwd information for user 'bbnof'
12/02/2014 20:22:54 [0:8804]: setting limits
12/02/2014 20:22:54 [0:8804]: setting environment
12/02/2014 20:22:54 [0:8804]: Initializing error file
12/02/2014 20:22:54 [0:8804]: switching to intermediate/target user
12/02/2014 20:22:54 [1289:8804]: closing all filedescriptors
12/02/2014 20:22:54 [1289:8804]: further messages are in "error" and "trace"
12/02/2014 20:22:54 [1289:8804]: using "/bin/bash" as shell of user "bbnof"
12/02/2014 20:22:54 [1289:8804]: now running with uid=1289, euid=1289
12/02/2014 20:22:54 [1289:8804]: execvp(/bin/true, "/bin/true")
12/02/2014 20:22:54 [0:8802]: wait3 returned 8804 (status: 0; WIFSIGNALED: 0,
WIFEXITED: 1, WEXITSTATUS: 0)
12/02/2014 20:22:54 [0:8802]: pe_start exited with exit status 0
12/02/2014 20:22:54 [0:8802]: reaped "pe_start" with pid 8804
12/02/2014 20:22:54 [0:8802]: pe_start exited not due to signal
12/02/2014 20:22:54 [0:8802]: pe_start exited with status 0
Shepherd pe_hostfile:
<nodename> 4 <queue>@<nodename> UNDEFINED
I removed some lines from above output, also removed nodenames/queues.
The verbose log of Interproscan says nothing...
When I launch the qsub script manually, the job keeps running, but there is no
master process => so lots of errors in the error log.
Original comment by chirofr...@gmail.com
on 2 Dec 2014 at 7:25
Hi
I am unable to find anything more without more debugging at the SGE level. So I
would like to enable SGE_DEBUG_LEVEL and capture the full output of the qsub
command.
Via InterProScan I do not see any output of qsub. Is there a way of printing
this output to stderr/stdout (or a file)?
This would help me a lot!
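One possible workaround, which is not an InterProScan feature, is to put a small wrapper script named `qsub` ahead of the real one in `PATH` and have it log everything. The sketch below shows the logging part. `log_run` is a hypothetical helper, and whether InterProScan resolves `qsub` through `PATH` is an assumption that would need checking. Stderr is merged into stdout so that both end up in the log.

```shell
#!/bin/sh
# Run a command, append its combined stdout/stderr and its exit status to a
# log file, pass the output through, and return the original exit status.
log_run() {
    log=$1
    shift
    printf 'running: %s\n' "$*" >> "$log"
    out=$("$@" 2>&1)
    status=$?
    if [ -n "$out" ]; then
        printf '%s\n' "$out" >> "$log"
        printf '%s\n' "$out"
    fi
    printf 'exit status: %s\n' "$status" >> "$log"
    return "$status"
}

# Usage as the body of a wrapper script called "qsub" (placeholder paths):
#   log_run /tmp/qsub-debug.log /path/to/real/qsub "$@"
# SGE_DEBUG_LEVEL can be exported before this call to get SGE's own tracing.
```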
Thanks
Filip
Original comment by chirofr...@gmail.com
on 9 Dec 2014 at 4:31
Hi,
Is anyone able to assist me with this? We are blocked right now...
Regards,
Filip Nollet
Original comment by chirofr...@gmail.com
on 7 Jan 2015 at 12:57
Hi,
In a previous message my colleague suggested you change the following
properties in the interproscan.properties file:
verbose.log=true
verbose.log.level=4
Can you send us the output from the main InterProScan process after you make
these changes?
What kind of cluster setup do you have? Is the submission node on the same
network as the cluster computing nodes or are the cluster nodes behind a
firewall?
Regards,
Gift
Original comment by nuka....@gmail.com
on 8 Jan 2015 at 10:41
Hi Filip,
I have run InterProScan on our SGE cluster with verbose output turned on as
described above and am able to get the command that is submitted to the
cluster. I also don't get any errors and the jobs complete successfully.
The command I use to start the master interproscan process is:
qsub -cwd -V -b y -N testsge001 -o masterlogs-testsge001.log.out -e
masterlogs-testsge001.log.err interproscan.sh -i test_proteins.fasta -dp -b
testsge001 -f tsv,xml -mode cluster -crid test1
Let us know what you get as the output from the main InterProScan process.
Regards,
Gift
Original comment by nuka....@gmail.com
on 12 Jan 2015 at 4:43
Hi
This is the full output of InterProScan. As you will be able to see, the qsub
commands pass, but every job is aborted immediately (see the error I get by mail
in my previous comments).
Original comment by chirofr...@gmail.com
on 14 Jan 2015 at 10:49
Attachments:
Question: why are you running a command to start the "master" InterProScan
process?
I do not see any such command in the output I attached. I only see
worker commands being launched.
About my cluster setup:
* All RedHat Enterprise 6.6 64 bit
* All exec nodes are also submission nodes
* Login node is of course also submission node
* There are no firewall restrictions between any of the login/exec nodes
Original comment by chirofr...@gmail.com
on 14 Jan 2015 at 10:53
OK, I understand the master job thing. You are launching the master process as
a separate job in the cluster. We run this process directly on the login node
itself.
Original comment by chirofr...@gmail.com
on 14 Jan 2015 at 10:56
The error you get is coming from SGE before it launches a new InterProScan
worker job. I can see the qsub command that your master is spawning. Can you
try to run that command and see what result you get (in
logs/test_bbnof/test_bbnof_976089489_nw_00.out.0)? That is, run the
command on the line that starts with 'command to submit to cluster', something like:
qsub -cwd -V -b y -N job_iprscan5t1worker -q main@biogridn9 -pe make 4 -o
logs/test_bbnof/test_bbnof_976089489_nw_00.out.0 -e
logs/test_bbnof/test_bbnof_976089489_nw_00.err.0 java -Xms512m -Xmx2048m -jar
interproscan-5.jar --mode=distributed_worker --priority=4
--masteruri=tcp://gquest.be.bayercropscience:64171
--tempdirname=gquest_20150114_114606190_d6sp
--userdir=/tools/bioinfo/app/interproscan-5.8 --tier1=1
--clusterrunid=test_bbnof --mastermaxlife=1421232386192:21562457
I submit the master process to the cluster, but running the master process
directly should not behave differently with the cluster setup you have.
Regards,
Gift
Original comment by nuka....@gmail.com
on 14 Jan 2015 at 11:19
Hi
I tried what you said. I launched the InterProScan run and waited for the
"worker" command to be displayed in the verbose output. When I take this command
and launch it manually, the job seems to run and finish fine!
I guess there is something wrong with the environment in which the Java process
launches this job? No idea why it fails there/then.
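To test the environment hypothesis, the environment of the running Java process can be compared with that of the interactive shell where the manual qsub succeeds. The helpers below are a hypothetical sketch. `proc_env` relies on Linux's `/proc/<pid>/environ`, which should be available on the RedHat nodes described earlier, and the PID of the Java master process has to be looked up separately.

```shell
#!/bin/sh
# Write the current shell's environment, sorted, to a file.
snapshot_env() {
    env | sort > "$1"
}

# Print the environment a running process was started with (Linux only).
proc_env() {
    tr '\0' '\n' < "/proc/$1/environ" | sort
}

# Show the variables that differ between two snapshots.
# Exit status follows diff: 0 if identical, 1 if they differ.
env_diff() {
    diff "$1" "$2"
}

# Usage (replace 12345 with the PID of the Java master process):
#   snapshot_env /tmp/shell.env
#   proc_env 12345 > /tmp/java.env
#   env_diff /tmp/shell.env /tmp/java.env
```

Differences in variables such as `SGE_ROOT`, `SGE_CELL`, or `PATH` would be the first things to look at.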
Original comment by chirofr...@gmail.com
on 14 Jan 2015 at 1:38
At the moment, I would advise you to run InterProScan in the default 'black box'
mode. If you have large files to analyse, you can chunk them into smaller
files.
There is not enough info to know what might be happening, whether it's the Java
or SGE environment or something else. In the next release, we will add more
debug output at the point the qsub command is launched, so that we can capture
more info than just the exit status of the qsub command.
Regards,
Gift
Original comment by nuka....@gmail.com
on 14 Jan 2015 at 2:44
OK, I will pass the message on.
Just one last question: when can this new version be expected?
Original comment by chirofr...@gmail.com
on 14 Jan 2015 at 2:56
How small should the chunks of files be?
Original comment by stefanie...@gmail.com
on 15 Jan 2015 at 11:06
On a 16-core machine with the following properties set, we use 5000
sequences per chunk:
#number of embedded workers in a remote worker
worker.number.of.embedded.workers=1
worker.maxnumber.of.embedded.workers=4
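Chunking a FASTA file into pieces of a fixed number of sequences can be done with awk. This is a minimal sketch, not an InterProScan tool: `split_fasta` and the `chunk_0001.fasta` naming scheme are hypothetical, and the input is assumed to be plain FASTA with each record starting on a `>` line.

```shell
#!/bin/sh
# Split a FASTA file into chunks of at most N sequences each.
# Output files are named <prefix>_0001.fasta, <prefix>_0002.fasta, ...
split_fasta() {
    input=$1
    per_chunk=$2
    prefix=$3
    awk -v n="$per_chunk" -v prefix="$prefix" '
        /^>/ {
            # Start a new output file every n header lines.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%04d.fasta", prefix, ++chunk)
            }
            count++
        }
        # Lines before the first header are skipped.
        out != "" { print > out }
    ' "$input"
}

# Usage, with the chunk size mentioned above:
#   split_fasta proteins.fasta 5000 proteins_chunk
```

Each chunk can then be passed to a separate InterProScan run with `-i`.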
Cheers,
Gift
Original comment by nuka....@gmail.com
on 20 Jan 2015 at 10:02
Original issue reported on code.google.com by
chirofr...@gmail.com
on 27 Nov 2014 at 4:59