[interhelp #31898] SGE submission does not happen

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Deploy InterproScan 5.8
2. Adjust the qsub commands (following the faq)
3. Run...

What is the expected output? What do you see instead?
When the master is allowed to run commands, I see these being launched and the 
analysis finishes ok.
When disabling this feature (so working via SGE only), no jobs are launched and 
no errors are given. The job just hangs. 
Console output:
27/11/2014 17:53:28:425 Welcome to InterProScan-5.8-49.0
The Project/Cluster Run ID for this run is: iprscan_bbnof
27/11/2014 17:53:38:521 Running InterProScan v5 in CLUSTER mode...
Loading file /home/bbnof/workspace/personal/SOFT-563/sub.1000_plant_lorf.fasta
27/11/2014 17:53:44:927 Running the following analyses:
[jobTIGRFAM-13.0,jobProDom-2006.1,jobPIRSF-2.84,jobPfamA-27.0,jobPrositeProfiles-20.105,jobSMART-6.2,jobHAMAP-201311.27,jobPrositePatterns-20.105,jobPRINTS-42.0,jobSuperFamily-1.75,jobCoils-2.2,jobGene3d-3.5.0]
Pre-calculated match lookup service DISABLED.  Please wait for match 
calculations to complete...

What version of the product are you using? On what operating system?
InterProScan 5.8-49
RedHat 6.x (64 bit)
Java 6 or 7 (either fail)

Please provide any additional information below.
It would be very helpful to have a debug functionality or even some more 
logging so we can see if/where this is failing/stops working.

Original issue reported on code.google.com by chirofr...@gmail.com on 27 Nov 2014 at 4:59

GoogleCodeExporter commented 9 years ago

From what I can see, it seems InterProscan is not even trying to launch a qsub 
command. SGE logs stay empty...

Original comment by chirofr...@gmail.com on 27 Nov 2014 at 5:00

GoogleCodeExporter commented 9 years ago

Original comment by Maxim.Sc...@gmail.com on 1 Dec 2014 at 10:45

Changed title: [interhelp #31898] SGE submission does not happen

GoogleCodeExporter commented 9 years ago

Hi,
Could you please send us the qsub command adjustment you made? We do have a 
small SGE cluster on site where we can test such cases.

Have you tried to run the qsub adjustment outside of InterProScan?
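
One way to try the submit command outside InterProScan is to read it back from 
the properties file and hand it a trivial job. The following is a minimal 
sketch, assuming the property sits on a single line of interproscan.properties; 
the helper name get_prop is hypothetical, and /bin/hostname is only a harmless 
test payload:

```shell
# Print the value of one key from a Java-style properties file.
# Assumes "key=value" on a single line (no continuation lines).
get_prop() {
    # $1 = properties file, $2 = key
    awk -v key="$2" '
        index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }
    ' "$1"
}

# Example (not executed here): submit a trivial binary with the same options.
#   submit_cmd=$(get_prop interproscan.properties grid.master.submit.command)
#   $submit_cmd -b y /bin/hostname
```

If that test job fails in the same way, the problem is likely in the qsub 
options or the SGE setup rather than in InterProScan.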

In terms of debugging the issue, could you please try to set the following 2 
properties in your properties file:

verbose.log=true
verbose.log.level=4

This will activate the verbose log; a log level of 4 is probably the most 
detailed log you can get.
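
For repeated test runs it can be handy to switch these two properties from a 
script instead of editing the file by hand. A minimal sketch, assuming plain 
"key=value" lines; set_prop is a hypothetical helper, not part of InterProScan:

```shell
# Set key=value in a properties file: replace the existing line for the key,
# or append a new line if the key is absent. Writes through a temp file.
set_prop() {
    # $1 = properties file, $2 = key, $3 = value
    tmp="$1.tmp.$$"
    awk -v key="$2" -v val="$3" '
        index($0, key "=") == 1 { print key "=" val; done = 1; next }
        { print }
        END { if (!done) print key "=" val }
    ' "$1" > "$tmp" && mv "$tmp" "$1"
}

# Example (not executed here):
#   set_prop interproscan.properties verbose.log true
#   set_prop interproscan.properties verbose.log.level 4
```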

Kind Regards,
Maxim

Original comment by Maxim.Sc...@gmail.com on 2 Dec 2014 at 4:08

GoogleCodeExporter commented 9 years ago

Following are the qsub config entries: 
grid.master.submit.command=qsub -hard -q main,main2,main3 -l vf=2G,h_vmem=6G 
-pe make 4 -binding linear:4 -N job_iprscan5t1worker
grid.master.submit.high.memory.command=qsub -hard -q main,main2,main3 -l 
vf=2G,h_vmem=6G -pe make 4 -binding linear:4  -N job_iprscan5t1worker

#commands a worker uses to start new remote workers
grid.worker.submit.command=qsub -hard -q main,main2,main3 -l -l 
vf=8G,h_vmem=14G -pe make 4 -binding linear:4  -N job_iprscan5t2worker
grid.worker.submit.high.memory.command=qsub -hard -q main,main2,main3 -l 
vf=8G,h_vmem=14G -pe make 4 -binding linear:4  -N job_iprscan5t2worker

Also, I gave a default sge_request configuration with these settings:
-w w
-o /data/workspace/$USER/grid-jobs/logs/$JOB_NAME.o$JOB_ID
-e /data/workspace/$USER/grid-jobs/logs/$JOB_NAME.e$JOB_ID
-S /bin/bash
-j n
-cwd
-p -100
-m a

Original comment by chirofr...@gmail.com on 2 Dec 2014 at 6:23

GoogleCodeExporter commented 9 years ago

It seems you are not creating a job script file, but are directly feeding the 
full command to qsub. For this I need to add the "-b y" to qsub to make this 
work. 
Will test this first...

Original comment by chirofr...@gmail.com on 2 Dec 2014 at 6:32

GoogleCodeExporter commented 9 years ago

The verbose option is very helpful. After doing some testing, I figured out 
that it does not work unless I am "in the installation path" of InterproScan 
(otherwise it does not find the interproscan.properties file and/or the .jar 
file).
Also I changed my specific options:
* Added:  -cwd -V -b y 
* Removed: -binding linear:4
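
As a sketch of what this change does to one of the submit commands quoted 
earlier: -cwd runs the job in the current working directory, -V exports the 
submitting environment to the job, and -b y tells qsub that the command is a 
binary rather than a job script. The helper name adjust_qsub_opts and the 
position of the added flags are assumptions; the thread does not show the 
final property line.

```shell
# Rewrite a qsub option string as described above: drop "-binding linear:4"
# and add "-cwd -V -b y" right after "qsub", then collapse repeated spaces.
adjust_qsub_opts() {
    # $1 = original submit command
    printf '%s\n' "$1" |
        sed -e 's/ *-binding linear:4//' \
            -e 's/^qsub /qsub -cwd -V -b y /' \
            -e 's/  */ /g'
}

# Example (not executed here):
#   adjust_qsub_opts 'qsub -hard -q main -pe make 4 -binding linear:4 -N job'
```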

Now I see jobs launching, but immediately quitting (no log files are created). 
I get mails with this output:

failed before writing exit_status:shepherd exited with exit status 19: before 
writing exit_status
Shepherd trace:
12/02/2014 20:22:54 [0:8802]: shepherd called with uid = 0, euid = 0
12/02/2014 20:22:54 [0:8802]: starting up 2011.11p1
12/02/2014 20:22:54 [0:8802]: setpgid(8802, 8802) returned 0
12/02/2014 20:22:54 [0:8802]: do_core_binding: "binding" parameter not found in 
config file
12/02/2014 20:22:54 [0:8802]: no prolog script to start
12/02/2014 20:22:54 [0:8802]: /bin/true
12/02/2014 20:22:54 [0:8802]: /bin/true
12/02/2014 20:22:54 [0:8802]: parent: forked "pe_start" with pid 8804
12/02/2014 20:22:54 [0:8804]: child: starting son(pe_start, /bin/true, 0);
12/02/2014 20:22:54 [0:8802]: using signal delivery delay of 120 seconds
12/02/2014 20:22:54 [0:8802]: parent: pe_start-pid: 8804
12/02/2014 20:22:54 [0:8804]: pid=8804 pgrp=8804 sid=8804 old pgrp=8802 
getlogin()=<no login set>
12/02/2014 20:22:54 [0:8804]: reading passwd information for user 'bbnof'
12/02/2014 20:22:54 [0:8804]: setting limits
12/02/2014 20:22:54 [0:8804]: setting environment
12/02/2014 20:22:54 [0:8804]: Initializing error file
12/02/2014 20:22:54 [0:8804]: switching to intermediate/target user
12/02/2014 20:22:54 [1289:8804]: closing all filedescriptors
12/02/2014 20:22:54 [1289:8804]: further messages are in "error" and "trace"
12/02/2014 20:22:54 [1289:8804]: using "/bin/bash" as shell of user "bbnof"
12/02/2014 20:22:54 [1289:8804]: now running with uid=1289, euid=1289
12/02/2014 20:22:54 [1289:8804]: execvp(/bin/true, "/bin/true")
12/02/2014 20:22:54 [0:8802]: wait3 returned 8804 (status: 0; WIFSIGNALED: 0,  
WIFEXITED: 1, WEXITSTATUS: 0)
12/02/2014 20:22:54 [0:8802]: pe_start exited with exit status 0
12/02/2014 20:22:54 [0:8802]: reaped "pe_start" with pid 8804
12/02/2014 20:22:54 [0:8802]: pe_start exited not due to signal
12/02/2014 20:22:54 [0:8802]: pe_start exited with status 0

Shepherd pe_hostfile:
<nodename> 4 <queue>@<nodename> UNDEFINED

I removed some lines from above output, also removed nodenames/queues.

The verbose log of Interproscan says nothing...

When I launch the qsub script manually, the job keeps running, but there is no 
master process => so lots of errors in the error log.

Original comment by chirofr...@gmail.com on 2 Dec 2014 at 7:25

GoogleCodeExporter commented 9 years ago

Hi

I am unable to find anything more without more debugging at the SGE level. So I 
would like to enable SGE_DEBUG_LEVEL and capture the full output of the qsub 
command.

Via InterproScan I do not see any output of qsub. Is there a way of printing 
this output on stderr/stdout (or a file)? 
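
One possible workaround is to put a small logging wrapper in front of qsub. 
This is a sketch only: it assumes InterProScan simply executes whatever the 
submit property contains, which the thread does not confirm, and log_run is a 
hypothetical helper. SGE_DEBUG_LEVEL would have to be exported beforehand with 
a value suitable for the local SGE installation (not shown here).

```shell
# Run a command, record the command line, its combined stdout/stderr and its
# exit status in a log file, and still pass output and status to the caller.
log_run() {
    # $1 = log file, remaining arguments = command to run
    log="$1"; shift
    out=$(mktemp) || return 1
    printf 'CMD: %s\n' "$*" >> "$log"
    "$@" > "$out" 2>&1          # stderr is merged into stdout here
    status=$?
    cat "$out" >> "$log"
    printf 'EXIT: %s\n' "$status" >> "$log"
    cat "$out"                  # the caller still sees the output
    rm -f "$out"
    return "$status"
}

# Example (not executed here):
#   log_run /tmp/qsub_debug.log qsub -cwd -V -b y /bin/hostname
```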

This would help me a lot!

Thanks
Filip

Original comment by chirofr...@gmail.com on 9 Dec 2014 at 4:31

GoogleCodeExporter commented 9 years ago

Hi, 

Is anyone able to assist me with this? We are blocked right now...

Regards,
Filip Nollet

Original comment by chirofr...@gmail.com on 7 Jan 2015 at 12:57

GoogleCodeExporter commented 9 years ago

Hi,

In a previous message my colleague suggested you change the following 
properties in the interproscan.properties
verbose.log=true
verbose.log.level=4

Can you send us the output from the main interproscan process after you make 
these changes. 

What kind of cluster setup do you have? Is the submission node on the same 
network as the cluster computing nodes or are the cluster nodes behind a 
firewall?
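
Workers are started with a --masteruri=tcp://host:port argument (visible in the 
verbose output), so one thing that can be ruled out from a compute node is 
whether that port is reachable. A minimal sketch; the function names are 
hypothetical, and it assumes an nc (netcat) that supports -z and -w:

```shell
# Split a master URI of the form tcp://host:port into "host port".
parse_master_uri() {
    # $1 = URI as passed to --masteruri
    rest=${1#tcp://}
    printf '%s %s\n' "${rest%:*}" "${rest##*:}"
}

# Try to open a TCP connection to the master from the current node.
# Returns 0 if the port is reachable within 5 seconds.
can_reach_master() {
    set -- $(parse_master_uri "$1")
    nc -z -w 5 "$1" "$2"
}

# Example (not executed here), run on a compute node while the master is up:
#   can_reach_master tcp://masterhost:64171 && echo reachable
```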

Regards,
Gift

Original comment by nuka....@gmail.com on 8 Jan 2015 at 10:41

GoogleCodeExporter commented 9 years ago

Hi Filip,

I have run Interproscan on our SGE cluster with verbose output turned on as 
described above and am able to get the command that is submitted to the 
cluster. I also don't get any errors and the job completes successfully.

The command I use to start the master interproscan process is:
qsub -cwd -V -b y -N testsge001 -o masterlogs-testsge001.log.out -e 
masterlogs-testsge001.log.err interproscan.sh -i test_proteins.fasta -dp -b 
testsge001 -f tsv,xml -mode cluster -crid test1

Let us know what you get as the output from the main interproscan process.

Regards,
Gift

Original comment by nuka....@gmail.com on 12 Jan 2015 at 4:43

GoogleCodeExporter commented 9 years ago

Hi

This is the full output of interproscan. As you will be able to see, the qsub 
commands pass, but every job is aborted immediately (see the error I get by 
mail in my previous comments).

Original comment by chirofr...@gmail.com on 14 Jan 2015 at 10:49

Attachments:

iprscan_debug_output.txt

GoogleCodeExporter commented 9 years ago

Question: why are you running a command to start the "master" interproscan 
process? 
I do not see any such command in the default output I attached. I only see 
worker commands being launched.

About my cluster setup:
* All RedHat Enterprise 6.6 64 bit 
* All exec nodes are also submission nodes 
* Login node is of course also submission node
* There are no firewall restriction between any of the login/exec nodes

Original comment by chirofr...@gmail.com on 14 Jan 2015 at 10:53

GoogleCodeExporter commented 9 years ago

OK, I understand the master job thing. You are launching the master process as 
a separate job into the cluster. We run this process directly on the login node 
itself.

Original comment by chirofr...@gmail.com on 14 Jan 2015 at 10:56

GoogleCodeExporter commented 9 years ago

The error you get is coming from SGE before it launches a new interproscan 
worker job. I can see the qsub command that your master is spawning. Can you 
try to run that command and see what result (in 
logs/test_bbnof/test_bbnof_976089489_nw_00.out.0) you get. i.e., run the 
command on the line that starts 'command to submit to cluster',  something like

qsub -cwd -V -b y -N job_iprscan5t1worker -q main@biogridn9 -pe make 4  -o 
logs/test_bbnof/test_bbnof_976089489_nw_00.out.0 -e 
logs/test_bbnof/test_bbnof_976089489_nw_00.err.0 java -Xms512m -Xmx2048m -jar 
interproscan-5.jar --mode=distributed_worker --priority=4 
--masteruri=tcp://gquest.be.bayercropscience:64171 
--tempdirname=gquest_20150114_114606190_d6sp 
--userdir=/tools/bioinfo/app/interproscan-5.8 --tier1=1 
--clusterrunid=test_bbnof --mastermaxlife=1421232386192:21562457

I launch the master process to the cluster, but running the master process 
directly should not behave differently with the cluster setup you have.

Regards,
Gift

Original comment by nuka....@gmail.com on 14 Jan 2015 at 11:19

GoogleCodeExporter commented 9 years ago

Hi

I tried what you said. I launched the interproscan run and waited for the 
"worker" command to be displayed in the verbose output. When I take this 
command and launch it manually, the job seems to run and finish well!

I guess there is something wrong with the environment in which the java process 
launches this job? No idea why it fails there/then.
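
One way to test that guess is to dump the environment in both situations and 
compare the two dumps. A minimal sketch with hypothetical helper names; for the 
java-launched case, dump_env would have to be called from a wrapper placed in 
front of qsub in the submit command, which is an assumption about how the 
command is executed. Variables with multi-line values will make the comparison 
noisy.

```shell
# Dump the current environment, sorted, so that two dumps can be compared.
dump_env() {
    # $1 = output file
    env | sort > "$1"
}

# Show the lines that differ between two dumps
# ("<" = only in the first file, ">" = only in the second file).
diff_env() {
    diff "$1" "$2"
}

# Example (not executed here):
#   dump_env /tmp/env.manual        # in the interactive shell
#   dump_env /tmp/env.from_java     # from the wrapper started by the master
#   diff_env /tmp/env.manual /tmp/env.from_java
```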

Original comment by chirofr...@gmail.com on 14 Jan 2015 at 1:38

GoogleCodeExporter commented 9 years ago

At the moment, I would advise you run Interproscan in the default 'black box' 
mode. If you have large files to analyse, you can chunk them into smaller 
files. 

There is not enough info to know what might be happening, whether it's the Java 
or SGE environment or something else. In the next release, we will add more 
debug output at the point the qsub command is launched, so that we can capture 
more info than just the exit status of the qsub command.

Regards,
Gift

Original comment by nuka....@gmail.com on 14 Jan 2015 at 2:44

GoogleCodeExporter commented 9 years ago

ok, I will pass the message.

Just as a last question, when can this new version be expected?

Original comment by chirofr...@gmail.com on 14 Jan 2015 at 2:56

GoogleCodeExporter commented 9 years ago

How small should the chunks of files be?

Original comment by stefanie...@gmail.com on 15 Jan 2015 at 11:06

GoogleCodeExporter commented 9 years ago

For us on a 16 core machine with the following properties set, we make 5000 
sequences per chunk.
#number of embedded workers in a remote worker
worker.number.of.embedded.workers=1
worker.maxnumber.of.embedded.workers=4
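
Chunking an input file this way can be sketched with awk. The helper name 
split_fasta and the output naming scheme are assumptions for illustration, not 
something InterProScan provides:

```shell
# Split a FASTA file into chunks of at most N sequences each.
# Output files are named <prefix>.<k>.fasta with k starting at 1.
split_fasta() {
    # $1 = input FASTA, $2 = sequences per chunk, $3 = output prefix
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new chunk file every n header lines.
            if (count % n == 0) {
                if (out) close(out)
                chunk++
                out = prefix "." chunk ".fasta"
            }
            count++
        }
        out { print > out }
    ' "$1"
}

# Example (not executed here): 5000 sequences per chunk, as mentioned above.
#   split_fasta proteins.fasta 5000 chunk
```

Each resulting chunk file can then be passed to interproscan.sh with -i as a 
separate run.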

Cheers,
Gift

Original comment by nuka....@gmail.com on 20 Jan 2015 at 10:02
