radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

BG/Q and shell spawner don't work together #474

Closed marksantcroos closed 8 years ago

marksantcroos commented 9 years ago

With the new shell spawner, CUs fail on Joule with:

2015-01-22 12:21:09.760 (FATAL) [0x40001188ea0] LL15012212200109:23755:ibm.runjob.client.Job: could not start job: job failed to start
2015-01-22 12:21:09.761 (FATAL) [0x40001188ea0] LL15012212200109:23755:ibm.runjob.client.Job: scheduler failed the job with: Jobs not submitted by LoadLeveler cannot run because BG_ALLOW_LL_JOBS_ONLY is TRUE

This is likely because the shell spawner ends up in a fresh process session or process group.

According to the IBM documentation, the admins on Joule changed the default. It would be good to find out why.

As a fallback, we still have the popen spawner, which works.

andre-merzky commented 9 years ago

Putting CUs into their own process session might not be strictly necessary: the agent exists for the lifetime of the units, so we should not be concerned about that failure mode (a new session prevents the child from dying with the parent, and also prevents zombies in that context). Shall we try without it on Joule? Would you mind giving it a try with set -m removed in this line: https://github.com/radical-cybertools/radical.pilot/blob/devel/src/radical/pilot/agent/radical-pilot-spawner.sh#L405
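
For reference, the session behaviour described here can be reproduced outside the agent. The following is a minimal Python sketch, not the spawner's actual code: it assumes a POSIX system, and the helper name child_session_id is made up for illustration. It starts a child once in the parent's session and once in a new session (setsid, requested through the start_new_session argument of subprocess) and reports the session ID the child sees. Whether the BG/Q scheduler actually looks at the session is not established in this thread.

```python
import os
import subprocess
import sys


def child_session_id(new_session: bool) -> int:
    """Spawn a child Python process and return the session ID it reports.

    Hypothetical helper for illustration only; POSIX systems only.
    """
    out = subprocess.check_output(
        [sys.executable, "-c", "import os; print(os.getsid(0))"],
        # True makes the child call setsid() before exec, so it becomes
        # the leader of a brand-new session.
        start_new_session=new_session,
    )
    return int(out.decode().strip())


if __name__ == "__main__":
    print("parent session:   ", os.getsid(0))
    print("inherited session:", child_session_id(False))
    print("new session:      ", child_session_id(True))
```

The inherited case should report the parent's session ID, while the new-session case reports a different one. Note that set -m in a shell enables job control, which places background jobs in their own process group rather than in a new session, so the two mechanisms are related but not identical.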

marksantcroos commented 9 years ago

> Would you mind giving it a try with set -m removed

Doesn't seem to make a difference.

andre-merzky commented 9 years ago

Hmm, strange... You can switch to popen meanwhile, per #475, but let me ponder over this one. In what context does the error come up? Could you have a look into the unit's subdir in the spawner tmp tree?

marksantcroos commented 9 years ago

> You can switch to popen meanwhile, per #475,

Done, thanks.

> In what context does the error come up? Could you have a look into the unit's subdir in the spawner tmp tree?

-ECANNOTPARSE.

andre-merzky commented 9 years ago

> > In what context does the error come up? Could you have a look into the unit's subdir in the spawner tmp tree?

> -ECANNOTPARSE.

I mean the error "Jobs not submitted by LoadLeveler cannot run because BG_ALLOW_LL_JOBS_ONLY is TRUE": in what log did that show up, and what are the lines before it? Thanks!

marksantcroos commented 9 years ago

> I mean the error "Jobs not submitted by LoadLeveler cannot run because BG_ALLOW_LL_JOBS_ONLY is TRUE": in what log did that show up,

This is the output of runjob, and therefore ends up in CU stdout.

> and what are the lines before it?

None. Should there be any?
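
To check the "lines before it" question for any given unit, one could scan the unit's stdout for the error and collect the preceding context. A small Python sketch, assuming the stdout content has already been read into a string; the function name context_before and the SAMPLE text (copied from the two log lines at the top of this issue) are only for illustration.

```python
from typing import List

# Sample text: the two runjob lines quoted at the top of this issue.
SAMPLE = (
    "2015-01-22 12:21:09.760 (FATAL) [0x40001188ea0] "
    "LL15012212200109:23755:ibm.runjob.client.Job: "
    "could not start job: job failed to start\n"
    "2015-01-22 12:21:09.761 (FATAL) [0x40001188ea0] "
    "LL15012212200109:23755:ibm.runjob.client.Job: "
    "scheduler failed the job with: Jobs not submitted by LoadLeveler "
    "cannot run because BG_ALLOW_LL_JOBS_ONLY is TRUE\n"
)


def context_before(text: str, needle: str, n: int = 5) -> List[List[str]]:
    """For each line containing `needle`, return up to `n` preceding lines."""
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if needle in line:
            # Slice the window of lines that come right before the match.
            hits.append(lines[max(0, i - n):i])
    return hits


if __name__ == "__main__":
    for ctx in context_before(SAMPLE, "BG_ALLOW_LL_JOBS_ONLY"):
        print("\n".join(ctx) if ctx else "(no preceding lines)")
```

With the sample above, the only context found for the BG_ALLOW_LL_JOBS_ONLY line is the "could not start job" line, which matches the report that nothing else precedes the error.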

andre-merzky commented 8 years ago

Since popen works and is the default anyway, I'll close this one.