saga-project / BigJob

SAGA-based Pilot-Job Implementation for Compute and Data
http://saga-project.github.com/BigJob/

BJ manual: wall_time_limit, number_of_processes #166

Open mturilli opened 10 years ago

mturilli commented 10 years ago

Dear BJ team,

I have some issues with the current BJ documentation. At http://saga-project.github.io/BigJob/sphinxdoc/usage/appwriting.html I read:

"wall_time_limit - Specifies the number of minutes the resources are requested for."

I tried to use 'wall_time_limit' and BJ did not honour it. Instead, I had to use 'walltime'. Is 'wall_time_limit' correct? If so, how does it differ from 'walltime'?

On the same page, and here: http://saga-project.github.io/BigJob/sphinxdoc/library/index.html I also read:

"number_of_processes - This refers to the number of cores that need to be allocated to run the jobs"

Does a pilot span multiple nodes when the number of processes requested is greater than the number of cores of a single node? If so, is there a way to inspect a pilot to find out how many nodes it has been scheduled on and, where applicable, is being executed on?

Many thanks, Matteo

melrom commented 10 years ago

Hey Matteo, thanks. A few months ago Ole mentioned that the appwriting page was pointless, and I forgot to remove it. Thanks!

As for the second question: you request cores in multiples of whole nodes. So if you want to run 16 1-core jobs on Lonestar at the same time, you have to marshal 24 cores (2 nodes) for the Pilot (http://saga-project.github.io/BigJob/sphinxdoc/tutorial/table.html). Alternatively, you can marshal just one node, in which case the jobs will not all run at once: you obtain the node, start 12 1-core jobs, and each time one finishes the next 1-core job can start, until all 16 have run. Which of the two you choose probably depends on your TTC, budget, etc.
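To make the arithmetic above concrete, here is a minimal Python sketch. It is not part of BJ: the function names are made up for illustration, and the node size of 12 cores is an assumption taken from the Lonestar numbers in this thread.

```python
CORES_PER_NODE = 12  # assumption: Lonestar node size, as implied in this thread


def pilot_cores(jobs, cores_per_job=1, cores_per_node=CORES_PER_NODE):
    """Cores to request so that all jobs can run at once (whole nodes only)."""
    needed = jobs * cores_per_job
    nodes = -(-needed // cores_per_node)  # integer ceiling division
    return nodes * cores_per_node


def running_and_waiting(jobs, pilot_size, cores_per_job=1):
    """How many jobs start immediately and how many wait for a free core."""
    running = min(jobs, pilot_size // cores_per_job)
    return running, jobs - running


if __name__ == "__main__":
    print(pilot_cores(16))              # two nodes are needed for 16 1-core jobs
    print(running_and_waiting(16, 12))  # one node: some jobs start, the rest wait
    print(running_and_waiting(16, 24))  # two nodes: everything starts at once
```

With a single node, the waiting jobs start one by one as running jobs finish, which is the second option described above.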

As for finding out how many nodes the pilot has been scheduled on, I usually use qstat, which reports the number of slots, where 1 slot = 1 core. Note that if you tried to ask for 16 cores on Lonestar (from either BJ or saga-python), you would get an error:

------------------> Rejecting job <------------------
Your slot (or core) request is not a multiple of 12.
Syntax: -pe where is a multiple of 12.
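One way to avoid this rejection is to validate the core count before submitting. The following is only a sketch: `check_slots` is a hypothetical helper, not a BJ or saga-python function, and the description is shown as a plain dictionary. The keys 'number_of_processes' and 'walltime' are the ones discussed in this thread; the 'service_url' key and its value are assumptions used as placeholders, and 'walltime' is assumed to be in minutes, as the documentation states for 'wall_time_limit'.

```python
CORES_PER_NODE = 12  # assumption: Lonestar accepts only multiples of 12 slots


def check_slots(cores, cores_per_node=CORES_PER_NODE):
    """Return cores unchanged if it is a positive multiple of the node size."""
    if cores <= 0 or cores % cores_per_node != 0:
        raise ValueError(
            "slot request %d is not a multiple of %d" % (cores, cores_per_node)
        )
    return cores


# Plain dictionary only; nothing is submitted here.
pilot_compute_description = {
    "service_url": "sge+ssh://lonestar.example.org",  # hypothetical endpoint
    "number_of_processes": check_slots(24),           # 2 nodes of 12 cores
    "walltime": 10,                                   # assumed to be minutes
}

if __name__ == "__main__":
    print(pilot_compute_description)
```

Calling `check_slots(16)` raises a `ValueError` locally, so the problem shows up before the scheduler rejects the job.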