saga-project / BigJob

SAGA-based Pilot-Job Implementation for Compute and Data
http://saga-project.github.com/BigJob/

Library Reference Sphinx Documentation #74

Closed melrom closed 10 years ago

melrom commented 11 years ago

End User Comments:

One immediate comment, though, as I fast-forwarded to the section on Library Reference, since that's where I felt the existing documentation was a bit lacking: it would be very useful to be more specific about the definitions of parameters and methods. For example, "processes_per_host" is mentioned but not really defined. "Required on some architectures" is not a definition. The association with "MPI" is confusing, as it is a PilotComputeDescription option, not a ComputeUnit parameter. I think what most users want and have requested is a detailed description of the algorithm used to translate this and other parameters to a "-l nodes=..." specification and an "mpirun" command, say. Similarly, the description of the get_state() methods is incomplete and of limited use without a description of the return values and their meanings.
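To make the request above concrete, here is a minimal sketch of the kind of input-to-output rule such documentation could spell out. It is not BigJob's actual algorithm, which this thread does not describe. The function name `pbs_node_request` and the `number_of_processes` parameter are assumptions made for illustration; only `processes_per_host` and the "-l nodes=..." form come from the comment.

```python
import math


def pbs_node_request(number_of_processes: int, processes_per_host: int) -> str:
    """Hypothetical mapping from pilot parameters to a PBS resource request.

    Assumption: the node count is the process count divided by
    processes_per_host, rounded up. BigJob may do something different.
    """
    if number_of_processes < 1 or processes_per_host < 1:
        raise ValueError("both values must be positive integers")
    nodes = math.ceil(number_of_processes / processes_per_host)
    return f"-l nodes={nodes}:ppn={processes_per_host}"


if __name__ == "__main__":
    # 16 processes at 8 per host would request two full nodes.
    print(pbs_node_request(16, 8))
```

A documented rule of this shape would let a user predict the scheduler request from the description alone, which is the point the comment is making.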

melrom commented 11 years ago

End User Comments:

Personally, I don't think so. Maybe we are giving too much thought to this ppn option ... but I think it's a good example of useful vs. less useful documentation for the users.

Wording such as "mapped to the SAGA job description" or "it's a SAGA issue" is not very helpful. I and, I venture to say, many BigJob users really don't know what SAGA is, how to interpret those statements and, very importantly, how to make an actionable choice based on them.

My thoughts on this mainly originate from having to write documentation for commercial software users. At the most basic level users are only interested in:

A: what's the input (in our case pilot and compute unit descriptions)
Z: what's the expected output/outcome (nodes/cores request or mpirun command options or similar)

The B to Y steps to go from A to Z should be hidden as much as possible if users have no way to interact with them.

If the design of BigJob is to make it as easy as possible for a vast user audience to run distributed code, and if one of the design principles is to create an interface that completely hides the complexities behind SAGA, etc. then in principle references to "SAGA" should not appear in the BigJob documentation.

Obviously, a complete wrapping of the underlying complexities is sometimes not possible. For example, in one project I was involved with in a commercial setting, we realized, based on customer feedback, that there was no easy way to hide all of the steps between A and Z for some power users. We therefore decided to let users interact with B directly if they wanted to, which meant writing good documentation for both A and B. That made the power users happy, while regular users kept interacting with A and ignoring the complexities of B.

So in our example, if "ppn" is a "SAGA issue" and there's no way to understand it without SAGA, then at a minimum a reference/link to the relevant SAGA documentation should be provided. It goes without saying that this might be a slippery slope, though. If users need to study SAGA to use BigJob, then they are likely to end up interacting with SAGA, not BigJob ...

melrom commented 11 years ago

End User Comments:

To me the confusing thing about ppn is that on XSEDE machines it doesn't seem to affect anything. I always set it to 1 and everything seems to be fine. I recently installed BigJob on our local PBS cluster made up of 8-core nodes and the PBS request always ends up being "-l nodes=X:ppn=8" no matter what I set ppn to. Furthermore ppn is an actual option for PBS but not for SGE, say. How's ppn "translated" to SGE parallel environment settings, say? Also, in general "ppn" is not a property of the cluster configuration. I can run a job with -l nodes=1:ppn=1 or -l nodes=1:ppn=8 on the same cluster and that has different effects on the way executables are launched ...
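The SGE question above can be illustrated with a small sketch. SGE requests a total slot count through a parallel environment ("-pe NAME SLOTS"), and how those slots are spread across hosts is normally set by the administrator in the parallel environment's allocation rule rather than per job. The function below is a hypothetical translation written for this illustration, not what BigJob or SAGA does. The parallel environment name "mpi" is a placeholder, since these names are site-specific.

```python
def sge_pe_request(nodes: int, ppn: int, pe_name: str = "mpi") -> str:
    """Hypothetical translation of a PBS-style nodes/ppn pair to SGE.

    Assumption: the closest SGE equivalent is a request for nodes * ppn
    slots in a site-defined parallel environment. The per-host layout is
    then decided by that environment's configuration, not by this request.
    """
    if nodes < 1 or ppn < 1:
        raise ValueError("nodes and ppn must be positive integers")
    total_slots = nodes * ppn
    return f"-pe {pe_name} {total_slots}"


if __name__ == "__main__":
    # The PBS request "-l nodes=2:ppn=8" would become a 16-slot request.
    print(sge_pe_request(2, 8))
```

Under these assumptions the ppn value survives only as a factor in the slot count, which is one reason the option needs a per-scheduler explanation.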

The reason I have repeatedly asked for some clarity on this option is not really specific to this option. I spent a good part of a day a few weeks ago debugging a BigJob installation, playing with ppn and other options. In the end the issues had nothing to do with ppn. But the uncertainty about that option added an uncontrolled parameter to the other 13K variables and settings one is contending with in BigJob as well as the MD engines, MPI, replica exchange, etc. ...

drelu commented 11 years ago

As said a thousand times: this depends on your cluster configuration. I am fine with explaining it without SAGA. Despite all abstractions, the user just needs to know some particularities of the resource they are running on.

On Trestles you certainly don't run anything without the right ppn:

(python)[luckow@trestles-login1 ~]$ qsub -I -lnodes=1
-l nodes=...:ppn=:... is required
qsub: submit filter returned an error code, aborting job submission.

(python)[luckow@trestles-login1 ~]$ qsub -I -lnodes=1:ppn=1
-I requires -l nodes=...,walltime=...
qsub: submit filter returned an error code, aborting job submission.
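Read together, the two error messages suggest that an interactive request on this machine needs nodes, ppn and walltime in the same "-l" argument. The sketch below assembles such a command line. The helper name `interactive_qsub_command` and the walltime value are assumptions for illustration, and the thread does not confirm that the submit filter accepts exactly this form.

```python
def interactive_qsub_command(nodes: int, ppn: int, walltime: str) -> list[str]:
    """Build an interactive qsub argument list with nodes, ppn and walltime.

    Assumption: the submit filter wants all three values in one -l
    resource list, as the two error messages above imply.
    """
    resource_list = f"nodes={nodes}:ppn={ppn},walltime={walltime}"
    return ["qsub", "-I", "-l", resource_list]


if __name__ == "__main__":
    # Example values only; pick limits that match your allocation.
    print(" ".join(interactive_qsub_command(1, 1, "00:30:00")))
```

Returning a list rather than a single string keeps the arguments separate, so the result can be passed straight to `subprocess.run` without shell quoting.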

melrom commented 10 years ago

We are attempting to mitigate these issues with a table showing how to configure BigJob for each machine in XSEDE/FutureGrid. Aside from this, @drelu has edited and updated the Sphinx API documentation to a satisfactory level.