gwastro / pycbc

Core package to analyze gravitational-wave data, find signals, and study their parameters. This package was used in the first direct detection of gravitational waves (GW150914), and is used in the ongoing analysis of LIGO/Virgo data.
http://pycbc.org
GNU General Public License v3.0

Need to think about how to handle pegasus executable entries for different tags #277

Closed duncan-brown closed 8 years ago

duncan-brown commented 8 years ago

This issue only affects MPI running on Stampede, so it's not urgent, but I'm putting it here so as not to lose the issue.

Larne, Duncan,

After some further thought on the request to use the same executable entry for each injection run, I realized that my simple suggestion to merge the injection executables was wrong, and will not work in general for the injection inspiral jobs.

I am cc'ing Ian, as this may involve significant changes to the way the pycbc workflow modules conceptually work.

The model we have used for the pycbc workflow is that it is built from a top-level set of components that are essentially procedural but modular. Each component adds some useful amount of work to the workflow in an independent manner, and components share information only through explicit lists of result files, ensuring that the data dependencies are clearly visible at the highest level.

Executables are only instantiated and used within the context of a single workflow function. The model we have used is that they are independent generators of jobs. As such, in principle (through ini file configuration), although I believe not in active use, one could use a different physical executable for each call to a setup function.
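Schematically, the current pattern is something like the sketch below; the class and function names (e.g. setup_matchedfltr_workflow) and the signatures are stand-ins for illustration rather than exact pycbc code.

```python
# Illustrative sketch of the current model; names and signatures are
# stand-ins, not lifted from the pycbc workflow module.
from pycbc.workflow import Executable, FileList

def setup_matchedfltr_workflow(workflow, science_segs, datafind_files,
                               out_dir, tags=None):
    tags = tags or []
    # The Executable is created locally and acts as a job generator; its
    # common options and output folder come from the ini file and live on
    # this instance. It never leaves this setup function.
    inspiral_exe = Executable(workflow.cp, 'inspiral', ifos=workflow.ifos,
                              out_dir=out_dir, tags=tags)
    out_files = FileList([])
    for seg in science_segs:
        node = inspiral_exe.create_node()   # one job per analysis segment
        node.add_input_list_opt('--frame-files', datafind_files)
        workflow.add_node(node)
        out_files.extend(node.output_files)
    # Only lists of result files cross the setup-function boundary.
    return out_files
```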

To recap, the main problems we are having, as we have discovered by running on Stampede, are the following.

1) Each ifo/injection combination gets a different entry in the transformation catalog.

a) An executable is staged for each entry, even if the PFN is the same (it is renamed on the remote site).

b) Horizontal clustering is the main mode of analysis and operates at the level of the transformation catalog, meaning that at the moment it can only cluster jobs made by a single executable instance, which is a problem when there is a large number of injection sets. With label-based clustering we can easily collapse this to a few jobs (e.g. one per ifo is a trivial ini file option), but we can't granularly control the number of clustered jobs with a single option. (A sketch of the two clustering modes follows this list.)

2) There is a general conceptual mismatch between the Dax3.Executable class and the pycbc Executable class. The worry is that this could cause us to move out of step with Pegasus development and cause further issues down the line.
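For concreteness, here is roughly how the two clustering modes are expressed at the Pegasus level. The profile keys ('clusters.size' for horizontal, 'label' for label-based) are my reading of the Pegasus conventions rather than anything taken from the pycbc code, and the job names are made up:

```python
from Pegasus.DAX3 import ADAG, Job, Profile, Namespace

dax = ADAG('clustering_example')

# Horizontal clustering groups jobs that share a transformation-catalog
# entry, so its granularity follows the (per-tag) executable entries.
job_a = Job(name='inspiral_H1_INJ0001')
job_a.addProfile(Profile(Namespace.PEGASUS, 'clusters.size', '20'))
dax.addJob(job_a)

# Label-based clustering groups any jobs that carry the same label value,
# regardless of which transformation they were generated from.
job_b = Job(name='inspiral_H1_INJ0002')
job_b.addProfile(Profile(Namespace.PEGASUS, 'label', 'h1-inspiral'))
dax.addJob(job_b)
```

At plan time one would then ask pegasus-plan for the corresponding style via its --cluster option (again, as I understand the Pegasus documentation).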

Larne, Duncan, is that a fair understanding of what the problems are? Please point out any other issues.

The current model has a number of consequences that make it difficult to quickly implement the model you are requesting, though it is certainly not infeasible.

These are the main technical issues.

1) Setup functions do not share executable instances. Currently, they only share files, file lists, data products, etc. The idea was that this is something that should not be exposed at the top level, as it has no bearing on the actual data-driven plumbing between components.

2) Executable instances are viewed as generators. This means they keep track of the common options for the executable (derived from the ini file) and information such as the output folder in which the files they generate will be stored, all of which would need to be separated out.

We could explicitly instantiate executables in the top-level workflow and pass them to the setup functions. We would also need to move the common-option, output-folder, etc. logic into the node instantiation functions. This would require changes to nearly all parts of the code: information we would normally pass to the executable would have to be passed to node creation instead. These are mostly local changes, however, as typically the executable is generated and then called many times to generate jobs (see the sketch below).

We would have to think very carefully about how this would affect the configuration file hooks. While I think this change could be made without changing the way the configuration file addresses different tasks, that requires some thought and verification.

What I am struggling with is the idea that one passes the executable instance around between setup function calls. To me the current agreement about what goes into and comes out of a workflow setup function is very clear, and this would break that agreement. Do we do this for all setup functions, or do we make the inspiral ones a special case?
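To make that concrete, the shape I have in mind is roughly the following. The signatures are invented for illustration (create_node does not currently take these arguments), and the top-level variables are assumed to come from earlier setup calls as in the sketch above:

```python
# Hypothetical sketch of the proposed restructuring; not the current
# pycbc API. Assumes workflow, science_segs, datafind_files and inj_tags
# exist as in the earlier sketch.
from pycbc.workflow import Executable, FileList

def setup_matchedfltr_workflow(workflow, science_segs, datafind_files,
                               inspiral_exe, out_dir, tags=None):
    tags = tags or []
    out_files = FileList([])
    for seg in science_segs:
        # Options and output locations that currently live on the
        # Executable instance would be supplied at node-creation time.
        node = inspiral_exe.create_node(out_dir=out_dir, tags=tags)
        node.add_input_list_opt('--frame-files', datafind_files)
        workflow.add_node(node)
        out_files.extend(node.output_files)
    return out_files

# Top level: one Executable instance shared by every injection run, so
# all the inspiral jobs map onto a single transformation-catalog entry.
inspiral_exe = Executable(workflow.cp, 'inspiral', ifos=workflow.ifos)
for inj_tag in inj_tags:
    setup_matchedfltr_workflow(workflow, science_segs, datafind_files,
                               inspiral_exe,
                               out_dir='inspiral_' + inj_tag,
                               tags=[inj_tag])
```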

If this is a change worth making, then we would need to plan for it, and it certainly can't be done until after O1. In the meantime, are there any show-stoppers that prevent us from using label-based clustering for now?

-Alex

I'm missing the context here (this appears to be the result of some previous runs which I haven't heard about!), so my answer probably doesn't make sense. But ....

Is the problem that you want to cluster over different injection runs? As Alex says, this could be done with label clustering. However, why would you want to do that? Even in our largest workflows we have never run more than 200 injection sets, so you could use horizontal clustering to cluster all the inspiral jobs in each injection run together, giving 200 clustered jobs. What is the problem with that?

Maybe the issue is gwf file staging? Perhaps you want a node to copy a bunch of data files and then have a bunch of inspiral jobs from different injection runs analyse the same data files? If that is the goal then you need label clustering; horizontal clustering will just cluster jobs at random.
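If frame staging is the goal, then in raw Pegasus terms the grouping would look roughly like the sketch below. The label value and job names are made up, and in practice pycbc would set this profile through its configuration hooks rather than by hand:

```python
from Pegasus.DAX3 import ADAG, Job, Profile, Namespace

dax = ADAG('frame_staging_example')

# A job that copies the gwf frame files for one segment to the worker ...
stage = Job(name='frame_copy_H1_seg0042')
stage.addProfile(Profile(Namespace.PEGASUS, 'label', 'h1-seg0042'))
dax.addJob(stage)

# ... and inspiral jobs from different injection runs that analyse those
# same frames carry the same label, so Pegasus clusters them all together.
for inj in ('INJ0001', 'INJ0002', 'INJ0003'):
    insp = Job(name='inspiral_H1_%s_seg0042' % inj)
    insp.addProfile(Profile(Namespace.PEGASUS, 'label', 'h1-seg0042'))
    dax.addJob(insp)
    dax.depends(parent=stage, child=insp)
```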

Cheers Ian

duncan-brown commented 8 years ago

Closing, as this is not an issue with the new way of running on XSEDE proposed in https://github.com/ligo-cbc/pycbc/pull/559.