Currently, the user has to start the manager and the instances themselves, and tell the components where to find the manager. That's not what we want: we want a single command that can start the whole thing, like MUSCLE 2 had. Let's call it `muscle-run`. Here are some thoughts on how it should be implemented, with thanks to @diregoblin for the discussion.
Starting instances should be implemented using a pilot job framework, so that we can distribute the instances over the available nodes in an HPC allocation, but it also needs to work locally for testing and less compute-intensive runs. QCG-PilotJob offers this, and will be the first option to try. The actual starting of instances should be controlled by `muscle_manager`, so that we can later have it start instances on demand. So `muscle-run` starts the pilot job framework and the manager, and then the manager controls the pilot job framework.
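Just to make the idea concrete, here's a rough sketch of what the manager-side glue could look like, using QCG-PilotJob's documented Python API (`LocalManager`, `Jobs`). The helper name and the job parameters are made up for illustration, and the exact division of work between `muscle-run` and the manager is still to be decided.

```python
from qcg.pilotjob.api.manager import LocalManager
from qcg.pilotjob.api.job import Jobs


def start_instances(instances):
    """Hypothetical helper: submit one pilot job per instance.

    `instances` is assumed to be a list of (name, executable, args)
    tuples taken from the configuration; this is a sketch, not a design.
    """
    # In the proposed design, muscle-run would start the pilot job
    # service and hand it to the manager; here we just create one locally.
    qcg = LocalManager()

    jobs = Jobs()
    for name, executable, args in instances:
        jobs.add(name=name, exec=executable, args=args,
                 numCores={'exact': 1})

    job_ids = qcg.submit(jobs)

    # Eventually the manager could submit further jobs on demand;
    # for now, just wait for everything to finish.
    qcg.wait4(job_ids)
    qcg.finish()
```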
If we're going to start instances, then we're going to have to deal with input and output files, working directories, etc. We want to be able to specify an input directory, globally and/or per instance, and a run directory. The run directory will have a subdirectory for every instance, and each instance is started within its own subdirectory. The input directory path is passed to the model via an environment variable, something like `MUSCLE_INPUT_DIR`. Standard output and standard error are redirected to log files in the instance's working directory (or maybe in the global work dir?). Output files should be written to the working directory by the model. On HPC, you typically want to write output to an on-node temp dir, then copy it to an archive or shared disk at the end of the run. For this to work, the pilot job must make a temp dir, start the submodel in it, and then, when the submodel finishes, move the on-node temp dir into the overall run dir.
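Per instance, that would be something like the following (standard-library sketch only; `run_instance` and the exact split of responsibilities between the pilot job and the manager are assumptions):

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_instance(name, command, input_dir, run_dir):
    """Run a submodel in an on-node temp dir, then move its files
    into the overall run directory. Illustrative sketch only."""
    # On-node scratch space; TMPDIR typically points at node-local storage.
    work_dir = Path(tempfile.mkdtemp(prefix=name + '_'))

    # Tell the model where its input files are.
    env = dict(os.environ)
    env['MUSCLE_INPUT_DIR'] = str(input_dir)

    # Redirect stdout/stderr to log files in the instance's working dir.
    with (work_dir / 'stdout.log').open('w') as out, \
            (work_dir / 'stderr.log').open('w') as err:
        subprocess.run(command, cwd=work_dir, env=env, stdout=out, stderr=err)

    # When the submodel is done, move everything into the run dir.
    shutil.move(str(work_dir), str(Path(run_dir) / name))
```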
How to start instances will be configurable from the yMMSL file. There should be an extra top-level attribute `implementations`, which contains a list of objects describing how to start a particular installed implementation, with the usual sugar so that it's a dict on the YAML side. So that's environment, executable path, and arguments, and maybe it should be possible to specify a list of commands, so that you can activate a venv or Conda env or something. The `compute_elements` section can already refer to these.
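A rough idea of what this could look like on the YAML side; none of the field names below are final, they're just to illustrate the shape:

```yaml
implementations:
  isr2d_flow:
    env:
      OMP_NUM_THREADS: 4
    executable: /home/user/isr2d/bin/flow
    args: --blood-viscosity 0.004
  isr2d_smc:
    # or a list of commands, e.g. to activate a venv first
    script:
      - . /home/user/envs/isr2d/bin/activate
      - python /home/user/isr2d/smc.py

compute_elements:
  flow: isr2d_flow
  smc: isr2d_smc
```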
To make dealing with configuration files easier, `muscle_manager` should take any number of configuration files, which will be merged on top of each other left to right (i.e. rightmost wins). Then we can have one file with compute elements and topology, another with settings for a specific experiment, and a third with the implementations installed on a particular machine, and the user can easily mix and match.
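The merge semantics would be something like this (sketched with plain dicts; the real implementation would merge ymmsl configuration objects instead):

```python
def merge_configs(configs):
    """Merge configuration dicts left to right; later ones override earlier ones."""
    merged = {}
    for config in configs:
        for key, value in config.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Merge nested sections recursively rather than replacing them.
                merged[key] = merge_configs([merged[key], value])
            else:
                merged[key] = value
    return merged


# So a (hypothetical) invocation like
#     muscle_manager model.ymmsl experiment.ymmsl machine.ymmsl
# would behave like merge_configs([model, experiment, machine]),
# with values from machine.ymmsl winning on conflicts.
```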