multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment
Apache License 2.0

Develop a launcher for MUSCLE3 #19

Open djgroen opened 5 years ago

djgroen commented 5 years ago

I'm planning to make a launcher for MUSCLE3 in the form of a FabSim3 plugin, to make it easier to start and coordinate submodels, especially when they are launched remotely or in a distributed fashion.

Before I make a first version of such a launcher, I would like to ask for any specific design preferences. @LourensVeen, do you have particular preferences regarding how I design or construct such a launcher, or what functionality it should support?

I plan to keep this issue for design discussions and feedback, while moving any technical discussions and other tasks to the respective plugin repository. But if you would like to organize it otherwise, let me know and I'll rearrange things.

LourensVeen commented 5 years ago

All components need to run inside of a single allocation, so the basic idea is to submit a job to the scheduler with enough resources to run the whole simulation. Within that job, the basic functionality is to first start the manager, passing it the yMMSL file, and to get its location from its standard output. Then start the compute element instances, passing them the location of the manager.
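A minimal sketch of that start-up order in Python, using only the standard library. It assumes that the manager prints its location as the first line on standard output and that each instance accepts that location as its last command-line argument. Both conventions are assumptions made for illustration, not MUSCLE3's documented interface, and the stand-in commands at the bottom take the place of the real manager and submodels.

```python
import subprocess
import sys
from typing import Dict, List, Tuple


def start_manager(command: List[str]) -> Tuple[subprocess.Popen, str]:
    """Start the manager and return the process and the location it reports.

    Assumption: the location is the first line written to standard output.
    """
    proc = subprocess.Popen(command, stdout=subprocess.PIPE, text=True)
    location = proc.stdout.readline().strip()
    if not location:
        # End of output without a location: clean up and report the failure.
        proc.kill()
        proc.wait()
        raise RuntimeError("manager exited without reporting a location")
    return proc, location


def start_instances(
    commands: Dict[str, List[str]], location: str
) -> Dict[str, subprocess.Popen]:
    """Start every instance, appending the manager location to its command.

    Passing the location as the last argument is a hypothetical convention.
    """
    return {
        name: subprocess.Popen(command + [location])
        for name, command in commands.items()
    }


if __name__ == "__main__":
    # Stand-ins for the real programs, so that the sketch runs anywhere.
    fake_manager = [sys.executable, "-c", "print('localhost:9000')"]
    fake_instance = [
        sys.executable, "-c", "import sys; print('connecting to', sys.argv[1])"
    ]

    manager, location = start_manager(fake_manager)
    instances = start_instances(
        {"macro": fake_instance, "micro": fake_instance}, location
    )
    for proc in instances.values():
        proc.wait()
    manager.wait()
```

Note that a real manager that keeps writing to its standard output would need that pipe drained, or it could eventually block.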

This means that something is needed to start the instances on specific nodes/cores within the allocation, e.g. srun, an srun wrapper, or a pilot job framework. And we need to have information on how many resources each instance needs. My idea here is to extend yMMSL with that information, using a third top-level attribute (besides the model and the settings). It would probably be a list of compute elements (or perhaps instances, or the ability to specify either), each with its required resources. The number of cores is the primary requirement, I'd say; things like RAM and disk space could be added later.
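As a rough sketch of how such a resource description could drive an srun-based start, in Python: the `resources` list, its field names, and the submodel commands are all hypothetical, since yMMSL has no such attribute at this point. The only thing taken from Slurm is srun's `--ntasks` option, and the sketch assumes one task per requested core.

```python
from typing import Dict, List

# Hypothetical resource description: one entry per compute element.
resources = [
    {"name": "macro", "cores": 1, "command": ["python3", "macro.py"]},
    {"name": "micro", "cores": 4, "command": ["python3", "micro.py"]},
]


def srun_command(entry: Dict) -> List[str]:
    """Build the argument list that starts one compute element under srun."""
    return ["srun", f"--ntasks={entry['cores']}"] + list(entry["command"])


if __name__ == "__main__":
    for entry in resources:
        # Print rather than execute, so the sketch runs without Slurm.
        print(entry["name"], "->", " ".join(srun_command(entry)))
```

A pilot job framework or an srun wrapper would replace `srun_command` with its own way of placing instances; the resource description itself would stay the same.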

The launcher would then read that information, add it up to calculate the total number of cores, and allocate that from the scheduler. Within that allocation another component would read it again and start each component as described above.
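The adding-up step could look roughly like this in Python. The `resources` mapping and its `cores` and `instances` fields are hypothetical names for the proposed yMMSL extension, and the `#SBATCH --ntasks` line assumes a Slurm scheduler; another scheduler would need a different directive.

```python
from typing import Dict

# Hypothetical resource section, keyed by compute element name.
resources = {
    "macro": {"cores": 1},
    "micro": {"cores": 4, "instances": 2},
}


def total_cores(resources: Dict[str, Dict[str, int]]) -> int:
    """Sum the cores over all instances of all compute elements."""
    return sum(
        entry["cores"] * entry.get("instances", 1)  # default: one instance
        for entry in resources.values()
    )


def sbatch_directive(resources: Dict[str, Dict[str, int]]) -> str:
    """Return a Slurm batch directive requesting the total number of cores."""
    return f"#SBATCH --ntasks={total_cores(resources)}"


if __name__ == "__main__":
    print(sbatch_directive(resources))
```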

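So the algorithm for the launcher would look something like this, I think:

1. Read the yMMSL file, including the resource requirements.
2. Add up the required number of cores.
3. Submit a job to the scheduler requesting that many cores.

Then within the allocation:

1. Start the manager, passing it the yMMSL file.
2. Get the manager's location from its standard output.
3. Start the compute element instances on their nodes/cores, passing them the location of the manager.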

Submodels may require additional inputs in the form of data files, and they may produce output files. There has to be a description of those as well somewhere, so that they can be staged in and out, but I'm not sure that the yMMSL file is the right place to do so. How does FabSim3 deal with that currently? And how do you specify the amount of required resources?

djgroen commented 5 years ago

@LourensVeen Thanks for all the pointers. I am now starting work on this, which you can track here: https://github.com/djgroen/FabMUSCLE

(sorry for the delay: a tutorial workshop in Ethiopia and a summer break were in the way :/)

djgroen commented 5 years ago

I also opened an issue on this in the FabSim3 repo, which is here: https://github.com/djgroen/FabSim3/issues/145

djgroen commented 5 years ago

@LourensVeen I'm setting up various templates now as a first step. By default I tend to use python3 as the executable, but your tutorial here uses python: https://muscle3.readthedocs.io/en/latest/distributed_execution.html

Do you want me to make the Python executable name configurable in FabSim3, or are you happy for me to default to python3 for the time being?

(I'll continue working on other parts in the meantime)

djgroen commented 5 years ago

@LourensVeen Just to clarify: I added @arabnejad (one of my post-docs) as a collaborator so that he can also contribute comments to this issue.

djgroen commented 5 years ago

Okay, small update here: I've got the all-in-one script working on localhost now, either stand-alone or with replicas enabled. I also generalized the installation command so that it can be launched on remote machines.

Next up is:

djgroen commented 5 years ago

I managed to enable (in early form) the execution of a distributed MUSCLE application on localhost. Currently, the execution mechanism is quite simplistic: I simply start the MUSCLE manager and all the ComputeElements concurrently on the local host.

Later on I'll make this more advanced, but first I want to see how far I can get with making this basic functionality work on Eagle for M15 :).