mlexchange / mlex_computing_api

A collection of shared utilities/libraries, created with the goal of wrapping them in a REST API

Refactor for multiple deployment scenarios #23

Open dylanmcreynolds opened 10 months ago

dylanmcreynolds commented 10 months ago

As part of the IRI project, we need to be able to launch jobs centrally and have them run on NERSC, ALCF, and OLCF. The current design launches jobs as containers locally. We need to support multiple forms of job dispatching (e.g. the NERSC SFAPI, or whatever mechanism is available at the other facilities).
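To make "multiple forms of job dispatching" concrete, here is a minimal sketch of one way the dispatch step could be put behind a common interface. Everything in it is hypothetical: `JobSpec`, `Dispatcher`, `LocalContainerDispatcher`, `RemoteFacilityDispatcher`, and the registry are illustrative names, not part of the current code base, and the remote backend makes no real facility API calls.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass
class JobSpec:
    """Hypothetical job description; the field names are illustrative only."""
    image: str
    command: str
    facility: str = "local"


class Dispatcher(Protocol):
    """Anything that can accept a job and return an identifier for it."""

    def submit(self, job: JobSpec) -> str: ...


class LocalContainerDispatcher:
    """Stands in for the current behaviour of launching a container locally."""

    def submit(self, job: JobSpec) -> str:
        # A real implementation would start the container here.
        return f"local:{job.image}"


class RemoteFacilityDispatcher:
    """Placeholder for a facility-specific backend (assumption: one per site)."""

    def __init__(self, facility: str) -> None:
        self.facility = facility

    def submit(self, job: JobSpec) -> str:
        # A real implementation would call the facility's job API here.
        return f"{self.facility}:{job.image}"


def build_registry() -> Dict[str, Dispatcher]:
    """Map facility names to dispatchers; the names are assumptions."""
    return {
        "local": LocalContainerDispatcher(),
        "nersc": RemoteFacilityDispatcher("nersc"),
        "alcf": RemoteFacilityDispatcher("alcf"),
        "olcf": RemoteFacilityDispatcher("olcf"),
    }


def dispatch(job: JobSpec, registry: Dict[str, Dispatcher]) -> str:
    """Route a job to the dispatcher registered for its facility."""
    try:
        dispatcher = registry[job.facility]
    except KeyError:
        raise ValueError(f"no dispatcher registered for {job.facility!r}") from None
    return dispatcher.submit(job)
```

With this shape, the compute API would only ever call `dispatch`, and supporting a new facility would mean registering one more backend rather than changing the callers.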

This is a large task. Some of the things that we will need to complete it:

The design of this refactor could go many ways.

We could potentially plug in Prefect and have it manage jobs. Agents at the different facilities could then poll for jobs. In this case, the compute API would be enhanced to launch flows on a Prefect server.
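The polling pattern described here can be sketched without any orchestration library. The following is a minimal in-memory illustration, not Prefect's API: `JobServer` and `FacilityAgent` are hypothetical names, and a real deployment would replace them with a Prefect server and its agents or workers.

```python
import queue
from collections import defaultdict
from typing import List, Optional


class JobServer:
    """In-memory stand-in for a central server with one work queue per facility."""

    def __init__(self) -> None:
        self._queues = defaultdict(queue.Queue)

    def submit(self, facility: str, job_name: str) -> None:
        """Called by the compute API to enqueue a job for a given facility."""
        self._queues[facility].put(job_name)

    def poll(self, facility: str) -> Optional[str]:
        """Return the next job for the facility, or None if its queue is empty."""
        try:
            return self._queues[facility].get_nowait()
        except queue.Empty:
            return None


class FacilityAgent:
    """Hypothetical agent running at a facility and pulling work from the server."""

    def __init__(self, facility: str, server: JobServer) -> None:
        self.facility = facility
        self.server = server
        self.completed: List[str] = []

    def poll_once(self) -> bool:
        """Fetch at most one job; return True if a job was picked up."""
        job = self.server.poll(self.facility)
        if job is None:
            return False
        # A real agent would launch the job on the facility's scheduler here.
        self.completed.append(job)
        return True


server = JobServer()
nersc_agent = FacilityAgent("nersc", server)
alcf_agent = FacilityAgent("alcf", server)

server.submit("nersc", "train-model")
alcf_agent.poll_once()   # nothing queued for ALCF, so this picks up no work
nersc_agent.poll_once()  # picks up "train-model"
```

The point of the pull model is that the central server never needs inbound access to a facility: each agent initiates the connection and only sees jobs queued for its own site.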

dylanmcreynolds commented 9 months ago

Just to add to the first point about HPC: the current dependence on using Docker to launch Docker containers is not supported on a variety of systems. It won't work in userspace-only mode in Docker, Podman, etc.