mlexchange / mlex_prefect_worker


Dispatch jobs based on internal configuration #23

Open dylanmcreynolds opened 1 month ago

dylanmcreynolds commented 1 month ago

Right now, MLEX apps must know a tremendous amount about the infrastructure that will launch their training or inference jobs. We would like the app to be able to present the user with a particular job type (from a selection of job types) and simply run it. The job type would then be a configuration on the server that maps to all of the information needed to launch the job.

A quick summary example:

  1. In the segmentation app, the user selects TUNet+ and their model parameters.
  2. The app starts a generic parent flow in Prefect.
  3. The parent flow looks up a local configuration (from the file system, or a Prefect block?).
  4. The parent flow looks up a configuration dictionary that defines that, at this very moment, TUNet+ is being served by NERSC over SFAPI. The configuration may also store parameters that are very specific to running TUNet+ via NERSC SFAPI.
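The steps above can be sketched in plain Python. The dictionary layout, job-type names, and backend parameters below are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical server-side configuration: job type -> everything needed
# to launch it. All keys and values here are placeholders.
JOB_CONFIGS = {
    "tunet+": {
        "backend": "nersc_sfapi",        # currently served by NERSC over SFAPI
        "backend_params": {              # parameters specific to TUNet+ on SFAPI
            "constraint": "gpu",
            "qos": "regular",
        },
    },
    "tunet+_local": {
        "backend": "podman",
        "backend_params": {"image": "tunet:latest"},
    },
}


def dispatch(job_type: str, model_params: dict) -> dict:
    """What the generic parent flow would do: look up the job-type
    configuration and combine it with the user's model parameters."""
    try:
        config = JOB_CONFIGS[job_type]
    except KeyError:
        raise ValueError(f"unknown job type: {job_type!r}")
    return {**config, "model_params": model_params}
```

With this shape, the app only ever names a job type; everything infrastructure-specific stays on the server side of the lookup.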

This seems pretty flexible. In some cases the user might care very much where the model is run, in which case the configuration could be "TUNet+ at NERSC"; in other cases, we think the user will not care.

This brings up the question: what is the best way for the segmentation app to get a list of the current configurations? @Wiebke @taxe10
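One lightweight answer is for the app to list the keys of the server-side configuration. A minimal sketch, assuming the configuration is JSON (it could equally come from a file or a Prefect block); the job-type names in the usage example are made up:

```python
import json


def list_job_types(config_json: str) -> list[str]:
    """Return the job types defined in a server-side configuration,
    passed here as a JSON string mapping job type -> launch details."""
    return sorted(json.loads(config_json).keys())
```

The app could then populate its job-type dropdown from this list without knowing anything about the backends behind each entry.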

Wiebke commented 1 week ago

Based on an in-person discussion, we will be exploring the use of Prefect Blocks to store job configuration details concerning infrastructure, leaving it up to the developers setting up Prefect to define blocks for their local infrastructure and to supply the logic for switching between multiple compute resources (e.g., local vs. a supercomputer such as NERSC).

Our first iteration will need to support the current use of the flows defined in mlex_prefect_worker by mlexchange/mlex_highres_segmentation and mlexchange/mlex_latentspaceexplorer, as well as the currently used compute infrastructure: Podman containers, Conda and Mamba environments, and Slurm.
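The switching logic mentioned above could look something like the following. The backend names match the comment; the command-line flags, image names, and config keys are placeholder assumptions, not the project's real values:

```python
def build_launch_command(config: dict, script: str) -> list[str]:
    """Sketch of dispatching one job to different compute resources
    based on a block-like config dict. All flags are illustrative."""
    backend = config["backend"]
    if backend == "podman":
        # run inside a container image named in the config
        return ["podman", "run", "--rm", config["image"], "python", script]
    if backend in ("conda", "mamba"):
        # run inside a named conda/mamba environment
        return [backend, "run", "-n", config["env_name"], "python", script]
    if backend == "slurm":
        # submit via a batch script named in the config
        return ["sbatch", config["batch_script"], script]
    raise ValueError(f"unsupported backend: {backend!r}")
```

Keeping this switch in one place (e.g., behind a Prefect Block per site) is what lets the apps stay ignorant of the infrastructure.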