[KIT-67] Runs - Githubissues

run-house / runhouse

Dispatch and distribute your ML training to "serverless" clusters in Python, like PyTorch for ML infra. Iterable, debuggable, multi-cloud/on-prem, identical across research and production.

Apache License 2.0

981 stars, 37 forks

Basic API ideas (WIP):

Create Run object (captures logs, inputs, outputs, other artifacts read or written within call, who ran, where):

res = fn(**kwargs, name="my_run")

A run is a folder (created inside local rh directory by default), and can be sent elsewhere to persist logs, results, artifact info, etc.:

rh.run(name="my_run").to("s3", path="runhouse/nlp_team/bert_ft/results")

Ideally, we can have a "default log store" setting in the user config so the logs from their runs can be sent to the same place by default when they save, rather than having to send each run one by one.

This could also be how users configure artifacts and logs to flow to an existing MLflow store, or to W&B, Grafana, Datadog, etc.

Save the run locally or to RNS (not all runs need to be saved):

rh.run(name="my_run").save()

Create a Run object by tracing the activity within a block. There are no inputs or outputs, but the run captures logs (perhaps several log files for different calls) and the artifacts used:

with rh.run(name="my_run") as r:
    ...  # calls made inside the block are traced as part of the run

Big feature, essentially the same as auto-caching in orchestrators - check if this run was already completed, and load results if so, otherwise run:

res = fn.get_or_run(name="yelp_review_preproc_test")
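The caching behavior described above can be sketched in plain Python. This is only an illustration of the idea, not Runhouse's implementation: the standalone `get_or_run` helper, the `runs_dir` argument, and the `result.pkl` file name are all assumptions made for the example, and stdlib `pickle` stands in for whatever serializer the real feature would use.

```python
import pickle
import tempfile
from pathlib import Path


def get_or_run(fn, name, runs_dir, *args, **kwargs):
    """Return the stored result for run `name`, or call `fn` and store it.

    Hypothetical helper: the lookup is keyed on the run name only.
    """
    run_dir = Path(runs_dir) / name
    result_file = run_dir / "result.pkl"  # assumed file name
    if result_file.exists():
        # The run already completed: load its result instead of re-running.
        with result_file.open("rb") as f:
            return pickle.load(f)
    result = fn(*args, **kwargs)
    run_dir.mkdir(parents=True, exist_ok=True)
    with result_file.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # records every real execution of the function


def preprocess(n):
    calls.append(n)
    return [i * i for i in range(n)]


runs_dir = tempfile.mkdtemp()  # stand-in for the local rh directory
first = get_or_run(preprocess, "yelp_review_preproc_test", runs_dir, 4)
second = get_or_run(preprocess, "yelp_review_preproc_test", runs_dir, 4)
```

Because the sketch keys the cache on the run name alone, calling it again with different arguments under the same name would return the stale result. A real implementation would probably need to compare the stored inputs as well.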

Create/name a CLI run:

r = my_cluster.run(["python test_bert.py --gpus 4 --model distilbert"], name="test_distilbert_ddp")

Inspiration: this MLFlow example

We can also support event (failure or completion) notifications through knockknock or PagerDuty!

Cc @caroline

From SyncLinear.com | KIT-67

Some more context based on the above:

3 main ways of invoking/creating a run:

Function call
CLI command
Context manager

Logs

The .rh directory on a cluster will contain logs for each run, with each run having its own unique key (the user should be able to override that key with a name of their own).
The Run object should hold a folder, which contains references to the log files generated on the cluster for that run.
By default those files should live on the cluster. If the user wants to stream or copy them from the cluster to the local env, they can do so manually: cluster_folder.to("here", path=local_path)
If the user wants to save run files locally, by default save them to the rh folder of the project's main working directory.
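A rough sketch of the folder layout described in the Logs notes, using only the standard library. Everything here is an assumption for illustration: `create_run_folder` is a hypothetical helper, temporary directories stand in for the cluster's .rh directory and the project's rh folder, and `shutil.copytree` stands in for the manual `cluster_folder.to("here", path=local_path)` copy.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def create_run_folder(rh_dir, name=None):
    """Create a per-run folder under `rh_dir`.

    The folder is keyed by `name` when the user supplies one, otherwise by
    a generated unique key. Hypothetical helper, not part of Runhouse.
    """
    run_key = name if name is not None else uuid.uuid4().hex
    folder = Path(rh_dir) / run_key
    folder.mkdir(parents=True)  # raises FileExistsError if the key is taken
    return folder


cluster_rh = Path(tempfile.mkdtemp()) / ".rh"  # stand-in for the cluster's .rh
local_rh = Path(tempfile.mkdtemp()) / "rh"     # stand-in for the project's rh folder

auto_run = create_run_folder(cluster_rh)  # generated key
named_run = create_run_folder(cluster_rh, name="test_distilbert_ddp")
(named_run / "stdout.log").write_text("epoch 1 done\n")

# Logs stay on the "cluster" until the user explicitly copies them locally.
local_copy = Path(shutil.copytree(named_run, local_rh / named_run.name))
```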

Context Manager

Should capture stdout and stderr from the local execution.
Every function called in the context manager should get its own run key, which corresponds to the log file saved on the cluster.
We should also trace each of the Runhouse objects used or called within the context manager (useful later on when tracing the usage of various functions and artifacts).
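The first point above (capturing local stdout and stderr) could look roughly like the following. This is a minimal sketch under stated assumptions: the `run` function here is a hypothetical stand-in for `rh.run`, the `stdout.log` and `stderr.log` file names are invented for the example, and per-function run keys and object tracing are not covered.

```python
import contextlib
import sys
import tempfile
from pathlib import Path


@contextlib.contextmanager
def run(name, rh_dir):
    """Capture stdout and stderr of the enclosed block into a run folder.

    Hypothetical stand-in for rh.run; yields the folder holding the logs.
    """
    folder = Path(rh_dir) / name
    folder.mkdir(parents=True, exist_ok=True)
    with (folder / "stdout.log").open("w") as out, \
            (folder / "stderr.log").open("w") as err:
        # Redirect both streams for the duration of the block only.
        with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
            yield folder


rh_dir = tempfile.mkdtemp()  # stand-in for the local rh directory
with run("my_run", rh_dir) as r:
    print("loading data")
    print("warning: slow path", file=sys.stderr)

captured_out = (r / "stdout.log").read_text()
captured_err = (r / "stderr.log").read_text()
```

Note that `contextlib.redirect_stdout` only swaps `sys.stdout` at the Python level, so output written directly to the file descriptor by subprocesses or C extensions would not be captured by this approach.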

Saving inputs & outputs for a Run

Cloudpickle the inputs and outputs rather than trying to JSON-serialize them into a file.
Inputs (args + kwargs) should be cloudpickled separately from the output.
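One way the two points above might be realized is sketched below. The `call_and_record` helper and the `inputs.pkl` / `output.pkl` file names are assumptions for illustration only. The sketch uses `cloudpickle.dump` when the package is installed and falls back to stdlib `pickle.dump` otherwise, so it runs either way. Files written by cloudpickle can be read back with stdlib `pickle.load`.

```python
import pickle
import tempfile
from pathlib import Path

try:
    # cloudpickle can serialize lambdas and locally defined functions.
    from cloudpickle import dump as pickle_dump
except ImportError:
    # Fallback so the sketch still runs without cloudpickle installed.
    from pickle import dump as pickle_dump


def call_and_record(fn, run_dir, *args, **kwargs):
    """Call `fn` and store its inputs and output in separate files.

    Hypothetical helper; file names are assumptions.
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    # Inputs are written before the call so they survive a failed run.
    with (run_dir / "inputs.pkl").open("wb") as f:
        pickle_dump({"args": args, "kwargs": kwargs}, f)
    result = fn(*args, **kwargs)
    with (run_dir / "output.pkl").open("wb") as f:
        pickle_dump(result, f)
    return result


def tokenize(text, lower=False):
    return (text.lower() if lower else text).split()


run_dir = Path(tempfile.mkdtemp()) / "my_run"
res = call_and_record(tokenize, run_dir, "Great Food", lower=True)

with (run_dir / "inputs.pkl").open("rb") as f:
    saved_inputs = pickle.load(f)
with (run_dir / "output.pkl").open("rb") as f:
    saved_output = pickle.load(f)
```

Keeping the inputs in their own file means a run that fails partway still leaves a record of what it was called with, and the inputs can be inspected without loading a potentially large output.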

[KIT-67] Runs #45