run-house / runhouse

Dispatch and distribute your ML training to "serverless" clusters in Python, like PyTorch for ML infra. Iterable, debuggable, multi-cloud/on-prem, identical across research and production.
https://run.house
Apache License 2.0
981 stars 37 forks

[KIT-67] Runs #45

Closed dongreenberg closed 1 year ago

dongreenberg commented 1 year ago

Basic API ideas (WIP):

Create Run object (captures logs, inputs, outputs, other artifacts read or written within call, who ran, where):

res = fn(**kwargs, name="my_run")
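To make the idea concrete, here is a minimal local sketch of what "calling a function as a run" could record. It is not the Runhouse API: `call_as_run`, the `run.json` / `stdout.log` file names, and the metadata fields are all hypothetical, and it only captures printed output, keyword inputs, the return value, the user and the host.

```python
import io
import json
import os
import platform
import tempfile
import time
from contextlib import redirect_stdout
from pathlib import Path


def call_as_run(fn, name, runs_dir, **kwargs):
    """Hypothetical helper: call fn(**kwargs) and record the call under runs_dir/name."""
    run_dir = Path(runs_dir) / name
    run_dir.mkdir(parents=True, exist_ok=True)

    captured = io.StringIO()
    started_at = time.time()
    with redirect_stdout(captured):  # capture anything the function prints
        result = fn(**kwargs)

    (run_dir / "stdout.log").write_text(captured.getvalue())
    metadata = {
        "name": name,
        "fn": fn.__name__,
        "inputs": kwargs,
        "output": result,
        "user": os.environ.get("USER", "unknown"),  # who ran
        "host": platform.node(),                    # where
        "started_at": started_at,
        "duration_s": time.time() - started_at,
    }
    # default=repr keeps the dump from failing on values JSON cannot encode
    (run_dir / "run.json").write_text(json.dumps(metadata, default=repr, indent=2))
    return result


def add(a, b):
    print(f"adding {a} and {b}")
    return a + b


runs_dir = tempfile.mkdtemp()  # stand-in for the local rh directory
res = call_as_run(add, name="my_run", runs_dir=runs_dir, a=1, b=2)
record = json.loads((Path(runs_dir) / "my_run" / "run.json").read_text())
```

The run ends up as a plain folder, which is what makes the "send it elsewhere" step a simple copy.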

A run is a folder (created inside local rh directory by default), and can be sent elsewhere to persist logs, results, artifact info, etc.:

rh.run(name="my_run").to("s3", path="runhouse/nlp_team/bert_ft/results")

Ideally, we can have a "default log store" setting in the user config so the logs from their runs can be sent to the same place by default when they save, rather than having to send each run one by one.

This could also be how users configure artifacts and logs to flow to an existing MLflow store, or to W&B, Grafana, Datadog, etc.
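A rough sketch of how a "default log store" fallback might behave, assuming the run is a local folder. Everything here is hypothetical: `persist_run`, the `default_log_store` config key, and the use of a local directory as a stand-in for S3 or another backend.

```python
import shutil
import tempfile
from pathlib import Path


def persist_run(run_dir, dest=None, config=None):
    """Hypothetical helper: copy a run folder to dest, or to the configured default store."""
    config = config or {}
    dest = dest or config.get("default_log_store")
    if dest is None:
        raise ValueError("no destination given and no default_log_store configured")
    target = Path(dest) / Path(run_dir).name
    # A local copy stands in for an upload to S3, MLflow, W&B, etc.
    shutil.copytree(run_dir, target, dirs_exist_ok=True)
    return target


# Build a tiny run folder to persist.
run_dir = Path(tempfile.mkdtemp()) / "my_run"
run_dir.mkdir()
(run_dir / "run.log").write_text("epoch 1 done\n")

# With a default store configured, no per-run destination is needed.
store = tempfile.mkdtemp()
config = {"default_log_store": store}
target = persist_run(run_dir, config=config)
```

An explicit `dest` still wins over the config, so one-off runs can be sent somewhere else.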

Save the run to local or RNS (not all runs need to be saved)

rh.run(name="my_run").save()

Creates a run object by tracing the activity within the block - no inputs and outputs, but captures logs (perhaps several logfiles for different calls) and artifacts used:

with rh.run(name="my_run") as r:
    ...  # calls made inside the block are traced into the run
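One way the log-capturing part of the context manager could work, sketched with the standard `logging` module only. `traced_run` and the `run.log` file name are hypothetical, and this does not attempt the artifact tracking described above; it just routes log records emitted inside the block into the run folder.

```python
import logging
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def traced_run(name, runs_dir):
    """Hypothetical sketch: write every log record emitted inside the block to runs_dir/name/run.log."""
    run_dir = Path(runs_dir) / name
    run_dir.mkdir(parents=True, exist_ok=True)
    handler = logging.FileHandler(run_dir / "run.log")
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    try:
        yield run_dir
    finally:
        # Detach the handler so later logging is not attributed to this run.
        root.removeHandler(handler)
        handler.close()


logging.getLogger().setLevel(logging.INFO)
runs_dir = tempfile.mkdtemp()

with traced_run("my_run", runs_dir) as r:
    logging.getLogger("preproc").info("tokenizing reviews")

logging.getLogger("preproc").info("outside the run")
log_text = (r / "run.log").read_text()
```

Several logfiles per run (one per call) would need a handler per call rather than one on the root logger.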

Big feature, essentially the same as auto-caching in orchestrators - check if this run was already completed, and load results if so, otherwise run:

res = fn.get_or_run(name="yelp_review_preproc_test")
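The caching behaviour can be sketched as a lookup keyed on the run name. This is a simplified assumption, not the proposed implementation: `get_or_run` here is a free function, results are pickled to a hypothetical `result.pkl`, and it does not check whether the inputs changed between calls.

```python
import pickle
import tempfile
from pathlib import Path


def get_or_run(fn, name, runs_dir, **kwargs):
    """Hypothetical sketch: load the saved result for `name` if present, otherwise run and save."""
    result_path = Path(runs_dir) / name / "result.pkl"
    if result_path.exists():
        with result_path.open("rb") as f:
            return pickle.load(f)  # run already completed: reuse its result

    result = fn(**kwargs)
    result_path.parent.mkdir(parents=True, exist_ok=True)
    with result_path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def preprocess(n):
    calls.append(n)  # track how many times the function really executes
    return [i * i for i in range(n)]


runs_dir = tempfile.mkdtemp()
first = get_or_run(preprocess, "yelp_review_preproc_test", runs_dir, n=4)
second = get_or_run(preprocess, "yelp_review_preproc_test", runs_dir, n=4)
```

The second call returns the stored result without executing `preprocess` again. A fuller version would probably also compare inputs or code versions before trusting the cache.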

Create/name a CLI run:

r = my_cluster.run(["python test_bert.py --gpus 4 --model distilbert"], name="test_distilbert_ddp")
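For the CLI case, a local stand-in using `subprocess` shows the shape of what a named run could keep. `run_commands` and `output.log` are hypothetical names, and the commands execute on the local machine here rather than on a cluster.

```python
import subprocess
import tempfile
from pathlib import Path


def run_commands(commands, name, runs_dir):
    """Hypothetical sketch: run shell commands locally and keep their output under runs_dir/name."""
    run_dir = Path(runs_dir) / name
    run_dir.mkdir(parents=True, exist_ok=True)
    return_codes = []
    with (run_dir / "output.log").open("w") as log:
        for cmd in commands:
            # stdout and stderr both go to the run's logfile
            proc = subprocess.run(cmd, shell=True, stdout=log, stderr=subprocess.STDOUT)
            return_codes.append(proc.returncode)
    return run_dir, return_codes


runs_dir = tempfile.mkdtemp()
r, codes = run_commands(["echo hello"], name="test_distilbert_ddp", runs_dir=runs_dir)
cli_log = (r / "output.log").read_text()
```

Keeping the return codes next to the log gives the run a completion status without parsing the output.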

Inspiration: this MLflow example

We can also support event (failure or completion) notifications through knockknock or PagerDuty!
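A sketch of the notification hook, independent of any particular service. `notify_on_finish` and the `notify(status, run_name, detail)` callback signature are assumptions; a real integration would replace the callback with a call into knockknock, PagerDuty, or similar.

```python
import functools


def notify_on_finish(fn, notify):
    """Hypothetical sketch: call notify(status, run_name, detail) when fn completes or fails."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            notify("failed", fn.__name__, repr(exc))
            raise  # the failure still propagates to the caller
        notify("completed", fn.__name__, None)
        return result
    return wrapper


events = []


def record_event(status, run_name, detail):
    # Stand-in for sending a message to an external service.
    events.append((status, run_name))


def train():
    return "ok"


def broken_train():
    raise RuntimeError("out of memory")


train_result = notify_on_finish(train, record_event)()
try:
    notify_on_finish(broken_train, record_event)()
except RuntimeError:
    pass
```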

Cc @caroline

From SyncLinear.com | KIT-67

dongreenberg commented 1 year ago

Interesting approach using Ray and OpenTelemetry: https://composable-logs.github.io/composable-logs/home/

jlewitt1 commented 1 year ago

Some more context based on the above:

3 main ways of invoking/creating a run:

  1. Function call
  2. CLI command
  3. Context manager

Logs

Context Manager

Saving inputs & outputs for a Run