iterative / dvclive

📈 Log and track ML metrics, parameters, models with Git and/or DVC
https://dvc.org/doc/dvclive
Apache License 2.0

Support S3 URI as Live.dir to store DVCLive data in cloud storage #676

Open aschuh-hf opened 1 year ago

aschuh-hf commented 1 year ago

I am using Ray Jobs to execute training on EC2 worker nodes forming an auto-scaling Ray Cluster. I would like to save DVCLive output in a persistent remote storage location (AWS S3). If the Ray Job ran dvc exp run, the output could be saved to the Git repo via dvc exp push at the end of training. But if a failure occurs or training is interrupted, the output up to that point would exist only on the worker node, which will get terminated.

Further, in my setup, I am not actually running dvc exp run as the Ray Job command. Instead, I am running dvc exp run --queue && dvc queue start (using the CLI or the VS Code Extension) on my local machine. The DVC queue task is a custom script which submits the training job and syncs the intermediate outputs from remote storage in S3 to the local machine at regular intervals until the job is in a terminal state. The advantage of doing this is that I can use local dvc exp commands as if the training tasks were running locally, e.g., to follow progress, compare live plots, etc. My custom script thus takes care of downloading the training job output from S3 to the local machine. PyTorch Lightning TensorBoard logs can be written directly to S3, as supported by lightning.pytorch.loggers.tensorboard.TensorBoardLogger, where save_dir can be a URI, and Ray Tune / Train state is uploaded by Ray when a storage_path URI is specified.
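
For illustration, a rough sketch of what such a queue task can look like, using Ray's job submission client (the entrypoint, address, and sync_job_output placeholder are hypothetical; the actual sync logic is the shell function shown further below):

import time

from ray.job_submission import JobStatus, JobSubmissionClient


def sync_job_output() -> None:
    """Placeholder for the `aws s3 sync` calls shown in a later comment."""


def run_and_sync(entrypoint: str, address: str = "http://127.0.0.1:8265") -> None:
    # Submit the training job to the Ray cluster and poll it until it reaches a
    # terminal state, syncing intermediate outputs from S3 on every iteration.
    client = JobSubmissionClient(address)
    job_id = client.submit_job(entrypoint=entrypoint)
    terminal = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}
    while client.get_job_status(job_id) not in terminal:
        sync_job_output()
        time.sleep(60)
    sync_job_output()  # final sync so DVC sees the completed outputs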

With Ray Train <2.5, I was using RunConfig(local_dir=...) alongside SyncConfig(upload_dir=...) to have Ray upload the DVCLive output located in local_dir to the remote storage. This approach has been deprecated since Ray 2.5. When the job runs on the Ray cluster head node, the dvclive output folder is still synced correctly, but when run on worker nodes it is not (even though I verified that it exists on the local EC2 instance drive).

I could likely fix this by adjusting my Ray Train script to either save the DVCLive output to a folder inside the Ray session.get_trial_dir() or take care of uploading the DVCLive output myself.
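
For reference, the first option could look roughly like this (an untested sketch assuming the Ray AIR session API; make_dvclive_logger is a name I made up):

import os

from ray.air import session

from dvclive.lightning import DVCLiveLogger


def make_dvclive_logger() -> DVCLiveLogger:
    # Must run inside a Ray Train worker, where a session is active; the trial
    # directory is synced by Ray to the configured storage_path.
    trial_dir = session.get_trial_dir()
    return DVCLiveLogger(dir=os.path.join(trial_dir, "dvclive"), save_dvc_exp=False, dvcyaml=False)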

However, ideally, DVCLive itself (or at least dvclive.lightning.DVCLiveLogger) would support URIs (S3 in my case) as the output dir for logged data, just as PyTorch Lightning's TensorBoardLogger does (thanks to the URI support of torch.utils.tensorboard.SummaryWriter).

pl.Trainer(
  logger=DVCLiveLogger(dir="s3://bucket/prefix/dvclive", save_dvc_exp=False, dvcyaml=False),
  ...
)
aschuh-hf commented 1 year ago

I worked around this with a small DVCLiveLogger wrapper.

from deepali.core.storage import StorageObject

from dvclive.lightning import DVCLiveLogger as _DVCLiveLogger
from dvclive.lightning import ModelCheckpoint, rank_zero_only

class DVCLiveLogger(_DVCLiveLogger):
    """DVCLiveLogger that mirrors its local output directory to remote storage."""

    def __init__(self, dir: str, **kwargs) -> None:
        # Resolve the (possibly remote) dir to a local working path managed by StorageObject.
        storage = StorageObject.from_path(dir)
        super().__init__(dir=str(storage.path), **kwargs)
        self.storage = storage

    def after_save_checkpoint(self, checkpoint_callback: ModelCheckpoint) -> None:
        # Push intermediate DVCLive output whenever a checkpoint is saved.
        super().after_save_checkpoint(checkpoint_callback)
        self.storage.push(force=True)

    @rank_zero_only
    def finalize(self, status: str) -> None:
        # Push the final DVCLive output once training finishes.
        super().finalize(status)
        self.storage.push(force=True)

Note that I'm using a little helper from my own open source lib here for syncing a local temp folder with S3. You could probably use PyTorch Lightning's fsspec abstractions instead.
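
For example, an fsspec-based variant of the same wrapper might look roughly like this (an untested sketch; it assumes s3fs is installed and takes the upload prefix as an explicit s3_uri argument):

import fsspec

from dvclive.lightning import DVCLiveLogger as _DVCLiveLogger


class S3SyncDVCLiveLogger(_DVCLiveLogger):
    """Sketch: log locally, then mirror the DVCLive directory to S3 via fsspec."""

    def __init__(self, dir: str, s3_uri: str, **kwargs) -> None:
        super().__init__(dir=dir, **kwargs)
        self._s3_uri = s3_uri
        self._fs = fsspec.filesystem("s3")

    def _push(self) -> None:
        # Recursively upload the local dvclive folder to the S3 prefix.
        self._fs.put(self.experiment.dir, self._s3_uri, recursive=True)

    def after_save_checkpoint(self, checkpoint_callback) -> None:
        super().after_save_checkpoint(checkpoint_callback)
        self._push()

    def finalize(self, status: str) -> None:
        super().finalize(status)
        self._push()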

dberenbaum commented 1 year ago

@aschuh-hf I thought there was a GH issue for this already, but I can't seem to find it. Do you plan to change the path for each experiment you run?

aschuh-hf commented 1 year ago

Yes. In this case, where experiments are executed as batch jobs, potentially in parallel, in a cluster environment such as Ray, with results stored in cloud storage such as S3, I have to use a separate output path for each experiment. On the local machine, DVC creates and manages the temp folders for me.

My train stage script uses a timestamp and the DVC_EXP_NAME and DVC_EXP_BASELINE_REV environment variables to derive an experiment run suffix for output URIs (Ray Train storage, DVCLive output dir, and/or TensorBoard log dir).

version="$(date -u +%Y-%m-%d)-${DVC_EXP_BASELINE_REV:0:8}-${DVC_EXP_NAME}"

s3_dvc_dir="s3://${s3_bucket}/${s3_prefix}dvc/${version}/"
s3_log_dir="s3://${s3_bucket}/${s3_prefix}logs/${version}/"
s3_out_dir="s3://${s3_bucket}/${s3_prefix}state/"  # excl. version subdir because Ray Train appends it

The script then uses aws s3 sync to download the contents of s3_dvc_dir and s3_out_dir to local paths without the version suffix, matching the output paths specified for metrics, plots, and outs of my train stage in dvc.yaml.

The following is run in a loop as long as the Ray job is not in a terminal state:

sync_job_output()
{
    if aws s3 ls "${s3_dvc_dir}" > /dev/null; then
        aws s3 sync --only-show-errors "${s3_dvc_dir}" "${dvc_dir}/"
        [ $verbose -lt 2 ] || info "Synced DVCLive metrics and plots"
    fi
    if aws s3 ls "${s3_log_dir}" > /dev/null; then
        aws s3 sync --only-show-errors "${s3_log_dir}" "${log_dir}/"
        [ $verbose -lt 2 ] || info "Synced TensorBoard log files"
    fi
    if aws s3 ls "${s3_out_dir}${version}/" > /dev/null; then
        aws s3 sync --only-show-errors "${s3_out_dir}${version}/" "${out_dir}/"
        [ $verbose -lt 2 ] || info "Synced Ray checkpoints and state"
    fi
}

dvc.yaml:

stages:
  train:
    cmd: bash train.sh
    params:
    - data
    - loss
    - model
    - train
    deps:
    - train.sh
    - ...
    outs:
    - data/train/state
    metrics:
    - data/train/metrics.json
    plots:
    - data/train/plots/metrics:
        x: step
dberenbaum commented 1 year ago

Nice @aschuh-hf! I was looking a bit into #237, and this looks like the same idea that's needed there. It's also related to the discussion in #638. It would be great to converge on a smoother experience here that doesn't require adding a helper to sync to and from cloud storage to track jobs run on Ray or in other remote/distributed compute scenarios.

aschuh-hf commented 1 year ago

An integration with Ray Tune (and Ray Train) sounds fantastic.

When using Ray Tune, I currently disable DVCLive because Tune already provides me with an experiment analysis object which basically lets me compare trials, e.g., as a Pandas dataframe. But it would indeed be better and more convenient if those trial results were nicely integrated with individual DVC experiment runs instead (across dvc exp CLI output, VS Code tables, and DVC Studio).
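
For context, a bare-bones sketch of the kind of comparison I mean, using Ray Tune's result grid (the "loss" metric name is just an example):

from ray.tune import ResultGrid


def summarize(results: ResultGrid):
    # One row per trial, with reported metrics and config columns.
    df = results.get_dataframe()
    return df.sort_values("loss").head()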

The reason I chose not to write the DVCLive output to the trial working directory (and thus have to upload the artifacts to remote storage manually in the DVCLiveLogger) is that the trial folder names are rather cryptic, and I would need to use Ray to parse the experiment JSON files to figure out under which S3 prefix the results are stored, even though with Ray Train I only have one trial folder. Choosing my own upload path made it easier to download the data using aws s3 sync in my local script in order to place it where DVC expects to find it (i.e., the metrics, plots, and outs paths configured in dvc.yaml).

doesn't require adding a helper to sync to and from cloud storage

Not having to use my own helper would be great!

Another aspect is the handling of AWS credentials for long-running training jobs in this setup. I haven't looked into how to enable the script itself to update credentials before they expire (so if DVC can handle that for me, even better), but basically I need the local train stage script to wait until I manually update those (we are using okta-awscli) before it can get the latest results from cloud storage. Especially when the Ray Train job finishes, the script still needs to wait until it can get the final results before exiting, so that DVC can create a final experiment commit rev.
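
Illustratively, the waiting could be something like the following sketch (hypothetical helper; error codes may vary):

import time

import boto3
from botocore.exceptions import ClientError


def wait_for_s3_access(bucket: str, prefix: str, poll_seconds: int = 60) -> None:
    # Keep probing the bucket until the manually refreshed (okta-awscli) credentials
    # work again; a fresh client is created each time to pick up new credentials.
    while True:
        try:
            boto3.client("s3").list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
            return
        except ClientError as exc:
            if exc.response["Error"]["Code"] in ("ExpiredToken", "AccessDenied"):
                time.sleep(poll_seconds)
            else:
                raise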

A smoother integration might be to execute dvc exp run as the Ray Job command instead and save final results with dvc exp push. That way one doesn't need to worry about cloud storage credentials or a long-running helper script on the local machine. The DVC CLI (dvc exp show) and VS Code Extension could possibly also get the state of a running experiment directly from cloud storage, to display the data of a not-yet-finished run alongside other experiments. That sounds like a nicer solution than my workaround.

In the case of DVC Studio, I imagine the train job would submit the data directly to the running Studio server. It's just that the DVC CLI and VS Code Extension would have to actively obtain this data differently, I suppose.

Since I first set this up, we also changed our Ray cluster IAM roles to allow me access to the GitHub repository. Do you think it would be better to exchange intermediate data with the DVC CLI / VS Code Extension via the GitHub repository? I'm just not sure whether it would be good to push so many intermediate experiment results as commits to the remote repository (unless this is actually not an issue at all, or the previous commit would be replaced, in the end leaving just one commit for each run).

dberenbaum commented 12 months ago

Sorry for the long delay here @aschuh-hf! We are still thinking about this one but got caught up with other priorities.

When using Ray Tune, I currently disable DVCLive because Tune already provides me with an experiment analysis object which basically lets me compare trials, e.g., as a Pandas dataframe. But it would indeed be better and more convenient if those trial results were nicely integrated with individual DVC experiment runs instead (across dvc exp CLI output, VS Code tables, and DVC Studio).

Makes sense. I think it may be better for us to start with Ray Train here since it's simpler than managing results from many experiments at once.

Have you considered using EFS to share the repo across the cluster so you can actually write dvclive output there?

Otherwise, I think it should be possible to support S3 paths in DVCLive, and we could document how to have your stage download those to the repo when training finishes. It doesn't feel like the cleanest solution to me, but it could at least unblock you.
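
For instance, that download step could be as simple as the following sketch (hypothetical helper using fsspec/s3fs; paths are examples):

import fsspec


def download_dvclive(s3_uri: str, local_dir: str = "dvclive") -> None:
    # Pull the remote DVCLive folder back into the path the repo's dvc.yaml expects.
    fs = fsspec.filesystem("s3")
    fs.get(s3_uri.rstrip("/") + "/", local_dir, recursive=True)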

We could also provide realtime updates via Studio. Not sure if that interests you, but it could be helpful for the general public not to have to write a helper to intermittently download results.

cc @mnrozhkov