iterative / studio-support

❓ DVC Studio Issues, Question, and Discussions
https://studio.iterative.ai

Can't see any live updates #73

Closed haimat closed 1 year ago

haimat commented 1 year ago

Following the official Studio docs on live updates, I have added two environment variables to our GitHub CI workflow:

STUDIO_TOKEN: ${{ secrets.DVC_STUDIO_TOKEN_MFB }}
STUDIO_REPO_URL: git@github.com:<company>/<repo>.git

However, I don't see any live updates in Studio. I do see the final results of the experiment, but not anything while it is running. The code via DVCLive is quite simple:

    metrics = _get_yolo_metrics(trainer.metrics)
    for metric_name, value in metrics.items():
        DVC.log_metric(metric_name, value)
    DVC.next_step()

This function is called at every training epoch. What am I missing?

shcheklein commented 1 year ago

@haimat could you try to add DVC.next_step() please?

tapadipti commented 1 year ago

@haimat Are you using log_metric() and next_step() from dvclive.Live (or trying to use it from dvc)?

haimat commented 1 year ago

@haimat could you try to add DVC.next_step() please?

@shcheklein Please have a look at my code example above - we are calling DVC.next_step() already.

haimat commented 1 year ago

@tapadipti We are calling DVC like this:


with Live(report="md") as DVC:
    ...
    metrics = _get_yolo_metrics(trainer.metrics)
    for metric_name, value in metrics.items():
        DVC.log_metric(metric_name, value)
    DVC.next_step()
    ...  

daavoo commented 1 year ago

Hi @haimat, could you please also set the following env vars and share the logs of that step in your GitHub CI workflow? (Don't hesitate to remove any confidential info before sharing them)

DVCLIVE_LOGLEVEL: DEBUG
DVC_STUDIO_CLIENT_LOGLEVEL: DEBUG
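Before reading the debug output, it may be worth confirming that these variables are actually visible to the process that runs training. The following is a minimal sketch, not part of DVCLive: the helper name missing_vars is hypothetical, and the variable names are simply the ones mentioned in this thread.

```python
import os
from typing import Iterable, List, Mapping

# Variable names as they appear in this thread.
STUDIO_VARS = ("STUDIO_TOKEN", "STUDIO_REPO_URL")
DEBUG_VARS = ("DVCLIVE_LOGLEVEL", "DVC_STUDIO_CLIENT_LOGLEVEL")


def missing_vars(env: Mapping[str, str],
                 names: Iterable[str] = STUDIO_VARS) -> List[str]:
    """Return the names that are unset or empty in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # Only names are printed, never values, so the token stays out of the logs.
    print("missing Studio vars:", missing_vars(os.environ))
    print("missing debug vars:", missing_vars(os.environ, DEBUG_VARS))
```

Running this as the first command of the training step would show whether the secrets reach the job environment at all.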

daavoo commented 1 year ago

Taking a look at the snippet you shared, which framework are you using to train?

Forgive me if it's obvious, but is there some epoch iteration in the missing parts of the snippet? Where are trainer.metrics coming from?

with Live(report="md") as DVC:
    # something like this was missing in the snippets above
    trainer.fit(epochs=3)
    ...
    metrics = _get_yolo_metrics(trainer.metrics)
    for metric_name, value in metrics.items():
        DVC.log_metric(metric_name, value)
    DVC.next_step()
    ...  

haimat commented 1 year ago

@daavoo Hello, we are using the YOLOv8 framework from Ultralytics. Here is some more code for you to understand our logic:

import yaml
from dvclive import Live
from ultralytics import YOLO

def main():
    DVC.log_params(params["training"])
    model = YOLO(params["training"]["model"])
    model.add_callback("on_fit_epoch_end", yolo_cb_fit_epoch_end)
    model.train(...)

def yolo_cb_fit_epoch_end(trainer):
    """This function is called at the end of every training epoch from model.train() above.
    The DVCLive metrics files are being correctly updated here."""

    metrics = _get_yolo_metrics(trainer.metrics)  # returns a simple dict of key/value pairs
    for metric_name, value in metrics.items():
        DVC.log_metric(metric_name, value)
    DVC.next_step()

if __name__ == "__main__":
    # `stream` is an open handle to the params file (opening it is not shown here)
    params = yaml.safe_load(stream)
    with Live(report="md") as DVC:
        main()

daavoo commented 1 year ago

Here is some more code for you to understand our logic:

Thanks! I get it now. Everything looks alright from the code. Could you please try https://github.com/iterative/studio-support/issues/73#issuecomment-1437352325 to further debug?

haimat commented 1 year ago

@daavoo So here is the output of the relevant lines from the Github action, after adding the DEBUG env variables:

...
2023-02-22T17:32:42.9908487Z 6 files added and 2843 files fetched
2023-02-22T17:32:43.5921564Z Verifying data sources in stage: 'data/ski-defects.dvc'
2023-02-22T17:32:44.9202506Z 
2023-02-22T17:32:44.9619444Z Running stage 'prepare':
2023-02-22T17:32:44.9620869Z > python src/dvc/prepare-data.py
2023-02-22T17:32:48.2030969Z Updating lock file 'dvc.lock'
2023-02-22T17:32:48.2100920Z 
2023-02-22T17:32:48.2221945Z Running stage 'training':
2023-02-22T17:32:48.2223398Z > python src/dvc/yolov8-trainer.py
2023-02-22T17:32:50.3784574Z DEBUG:dvclive:self._report_mode='md'
2023-02-22T17:32:50.3786300Z DEBUG:dvclive:`studio` report can't be used without a DVC Repo.
2023-02-22T17:32:50.3808412Z DEBUG:dvclive:Logged {'model': 'yolov8l.pt', 'image_size': 1280, 'epochs': 5, 'batch': 16, 'device': '0', 'random_seed': True} parameters to dvclive/params.yaml
...
2023-02-22T17:34:25.0559793Z DEBUG:dvclive:Logged map95: 0.41883
2023-02-22T17:34:25.0560806Z DEBUG:dvclive:Logged map50: 0.64363
2023-02-22T17:34:25.0561779Z DEBUG:dvclive:Logged precision: 0.70694
2023-02-22T17:34:25.0563123Z DEBUG:dvclive:Logged recall: 0.61965
2023-02-22T17:34:25.4487820Z DEBUG:dvclive:Step: 1
...
2023-02-22T17:35:38.7815897Z DEBUG:dvclive:Logged map95: 0.49632
2023-02-22T17:35:38.7818905Z DEBUG:dvclive:Logged map50: 0.73107
2023-02-22T17:35:38.7819920Z DEBUG:dvclive:Logged precision: 0.77785
2023-02-22T17:35:38.7823603Z DEBUG:dvclive:Logged recall: 0.64373
2023-02-22T17:35:39.1987393Z DEBUG:dvclive:Step: 2
...
2023-02-22T17:36:52.5310956Z DEBUG:dvclive:Logged map95: 0.54508
2023-02-22T17:36:52.5313356Z DEBUG:dvclive:Logged map50: 0.79815
2023-02-22T17:36:52.5316947Z DEBUG:dvclive:Logged precision: 0.77248
2023-02-22T17:36:52.5319392Z DEBUG:dvclive:Logged recall: 0.71641
2023-02-22T17:36:52.9912285Z DEBUG:dvclive:Step: 3
...
2023-02-22T17:38:06.3842480Z DEBUG:dvclive:Logged map95: 0.56688
2023-02-22T17:38:06.3844398Z DEBUG:dvclive:Logged map50: 0.81524
2023-02-22T17:38:06.3848437Z DEBUG:dvclive:Logged precision: 0.79571
2023-02-22T17:38:06.3851546Z DEBUG:dvclive:Logged recall: 0.72783
2023-02-22T17:38:06.8256364Z DEBUG:dvclive:Step: 4
...
2023-02-22T17:39:22.6577834Z DEBUG:dvclive:Logged map95: 0.57941
2023-02-22T17:39:22.6580463Z DEBUG:dvclive:Logged map50: 0.82216
2023-02-22T17:39:22.6583565Z DEBUG:dvclive:Logged precision: 0.80248
2023-02-22T17:39:22.6588463Z DEBUG:dvclive:Logged recall: 0.74549
2023-02-22T17:39:23.3859307Z DEBUG:dvclive:Step: 5
...
2023-02-22T17:39:33.8801083Z DEBUG:dvclive:Logged map95: 0.5793409533456171
2023-02-22T17:39:33.8802401Z DEBUG:dvclive:Logged map50: 0.8219465738683116
2023-02-22T17:39:33.8803836Z DEBUG:dvclive:Logged precision: 0.8011659733161465
2023-02-22T17:39:33.8805301Z DEBUG:dvclive:Logged recall: 0.7454864820648035
2023-02-22T17:39:34.5532907Z DEBUG:dvclive:Step: 6
...
2023-02-22T17:39:51.2596494Z Updating lock file 'dvc.lock'
2023-02-22T17:39:51.3221732Z Use `dvc push` to send your updates to remote storage.
2023-02-22T17:39:52.2013867Z 23 files pushed

daavoo commented 1 year ago

Thanks for the logs @haimat . How did you install DVC in the GitHub workflow? The issue might be that DVCLive requires the DVC Python API which is only available when you install DVC via pip install.
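This hypothesis seems consistent with the DEBUG line near the top of the log above, which says that the `studio` report can't be used without a DVC Repo. As a rough sketch, assuming that message is the relevant signal (an assumption drawn from this thread, not a documented contract), a saved CI log could be scanned for it like this. The helper name find_studio_hints is hypothetical.

```python
from typing import Iterable, List

# Message text copied from the CI log in this thread. Treating it as the
# signal that live updates are disabled is an assumption.
HINT = "`studio` report can't be used without a DVC Repo."


def find_studio_hints(lines: Iterable[str]) -> List[str]:
    """Return every log line that contains the DVCLive hint message."""
    return [line.rstrip("\n") for line in lines if HINT in line]


if __name__ == "__main__":
    sample = [
        "2023-02-22T17:32:50.3784574Z DEBUG:dvclive:self._report_mode='md'\n",
        "2023-02-22T17:32:50.3786300Z DEBUG:dvclive:`studio` report can't be "
        "used without a DVC Repo.\n",
    ]
    for hit in find_studio_hints(sample):
        print(hit)
```

An empty result would suggest looking elsewhere, for example at the token or the repository URL.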

haimat commented 1 year ago

@daavoo We install DVC in the Github workflow like this:

      - uses: iterative/setup-dvc@v1
      - uses: iterative/setup-cml@v1

Isn't that the recommended way?

daavoo commented 1 year ago

Isn't that the recommended way?

I guess it depends on the needs. We have an open issue in DVCLive to remove the need for DVC to be installed as a Python library (https://github.com/iterative/dvclive/issues/397). Unfortunately, for now you need to install DVC via pip install dvc instead in order to use live metrics.
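One quick way to check this in a CI job is to ask the interpreter that runs training whether it can find the dvc package. The snippet below is a minimal sketch using only the standard library; the helper name is_importable is hypothetical, and a positive result only shows that the package is on the import path, not that live updates will work.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if this interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("dvc", "dvclive"):
        status = "importable" if is_importable(name) else "NOT importable"
        print(f"{name}: {status} from {sys.executable}")
```

If dvc is reported as not importable while the dvc command itself works, the CLI was probably installed outside the Python environment that the training script uses.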

haimat commented 1 year ago

@daavoo Thanks, I changed this now, so that DVC is being installed via pip install dvc, and now I can see the live progress of the experiment from GitHub in Studio. There is another issue now though: When I select that run from the list in Studio and click the "Plots" button, then I get the following info:

image

Or if I select both last entries:

image

What am I missing here?

haimat commented 1 year ago

@daavoo btw. I have all the TSV metrics files from DVCLive tracked in git, which is not being pushed from the CI runner on GitHub. I do, however, push all changes after the GitHub CI action back to the DVC remote. Might that be the problem? Do I need to track these DVCLive metrics files with DVC instead of git?

daavoo commented 1 year ago

I have all the TSV metrics files from DVCLive tracked in git, which is not being pushed from the CI runner on GitHub.

I think that might be the problem. How do your dvc.yaml and the step in GitHub workflows look?

Are you pushing any git changes in the CI?

You should also push the git changes in your workflow, not only for metrics but also for DVC metadata files like dvc.lock. One way to do it is by using https://cml.dev/doc/ref/pr

There is a full example project here https://github.com/iterative/example-get-started-experiments/ but a stripped-down version would look like this:

      - name: training
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          STUDIO_TOKEN: ${{ secrets.STUDIO_TOKEN }}
          STUDIO_REPO_URL: ${{ secrets.STUDIO_REPO_URL }}
        run: |
          dvc pull

          cml ci --fetch-depth 0

          dvc exp run

          dvc push 

          cml pr --squash --skip-ci .

haimat commented 1 year ago

@daavoo Thanks for your reply. Here is my dvc.yaml file:

stages:
  prepare:
    cmd: python src/dvc/prepare-data.py
    deps:
      - src/dvc/prepare-data.py
      - data/yolo-workspace-data-config.yaml
      - data/${dataset}
    params:
      - dataset
      - prepare.ratio
    outs:
      - data/_yolo_config.yaml
      - data/workspace
  training:
    cmd: python src/dvc/yolov8-trainer.py
    deps:
      - src/dvc/yolov8-trainer.py
      - data/_yolo_config.yaml
      - data/workspace
    params:
      - dataset
      - training.model
      - training.image_size
      - training.epochs
      - training.batch
      - training.device
      - training.random_seed
    outs:
      - training/${dataset}
    metrics:
      - evaluation/results.json

plots:
  - mAP:
      x: step
      y:
        dvclive/plots/metrics/map95.tsv: map95
        dvclive/plots/metrics/map50.tsv: map50
      title: Model Performance
      x_label: epoch
      y_label: value

And here is our relevant GitHub action:

          # Initialise workspace
          mv params-aime.yaml params.yaml
          DATASET_NAME=$(grep "dataset:" params.yaml | cut -d'"' -f 2)
          pip install -r src/dvc/requirements.txt

          # Setup and run DVC
          dvc remote add -d -f aime /dvcdata
          dvc config cache.type copy
          dvc pull --remote aime
          dvc repro --force
          dvc push --remote aime

          # Create plots in report
          grep -v ".png" dvclive/report.md >> report.md;
          dvc plots show --show-vega mAP > vega.json
          vl2png vega.json -s 1.5 > plot1.png
          echo '![](./plot1.png "Model Performance")' >> report.md

          # Commit report to Github via CML
          cml comment create report.md

We are not using a Github PR for training via CI - yet. Is there really a need for it? I am happy to trigger in the main branch - but I am also willing to change that if it makes sense. But we actually don't push anything back to Github, just to DVC, as you can see above.

So do I understand correctly that we would need to push the results and changes from the CI runner back to the GitHub repo too?

haimat commented 1 year ago

OK I fixed this by pushing the changes from CI back to Github 👍
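The thread does not show the exact change. As a hedged illustration of the idea only, the sketch below picks the git-tracked result files (dvc.lock and the dvclive/ directory mentioned above) out of `git status --porcelain` output, so a workflow can tell whether there is anything to commit and push. The helper name changed_result_files and the prefix list are assumptions for this example.

```python
from typing import Iterable, List

# Paths taken from this thread: dvc.lock and the DVCLive output directory.
RESULT_PREFIXES = ("dvc.lock", "dvclive/")


def changed_result_files(porcelain: str,
                         prefixes: Iterable[str] = RESULT_PREFIXES) -> List[str]:
    """Pick result files out of `git status --porcelain` (v1) output.

    Each line is two status characters, a space, then the path. Renames and
    quoted paths are not handled in this sketch.
    """
    wanted = tuple(prefixes)
    changed = []
    for line in porcelain.splitlines():
        path = line[3:]  # skip the two status characters and the space
        if path.startswith(wanted):
            changed.append(path)
    return changed


if __name__ == "__main__":
    sample = " M dvc.lock\n?? dvclive/plots/metrics/map95.tsv\n M README.md\n"
    print(changed_result_files(sample))
```

The returned paths would then be committed and pushed by the workflow, for example with cml pr as suggested earlier in the thread, alongside dvc push for the data itself.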