Closed haimat closed 1 year ago
@haimat could you try to add DVC.next_step() please?
@haimat Are you using log_metric() and next_step() from dvclive.Live (or trying to use it from dvc)?
@shcheklein Please have a look at my code example above - we are calling DVC.next_step() already.
@tapadipti We are calling DVC like this:
with Live(report="md") as DVC:
    ...
    metrics = _get_yolo_metrics(trainer.metrics)
    for metric_name, value in metrics.items():
        DVC.log_metric(metric_name, value)
    DVC.next_step()
    ...
Hi @haimat, could you please also set the following env vars and share the logs of that step in your GitHub CI workflow? (Don't hesitate to remove any confidential info before sharing them)
DVCLIVE_LOGLEVEL: DEBUG
DVC_STUDIO_CLIENT_LOGLEVEL: DEBUG
Taking a look at the snippet you shared, which framework are you using to train?
Forgive me if it's obvious, but is there some epoch iteration in the missing parts of the snippet? Where is trainer.metrics coming from?
with Live(report="md") as DVC:
    # something like this was missing in the snippets above
    trainer.fit(epochs=3)
    ...
    metrics = _get_yolo_metrics(trainer.metrics)
    for metric_name, value in metrics.items():
        DVC.log_metric(metric_name, value)
    DVC.next_step()
    ...
@daavoo Hello, we are using the YOLOv8 framework from Ultralytics. Here is some more code for you to understand our logic:
def main():
    DVC.log_params(params["training"])
    model = YOLO(params["training"]["model"])
    model.add_callback("on_fit_epoch_end", yolo_cb_fit_epoch_end)
    model.train(...)

def yolo_cb_fit_epoch_end(trainer):
    """This function is called at the end of every training epoch from model.train() above.
    The DVCLive metrics files are being correctly updated here."""
    metrics = _get_yolo_metrics(trainer.metrics)  # returns a simple dict of key/value pairs
    for metric_name, value in metrics.items():
        DVC.log_metric(metric_name, value)
    DVC.next_step()

if __name__ == "__main__":
    params = yaml.safe_load(stream)
    with Live(report="md") as DVC:
        main()
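The per-epoch pattern in the callback above (log every metric, then advance the step exactly once) can be isolated in a small sketch. `RecordingLive` is a hypothetical stand-in written for this illustration; it is not part of DVCLive and only mimics the two calls used in this thread. Starting the step counter at 0 is an assumption.

```python
from typing import Any, Dict, List


class RecordingLive:
    """Hypothetical stand-in for dvclive.Live that only records calls."""

    def __init__(self) -> None:
        self.step = 0  # assumption: counting starts at 0
        self.history: List[Dict[str, Any]] = []
        self._current: Dict[str, float] = {}

    def log_metric(self, name: str, value: float) -> None:
        # Remember the value for the step that is currently open.
        self._current[name] = value

    def next_step(self) -> None:
        # Close the current step and open the next one.
        self.history.append({"step": self.step, **self._current})
        self._current = {}
        self.step += 1


def on_epoch_end(live: RecordingLive, metrics: Dict[str, float]) -> None:
    """Log all metrics of one epoch, then advance the step once."""
    for name, value in metrics.items():
        live.log_metric(name, value)
    live.next_step()  # outside the loop: one step per epoch, not per metric


live = RecordingLive()
on_epoch_end(live, {"map95": 0.41883, "map50": 0.64363})
on_epoch_end(live, {"map95": 0.49632, "map50": 0.73107})
print(live.history)
```

Passing the logger in as an argument, instead of relying on a module-level `DVC` variable, is a design choice of this sketch and makes the callback easier to test in isolation.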
> Here is some more code for you to understand our logic:
Thanks! I get it now. Everything looks alright from the code. Could you please try https://github.com/iterative/studio-support/issues/73#issuecomment-1437352325 to further debug?
@daavoo So here is the output of the relevant lines from the GitHub action, after adding the DEBUG env variables:
...
2023-02-22T17:32:42.9908487Z 6 files added and 2843 files fetched
2023-02-22T17:32:43.5921564Z Verifying data sources in stage: 'data/ski-defects.dvc'
2023-02-22T17:32:44.9202506Z
2023-02-22T17:32:44.9619444Z Running stage 'prepare':
2023-02-22T17:32:44.9620869Z > python src/dvc/prepare-data.py
2023-02-22T17:32:48.2030969Z Updating lock file 'dvc.lock'
2023-02-22T17:32:48.2100920Z
2023-02-22T17:32:48.2221945Z Running stage 'training':
2023-02-22T17:32:48.2223398Z > python src/dvc/yolov8-trainer.py
2023-02-22T17:32:50.3784574Z DEBUG:dvclive:self._report_mode='md'
2023-02-22T17:32:50.3786300Z DEBUG:dvclive:`studio` report can't be used without a DVC Repo.
2023-02-22T17:32:50.3808412Z DEBUG:dvclive:Logged {'model': 'yolov8l.pt', 'image_size': 1280, 'epochs': 5, 'batch': 16, 'device': '0', 'random_seed': True} parameters to dvclive/params.yaml
...
2023-02-22T17:34:25.0559793Z DEBUG:dvclive:Logged map95: 0.41883
2023-02-22T17:34:25.0560806Z DEBUG:dvclive:Logged map50: 0.64363
2023-02-22T17:34:25.0561779Z DEBUG:dvclive:Logged precision: 0.70694
2023-02-22T17:34:25.0563123Z DEBUG:dvclive:Logged recall: 0.61965
2023-02-22T17:34:25.4487820Z DEBUG:dvclive:Step: 1
...
2023-02-22T17:35:38.7815897Z DEBUG:dvclive:Logged map95: 0.49632
2023-02-22T17:35:38.7818905Z DEBUG:dvclive:Logged map50: 0.73107
2023-02-22T17:35:38.7819920Z DEBUG:dvclive:Logged precision: 0.77785
2023-02-22T17:35:38.7823603Z DEBUG:dvclive:Logged recall: 0.64373
2023-02-22T17:35:39.1987393Z DEBUG:dvclive:Step: 2
...
2023-02-22T17:36:52.5310956Z DEBUG:dvclive:Logged map95: 0.54508
2023-02-22T17:36:52.5313356Z DEBUG:dvclive:Logged map50: 0.79815
2023-02-22T17:36:52.5316947Z DEBUG:dvclive:Logged precision: 0.77248
2023-02-22T17:36:52.5319392Z DEBUG:dvclive:Logged recall: 0.71641
2023-02-22T17:36:52.9912285Z DEBUG:dvclive:Step: 3
...
2023-02-22T17:38:06.3842480Z DEBUG:dvclive:Logged map95: 0.56688
2023-02-22T17:38:06.3844398Z DEBUG:dvclive:Logged map50: 0.81524
2023-02-22T17:38:06.3848437Z DEBUG:dvclive:Logged precision: 0.79571
2023-02-22T17:38:06.3851546Z DEBUG:dvclive:Logged recall: 0.72783
2023-02-22T17:38:06.8256364Z DEBUG:dvclive:Step: 4
...
2023-02-22T17:39:22.6577834Z DEBUG:dvclive:Logged map95: 0.57941
2023-02-22T17:39:22.6580463Z DEBUG:dvclive:Logged map50: 0.82216
2023-02-22T17:39:22.6583565Z DEBUG:dvclive:Logged precision: 0.80248
2023-02-22T17:39:22.6588463Z DEBUG:dvclive:Logged recall: 0.74549
2023-02-22T17:39:23.3859307Z DEBUG:dvclive:Step: 5
...
2023-02-22T17:39:33.8801083Z DEBUG:dvclive:Logged map95: 0.5793409533456171
2023-02-22T17:39:33.8802401Z DEBUG:dvclive:Logged map50: 0.8219465738683116
2023-02-22T17:39:33.8803836Z DEBUG:dvclive:Logged precision: 0.8011659733161465
2023-02-22T17:39:33.8805301Z DEBUG:dvclive:Logged recall: 0.7454864820648035
2023-02-22T17:39:34.5532907Z DEBUG:dvclive:Step: 6
...
2023-02-22T17:39:51.2596494Z Updating lock file 'dvc.lock'
2023-02-22T17:39:51.3221732Z Use `dvc push` to send your updates to remote storage.
2023-02-22T17:39:52.2013867Z 23 files pushed
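To check that every step in a log like the one above actually carries metrics, the DEBUG lines can be grouped by step. This is a minimal sketch that assumes the exact line shapes visible in this log ("Logged <name>: <value>" followed by "Step: <n>"); the function name and the regular expressions are made up for this illustration and are not part of DVCLive.

```python
import re
from typing import Dict, List

# Patterns match the DEBUG lines as they appear in the log above.
LOGGED = re.compile(r"DEBUG:dvclive:Logged (\w+): ([-+0-9.eE]+)$")
STEP = re.compile(r"DEBUG:dvclive:Step: (\d+)$")


def metrics_per_step(lines: List[str]) -> Dict[int, Dict[str, float]]:
    """Group 'Logged name: value' lines under the 'Step: N' line that follows them."""
    result: Dict[int, Dict[str, float]] = {}
    pending: Dict[str, float] = {}
    for raw in lines:
        line = raw.strip()
        logged = LOGGED.search(line)
        if logged:
            pending[logged.group(1)] = float(logged.group(2))
            continue
        step = STEP.search(line)
        if step:
            result[int(step.group(1))] = pending
            pending = {}
    return result


sample = [
    "2023-02-22T17:34:25.0559793Z DEBUG:dvclive:Logged map95: 0.41883",
    "2023-02-22T17:34:25.0560806Z DEBUG:dvclive:Logged map50: 0.64363",
    "2023-02-22T17:34:25.4487820Z DEBUG:dvclive:Step: 1",
]
print(metrics_per_step(sample))
```

Lines that do not match either pattern, such as the "Logged {...} parameters" line or the DVC stage output, are ignored.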
Thanks for the logs @haimat.
How did you install DVC in the GitHub workflow?
The issue might be that DVCLive requires the DVC Python API, which is only available when you install DVC via pip install.
@daavoo We install DVC in the Github workflow like this:
- uses: iterative/setup-dvc@v1
- uses: iterative/setup-cml@v1
Isn't that the recommended way?
> Isn't that the recommended way?
I guess it depends on the needs.
We have an open issue in DVCLive to remove the need for DVC to be installed as a Python library (https://github.com/iterative/dvclive/issues/397).
Unfortunately, for now you need to install DVC via pip install dvc in order to use live metrics.
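One way to check this from inside the CI job is to ask the interpreter that runs the training script whether it can import the `dvc` package at all. The sketch below uses only the standard library; the helper name is made up for this illustration, and it assumes the check runs with the same Python environment as the training stage.

```python
import importlib.util


def python_package_available(name: str) -> bool:
    """Return True if the top-level package `name` is importable by this interpreter."""
    return importlib.util.find_spec(name) is not None


if not python_package_available("dvc"):
    # A DVC command-line tool may still be on PATH even when this prints.
    print("dvc is not importable here; try `pip install dvc` in this environment")
```

The check only says whether the package can be found; it does not verify the installed version or that a DVC repository is present.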
@daavoo Thanks, I changed this now, so that DVC is being installed via pip install dvc, and now I can see the live progress of the experiment from GitHub in Studio. There is another issue now though: When I select that run from the list in Studio and click the "Plots" button, then I get the following info:
Or if I select both last entries:
What am I missing here?
@daavoo btw. I have all the TSV metrics files from DVCLive tracked in git, which is not being pushed from the CI runner on GitHub. I do, however, push all changes after the GitHub CI action back to the DVC remote. Might that be the problem, do I need to track these DVCLive metrics files with DVC instead of git?
> I have all the TSV metrics files from DVCLive tracked in git, which is not being pushed from the CI runner on GitHub.
I think that might be the problem. How do your dvc.yaml and the step in GitHub workflows look?
Are you pushing any git changes in the CI?
You should also push the git changes in your workflow, not only for metrics but also for DVC metadata files like the dvc.lock. One way to do it is by using https://cml.dev/doc/ref/pr
There is a full example project here https://github.com/iterative/example-get-started-experiments/ but a stripped-down version would look like this:
- name: training
  env:
    REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
    STUDIO_TOKEN: ${{ secrets.STUDIO_TOKEN }}
    STUDIO_REPO_URL: ${{ secrets.STUDIO_REPO_URL }}
  run: |
    dvc pull
    cml ci --fetch-depth 0
    dvc exp run
    dvc push
    cml pr --squash --skip-ci .
@daavoo Thanks for your reply. Here is my dvc.yaml file:
stages:
  prepare:
    cmd: python src/dvc/prepare-data.py
    deps:
      - src/dvc/prepare-data.py
      - data/yolo-workspace-data-config.yaml
      - data/${dataset}
    params:
      - dataset
      - prepare.ratio
    outs:
      - data/_yolo_config.yaml
      - data/workspace
  training:
    cmd: python src/dvc/yolov8-trainer.py
    deps:
      - src/dvc/yolov8-trainer.py
      - data/_yolo_config.yaml
      - data/workspace
    params:
      - dataset
      - training.model
      - training.image_size
      - training.epochs
      - training.batch
      - training.device
      - training.random_seed
    outs:
      - training/${dataset}
    metrics:
      - evaluation/results.json
plots:
  - mAP:
      x: step
      y:
        dvclive/plots/metrics/map95.tsv: map95
        dvclive/plots/metrics/map50.tsv: map50
      title: Model Performance
      x_label: epoch
      y_label: value
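The plot definition above reads a `step` column and a metric column from each TSV file. A minimal sketch of reading such a file with the standard library follows; the sample text is illustrative rather than taken from the repository, and the layout (a header row with `step` and `map95`) is an assumption based on the `x` and `y` entries in this dvc.yaml.

```python
import csv
import io
from typing import List, Tuple


def read_metric(tsv_text: str, column: str) -> List[Tuple[int, float]]:
    """Return (step, value) pairs for `column` from tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(int(row["step"]), float(row[column])) for row in reader]


# Illustrative content; a real file would be read from disk,
# e.g. dvclive/plots/metrics/map95.tsv.
sample = "step\tmap95\n0\t0.41883\n1\t0.49632\n2\t0.54508\n"
print(read_metric(sample, "map95"))
```

Extra columns in the file are ignored by this reader, so it keeps working if the real files carry more than the two columns assumed here.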
And here is our relevant GitHub action:
# Initialise workspace
mv params-aime.yaml params.yaml
DATASET_NAME=$(grep "dataset:" params.yaml | cut -d'"' -f 2)
pip install -r src/dvc/requirements.txt
# Setup and run DVC
dvc remote add -d -f aime /dvcdata
dvc config cache.type copy
dvc pull --remote aime
dvc repro --force
dvc push --remote aime
# Create plots in report
grep -v ".png" dvclive/report.md >> report.md;
dvc plots show --show-vega mAP > vega.json
vl2png vega.json -s 1.5 > plot1.png
echo '![](./plot1.png "Model Performance")' >> report.md
# Commit report to Github via CML
cml comment create report.md
We are not using a Github PR for training via CI - yet. Is there really a need for it? I am happy to trigger in the main branch - but I am also willing to change that if it makes sense. But we actually don't push anything back to Github, just to DVC, as you can see above.
So do I understand correctly, that we would need to push the results and changes from the CI runner back to Github repo too?
OK I fixed this by pushing the changes from CI back to Github 👍
Following the official Studio docs on live updates I have added two environment variables to our GitHub CI workflow:
However, I don't see any live updates in Studio. I do see the final results of the experiment, but not anything while it is running. The code via DVCLive is quite simple:
This function is called at every training epoch. What am I missing?