iterative / dvclive

📈 Log and track ML metrics, parameters, models with Git and/or DVC
https://dvc.org/doc/dvclive
Apache License 2.0

monitor GPU resources #785

Closed AlexandreKempf closed 5 months ago

AlexandreKempf commented 5 months ago

New feature: monitoring for hardware metrics

For an earlier discussion of this PR's content, you can look here. It was an outdated version of this PR that only contained CPU, RAM, and disk metrics.

Monitoring hardware

In this PR we add the possibility for the user to monitor the GPU, CPU, RAM, and disk during one experiment. The GPU metrics are only collected if at least one GPU is detected.

To enable this feature, pass a single argument:

from dvclive import Live

with Live(monitor_system=True) as live:
    ...

If you want to use advanced features, you can specify each parameter this way:

from dvclive import Live
from dvclive.monitor_system import SystemMonitor

with Live() as live:
    live.system_monitor = SystemMonitor(interval=0.1, num_samples=15, directories_to_monitor={"data": "/path/to/data/directory", "home": "/home"})
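
The description does not define interval and num_samples. One plausible reading, which is an assumption here and not something this PR text confirms, is that a reading is taken every interval seconds and that num_samples readings are averaged into one logged value. The sketch below illustrates that reading only; collect_averaged and read_sample are hypothetical names, not part of dvclive.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def collect_averaged(
    read_sample: Callable[[], Dict[str, float]],
    interval: float = 0.1,
    num_samples: int = 15,
) -> Dict[str, float]:
    """Take num_samples readings, interval seconds apart, and average them.

    Assumption: this mirrors one possible meaning of the two parameters;
    it is not the dvclive implementation.
    """
    readings: Dict[str, List[float]] = {}
    for i in range(num_samples):
        for name, value in read_sample().items():
            readings.setdefault(name, []).append(value)
        if i < num_samples - 1:
            time.sleep(interval)  # wait between samples, not after the last one
    return {name: mean(values) for name, values in readings.items()}
```

Under this reading, interval=0.1 and num_samples=15 would yield roughly one averaged value every 1.5 seconds.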

If you allow the monitoring of your system, it will track the following:

Note that because several paths can be specified, the full metric name is system/disk/usage (%)/<user defined name>. For instance, it would be system/disk/usage (%)/data for /path/to/data/directory and system/disk/usage (%)/home for /home.
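
To make the naming scheme concrete, here is a minimal sketch that builds one usage metric per user-defined name. It uses the standard library's shutil.disk_usage for illustration; the helper name disk_usage_percent is hypothetical, and the PR's own collection code may work differently.

```python
import shutil
from typing import Dict


def disk_usage_percent(directories_to_monitor: Dict[str, str]) -> Dict[str, float]:
    """Return one 'system/disk/usage (%)/<name>' entry per monitored path."""
    metrics: Dict[str, float] = {}
    for name, path in directories_to_monitor.items():
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        percent = 100 * usage.used / usage.total if usage.total else 0.0
        metrics[f"system/disk/usage (%)/{name}"] = percent
    return metrics
```

Calling disk_usage_percent({"data": "/path/to/data/directory", "home": "/home"}) would return the two keys named in the note above, provided both paths exist.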

Note that because several GPUs can be detected, the full metric name for GPU metrics (except count) is suffixed with /<idx>, which indicates the index of the GPU. Example: system/gpu/usage (%)/0
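
The suffixing rule can be sketched as a small flattening step. The function name flatten_gpu_metrics and the list-of-dicts input shape are assumptions made for illustration, not the PR's data structures.

```python
from typing import Dict, List


def flatten_gpu_metrics(per_gpu: List[Dict[str, float]]) -> Dict[str, float]:
    """Append '/<idx>' to each metric name, where idx is the GPU's position."""
    flat: Dict[str, float] = {}
    for idx, readings in enumerate(per_gpu):
        for name, value in readings.items():
            flat[f"{name}/{idx}"] = value
    return flat
```

With two GPUs each reporting system/gpu/usage (%), the result has the keys system/gpu/usage (%)/0 and system/gpu/usage (%)/1.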

All the values that can change during an experiment can be saved as plots. Timestamps are automatically recorded with the metrics. Other metrics (that don't change) such as GPU count, GPU vram total, CPU count, RAM total and disk total are saved as metrics but cannot be saved as plots.

I decided to split the usage into % and GB. First, because it is more consistent with the other loggers out there. Second, both are extremely relevant depending on which cloud instance you run your experiment on. If you always run your experiments on the same hardware, the distinction matters less.
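
A short worked example of why both units matter: the same absolute usage reads very differently on two machine sizes. The helper name is hypothetical, and treating 1 GB as 1024**3 bytes is an assumption; the PR may use a different convention.

```python
from typing import Tuple

BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def usage_in_percent_and_gb(used_bytes: int, total_bytes: int) -> Tuple[float, float]:
    """Express the same usage both relative to capacity and in absolute terms."""
    return 100 * used_bytes / total_bytes, used_bytes / BYTES_PER_GB


# The same 12 GB of RAM in use on a 16 GB and on a 64 GB instance.
small_instance = usage_in_percent_and_gb(12 * BYTES_PER_GB, 16 * BYTES_PER_GB)
large_instance = usage_in_percent_and_gb(12 * BYTES_PER_GB, 64 * BYTES_PER_GB)
```

Here small_instance is (75.0, 12.0) and large_instance is (18.75, 12.0): the GB figure shows what the experiment needs, while the percentage shows how close a given instance is to its limit.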

Files generated

The system metrics are stored with the log_metric function, which means that the .tsv files are stored in the dvclive/plots folder. A dedicated folder, system, contains all the hardware metrics to distinguish them from the user-defined metrics. The metrics are also saved in the dvclive/metrics.json file.
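
For inspecting those files outside of the VS Code extension or Studio, a minimal sketch such as the following could load every .tsv that sits under a system folder. The function name read_system_plots is hypothetical, and the exact nesting below dvclive/plots is not assumed: the code simply searches for any path containing a system directory.

```python
import csv
from pathlib import Path
from typing import Dict, List


def read_system_plots(plots_dir: str) -> Dict[str, List[dict]]:
    """Load every .tsv file located under a 'system' folder below plots_dir."""
    series: Dict[str, List[dict]] = {}
    root = Path(plots_dir)
    for tsv_path in sorted(root.rglob("*.tsv")):
        relative = tsv_path.relative_to(root)
        if "system" not in relative.parts:
            continue  # skip user-defined metrics
        with tsv_path.open(newline="") as handle:
            # Column names come from each file's header row.
            series[relative.as_posix()] = list(csv.DictReader(handle, delimiter="\t"))
    return series
```

For example, read_system_plots("dvclive/plots") returns a dictionary keyed by relative file path, with each value holding that file's rows as dictionaries of strings.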

Plot display

Here is what it looks like in the VS Code extension: image

Here is what Studio looks like: https://studio.iterative.ai/user/AlexandreKempf/projects/image_classification-imzssxc5ew

Note that the Studio live update is a little buggy, but it is fixed in this PR.

Note that we're calling nvmlShutdown and nvmlInit at each fetch of the GPU metrics like in labML code
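
The init-and-shutdown-per-fetch pattern can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: fetch_gpu_metrics is a hypothetical name, and the NVML bindings are passed in as an argument so the sketch does not require a GPU to import. The five NVML calls used are standard pynvml functions.

```python
from typing import Any, Dict


def fetch_gpu_metrics(nvml: Any) -> Dict[str, float]:
    """Read per-GPU utilization, opening and closing NVML around each fetch."""
    metrics: Dict[str, float] = {}
    nvml.nvmlInit()
    try:
        for idx in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(idx)
            rates = nvml.nvmlDeviceGetUtilizationRates(handle)
            metrics[f"system/gpu/usage (%)/{idx}"] = float(rates.gpu)
    finally:
        nvml.nvmlShutdown()  # release NVML even if a query raised
    return metrics
```

On a machine with an NVIDIA driver and the pynvml package installed, this would be called as fetch_gpu_metrics(pynvml) after import pynvml. Wrapping the queries in try/finally keeps every nvmlInit paired with an nvmlShutdown.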

codecov-commenter commented 5 months ago

Codecov Report

Attention: 21 lines in your changes are missing coverage. Please review.

Comparison is base (170be6f) 95.51% compared to head (506dd52) 95.31%.

| Files | Patch % | Lines |
|---|---|---|
| src/dvclive/monitor_system.py | 84.67% | 15 Missing and 4 partials :warning: |
| src/dvclive/live.py | 87.50% | 1 Missing and 1 partial :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #785      +/-   ##
==========================================
- Coverage   95.51%   95.31%   -0.20%
==========================================
  Files          55       57       +2
  Lines        3542     3840     +298
  Branches      317      348      +31
==========================================
+ Hits         3383     3660     +277
- Misses        111      127      +16
- Partials       48       53       +5
```


dberenbaum commented 5 months ago

Hey @AlexandreKempf, take a look at my remaining questions/suggestions, but if @shcheklein is satisfied from a technical perspective, feel free to merge when you feel it's ready.

shcheklein commented 5 months ago

> but if @shcheklein is satisfied from a technical perspective, feel free to merge when you feel it's ready.

I've approved it already!

AlexandreKempf commented 5 months ago

I added the doc here fyi.

AlexandreKempf commented 5 months ago

Added changes based on recommendations in the documentation PR

mattseddon commented 5 months ago

Should this have closed #81?