AlexandreKempf commented 5 months ago

New feature: monitoring for hardware metrics

For an earlier discussion of this PR's content, you can look here. That was an outdated version of this PR that only contained CPU, RAM, and disk metrics.

Monitoring hardware

In this PR we add the possibility for the user to monitor the GPU, CPU, RAM, and disk during one experiment. The GPU metrics are only collected if at least one GPU is detected.

To enable this feature, pass a single argument:

from dvclive import Live

with Live(monitor_system=True) as live:
    ...

If you want to use advanced features, you can specify each parameter this way:

from dvclive import Live
from dvclive.monitor_system import SystemMonitor

with Live() as live:
    live.system_monitor = SystemMonitor(
        interval=0.1,
        num_samples=15,
        directories_to_monitor={"data": "/path/to/data/directory", "home": "/home"},
    )
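
To illustrate how `interval` and `num_samples` might work together, here is a minimal sketch. It assumes that `interval` is the delay in seconds between two readings and that `num_samples` readings are averaged into one logged value; the helper `averaged_reading` is hypothetical and is not part of dvclive.

```python
import time
from statistics import mean
from typing import Callable


def averaged_reading(
    read_value: Callable[[], float],
    interval: float = 0.1,
    num_samples: int = 15,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Take `num_samples` readings, `interval` seconds apart, and return their mean.

    Hypothetical helper: it only illustrates the assumed meaning of the two
    parameters, not how SystemMonitor is implemented.
    """
    samples = []
    for i in range(num_samples):
        samples.append(read_value())
        if i < num_samples - 1:  # no need to wait after the last sample
            sleep(interval)
    return mean(samples)
```

Under this reading, `interval=0.1` with `num_samples=15` yields roughly one averaged value every 1.5 seconds, which smooths out short spikes.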

If you enable system monitoring, it will track the following:

- system/cpu/count -> Number of CPU cores.
- system/cpu/usage (%) -> Average usage of the CPUs.
- system/cpu/parallelization (%) -> Percentage of CPU cores that run at more than 20% of their capacity. Useful when you want to parallelize your code to train your model or process your data faster.
- system/ram/usage (%) -> percentage of the RAM used. Useful to increase batch size or data processed at the same time in the RAM.
- system/ram/usage (GB) -> RAM used. Useful to increase batch size or data processed at the same time.
- system/ram/total (GB) -> Total RAM in your system
- system/disk/usage (%) -> Percentage of disk used on the partition that contains the given path. By default uses "/". You can specify the paths of the partitions you want to monitor. For instance, the code example above monitors /path/to/data/directory and /home. Data and code often live on very different paths/volumes, so it is useful for users to be able to specify their own paths.
- system/disk/usage (GB) -> Amount of disk used at a given path.
- system/disk/total (GB) -> Total disk storage at a given path.
- system/gpu/count -> Number of GPUs detected.
- system/gpu/usage (%) -> Usage of each GPU in %.
- system/vram/usage (%) -> Usage of each GPU's video memory (VRAM) in %.
- system/vram/usage (GB) -> Usage of each GPU's video memory (VRAM) in GB.
- system/vram/total (GB) -> Total amount of video memory (VRAM) in GB, for each GPU.
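
As a rough illustration of the three CPU metrics listed above, the sketch below derives them from a list of per-core usage percentages. The function name `cpu_summary` and the `busy_threshold` parameter are hypothetical; in practice the per-core readings could come from something like `psutil.cpu_percent(percpu=True)`.

```python
from typing import Dict, Sequence


def cpu_summary(
    per_core_usage: Sequence[float], busy_threshold: float = 20.0
) -> Dict[str, float]:
    """Derive the CPU metrics from per-core usage percentages (0-100).

    Hypothetical helper; the 20% threshold comes from the metric description.
    """
    count = len(per_core_usage)
    if count == 0:
        raise ValueError("per_core_usage must contain at least one core")
    # A core counts as "busy" when it runs above the threshold.
    busy = sum(1 for usage in per_core_usage if usage > busy_threshold)
    return {
        "system/cpu/count": count,
        "system/cpu/usage (%)": sum(per_core_usage) / count,
        "system/cpu/parallelization (%)": 100.0 * busy / count,
    }
```

For example, with four cores at 90%, 50%, 10%, and 10%, the average usage is 40% while only half of the cores are busy, so parallelization is 50%.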

Note that because several paths can be specified, the full metric name is system/disk/usage (%)/<user defined name>. For instance, it would be system/disk/usage (%)/data for /path/to/data/directory and system/disk/usage (%)/home for /home.
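
A minimal sketch of how per-path disk metrics could be computed with the standard library, using the naming scheme described above. The function `disk_metrics` is hypothetical. Two assumptions are made here: "GB" means 2**30 bytes, and the GB metrics take the same `/<user defined name>` suffix as the % metric.

```python
import shutil
from typing import Dict

GB = 1024**3  # assumption: "GB" is taken to mean 2**30 bytes


def disk_metrics(directories_to_monitor: Dict[str, str]) -> Dict[str, float]:
    """Return disk metrics for each user-named path (hypothetical helper)."""
    metrics: Dict[str, float] = {}
    for name, path in directories_to_monitor.items():
        # disk_usage reports totals for the partition that contains `path`.
        usage = shutil.disk_usage(path)
        percent = 100.0 * usage.used / usage.total if usage.total else 0.0
        metrics[f"system/disk/usage (%)/{name}"] = percent
        metrics[f"system/disk/usage (GB)/{name}"] = usage.used / GB
        metrics[f"system/disk/total (GB)/{name}"] = usage.total / GB
    return metrics
```

Two names that point to the same partition will report identical values, since the numbers describe the partition rather than the directory itself.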

Note that because several GPUs can be detected, the full metric name for GPU metrics (except count) is suffixed with /<idx>, which indicates the index of the GPU. Example: system/gpu/usage (%)/0
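
To make the suffix rule concrete, here is a small sketch that lists the GPU metric names expected for a given number of GPUs. The helper `gpu_metric_names` is hypothetical and only mirrors the naming described in this PR.

```python
from typing import List


def gpu_metric_names(gpu_count: int) -> List[str]:
    """List the GPU metric names for `gpu_count` detected GPUs (hypothetical helper)."""
    if gpu_count < 1:
        # GPU metrics are only collected if at least one GPU is detected.
        return []
    per_gpu = [
        "system/gpu/usage (%)",
        "system/vram/usage (%)",
        "system/vram/usage (GB)",
        "system/vram/total (GB)",
    ]
    names = ["system/gpu/count"]  # the count is not suffixed
    for idx in range(gpu_count):
        names.extend(f"{base}/{idx}" for base in per_gpu)
    return names
```

With two GPUs this gives nine names: the count plus four metrics for each of the indices 0 and 1.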

All the values that can change during an experiment can be saved as plots. Timestamps are automatically recorded with the metrics. Other metrics that don't change, such as GPU count, total VRAM, CPU count, total RAM, and total disk, are saved as metrics but cannot be saved as plots.

I decided to split the usage into % and GB. First, because it is more consistent with the other loggers out there. Second, both are relevant depending on which cloud instance you run your experiment on. If you always run your experiments on the same hardware, the distinction matters less.

Files generated

The hardware metrics are stored with the log_metric function. This means that the .tsv files are stored in the dvclive/plots folder. A specific folder, system, contains all the hardware metrics to distinguish them from the user-defined metrics. The metrics are also saved in the dvclive/metrics.json file.
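
For anyone who wants to inspect the generated .tsv files by hand, here is a minimal sketch of reading one column from tab-separated text. It assumes the file starts with a header row; the helper `read_metric_column` and the column names in the sample are illustrative and may not match the real files.

```python
import csv
import io
from typing import List


def read_metric_column(tsv_text: str, value_column: str) -> List[float]:
    """Return the values of `value_column` from tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [float(row[value_column]) for row in reader]


# Illustrative content only: the real column names are an assumption.
sample = "timestamp\tstep\tusage (%)\n1700000000000\t0\t12.5\n1700000001500\t1\t37.5\n"
values = read_metric_column(sample, "usage (%)")
```

To read an actual file, pass the result of `Path(...).read_text()` for the .tsv of interest instead of the sample string.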

Plot display

Here is what the VS Code extension looks like:

Here is what Studio looks like: https://studio.iterative.ai/user/AlexandreKempf/projects/image_classification-imzssxc5ew

Note that the Studio live update is a little buggy, but it is fixed in this PR.

Note that we're calling `nvmlShutdown` and `nvmlInit` at each fetch of the GPU metrics, as the labML code does.
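
The init/shutdown-per-fetch pattern could look roughly like the sketch below. This is not the code from this PR: `fetch_gpu_metrics` is a hypothetical function, "GB" is assumed to mean 2**30 bytes, and an empty dict is returned when pynvml, the NVIDIA driver, or a GPU is missing.

```python
try:
    import pynvml  # optional at import time; needs an NVIDIA driver at runtime
except ImportError:
    pynvml = None

GB = 1024**3  # assumption: "GB" is taken to mean 2**30 bytes


def fetch_gpu_metrics() -> dict:
    """Fetch GPU metrics once, initializing and shutting down NVML around the call."""
    if pynvml is None:
        return {}
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return {}  # no driver or no NVML library available
    try:
        count = pynvml.nvmlDeviceGetCount()
        if count == 0:
            return {}
        metrics = {"system/gpu/count": count}
        for idx in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
            metrics[f"system/gpu/usage (%)/{idx}"] = utilization.gpu
            metrics[f"system/vram/usage (%)/{idx}"] = 100.0 * memory.used / memory.total
            metrics[f"system/vram/usage (GB)/{idx}"] = memory.used / GB
            metrics[f"system/vram/total (GB)/{idx}"] = memory.total / GB
        return metrics
    except pynvml.NVMLError:
        return {}
    finally:
        # Runs on every path after a successful nvmlInit, so each fetch is self-contained.
        pynvml.nvmlShutdown()
```

Pairing every `nvmlInit` with an `nvmlShutdown` in a `finally` block keeps each fetch independent of the previous one, at the cost of re-initializing NVML on every call.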

- [x] ❗ I have followed the Contributing to DVCLive guide.
- [x] 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here: https://github.com/iterative/dvc.org/pull/5138

codecov-commenter commented 5 months ago

Codecov Report

Attention: 21 lines in your changes are missing coverage. Please review.

Comparison is base (170be6f) 95.51% compared to head (506dd52) 95.31%.

| Files | Patch % | Lines |
|---|---|---|
| src/dvclive/monitor_system.py | 84.67% | 15 Missing and 4 partials :warning: |
| src/dvclive/live.py | 87.50% | 1 Missing and 1 partial :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #785      +/-   ##
==========================================
- Coverage   95.51%   95.31%   -0.20%
==========================================
  Files          55       57       +2
  Lines        3542     3840     +298
  Branches      317      348      +31
==========================================
+ Hits         3383     3660     +277
- Misses        111      127      +16
- Partials       48       53       +5
```


dberenbaum commented 5 months ago

Hey @AlexandreKempf, take a look at my remaining questions/suggestions, but if @shcheklein is satisfied from a technical perspective, feel free to merge when you feel it's ready.

shcheklein commented 5 months ago

> but if @shcheklein is satisfied from a technical perspective, feel free to merge when you feel it's ready.

I've approved it already!

AlexandreKempf commented 5 months ago

I added the doc here fyi.

AlexandreKempf commented 5 months ago

Added changes based on recommendations in the documentation PR:

- install psutil and pynvml by default
- removed the plot argument, since we always want these metrics plotted
- added a monitor_system method to Live instead of a property

mattseddon commented 5 months ago

Should this have closed #81?

iterative / dvclive

monitor GPU ressources #785
