AlexandreKempf commented 8 months ago

New feature: monitoring for CPU

Link to GPU monitoring. Link to fix Studio

Monitoring CPU hardware

In this PR we add the possibility for the user to monitor the CPU, RAM, and disk during one experiment.

To use this feature you can use it with a simple argument:

from dvclive import Live

with Live(monitor_system=True) as live:
    ...

If you want to use advance features you can specify each parameter this way:

from dvclive import Live
from dvclive.monitor_system import CPUMonitor

with Live(monitor_system=True) as live:
    live.cpu_monitor = CPUMonitor(interval = 0.1, num_samples=15, directories_to_monitor={"data": "/path/to/data/disk", "home": "/home"}))

And this is how the editor help should looks like: Screenshot from 2024-02-15 12-27-04

If you allow the monitoring of your system, if will track:

system/cpu/count -> number of CPU cores
system/cpu/usage (%) -> the average usage of the CPUs.
system/cpu/parallelization (%) -> How many CPU cores use more than 20% of their possibilities? It is useful when you're looking to parallelize your code to train your model or process your data faster.
- system/ram/usage (%) -> percentage of the RAM used. Useful to increase batch size or data processed at the same time in the RAM.
- system/ram/usage (GB) -> RAM used. Useful to increase batch size or data processed at the same time.
- system/ram/total (GB) -> Total RAM in your system
- system/disk/usage (%) -> Amount of disk used at a given path in %. By default uses "." meaning where the python executor was launched. You can specify the paths you want to monitor. For instance the code example above monitor /data and /home. Data and code often lives in very different paths/volumes, so it is useful for the user to be able to specify its own path. Note that as their can be several paths specified, the full metric name is system/disk/usage (%)/<user defined name>. For instance it would be system/disk/usage (%)/data for the /path/to/data/disk and system/disk/usage (%)/home for /home.
- system/disk/usage (GB) -> Amount of disk used at a given path.
- system/disk/total (GB) -> Amount of total disk storage at a given path.

All the values that can change during an experiment can be saved as plots. Timestamps are automatically recorded with the metrics. Other metrics (that don't change) such as CPU count, RAM total and disk total are saved as metrics but cannot be saved as plots.

I decided to split the usage in % and GB. First, because it is more consistent with the other loggers out there. Second, both are extremely relevant based on which cloud instance you run your experiment. If you only run your experiment on the same hardware, the distinction is not really interesting.

Files generated

The metrics about the CPU are stored with the log_metric function. It means that the .tsv files are stored in the dvclive/plots folder. A specific folder, system, contains all the metrics about the CPU to distinguish them from the user-defined metrics. The metrics are also saved in the dvclive/metrics.json file.

Plot display

Here is what VScode extension looks like:

Here is what Studio looks like:

Note that studio update is a little buggy, but it is fixed in this PR

[x] ❗ I have followed the Contributing to DVCLive guide.
[ ] 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

dberenbaum commented 8 months ago

@AlexandreKempf Sorry for the delay here. For any PR like this where you introduce a new feature, it would be great if you could include a demo of the feature in the PR. Some useful ways to demo it can be:

Record a video via screencast, https://www.loom.com, https://asciinema.org, etc.
Include an example notebook, script, repo, etc.
Write a draft of the docs PR

Once it's merged, you can reuse it by posting it to https://iterativeai.slack.com/archives/C064XEWG2BD

dberenbaum commented 8 months ago

Also note that windows tests are failing

AlexandreKempf commented 8 months ago

@dberenbaum Update the description with a video, as you asked, and my Studio dashboard. I'm working on tests now.

dberenbaum commented 8 months ago

Sorry if I forget this from your previous work on this, but is GPU usage a separate callback you plan to add later?

plots use steps as x by default when we log metrics. Using timestamps makes more sense if available (timestamp=True). It is weird for the monitoring resources to be plotted with steps.

Can we configure these plots to use timestamps as the x-axis? WDYT about putting these into a separate dir like dvclive/plots/system?

I guess it goes beyond the idea of a generic callback system, but seems like a cleaner UX. I'm also unsure whether we need to expose a whole callbacks system to users for this, or whether there's a simpler way for them to enable/disable logging system metrics (even if we use callbacks internally).

it can be nice to have a time flag system to know where we are in training (steps are not very informative). This is very useful to align the hardware resources with specific algorithm parts.

Sorry, I don't follow. How is it different from the previous point?

shcheklein commented 8 months ago

Thanks for the description and video @AlexandreKempf . It's way easier to review it now and understand.

A few questions, before reviewing the code itself in details:

Why do we need a separate callback? It's a bit confusing to pass a callback into a callback. Is it what other product do? E.g. W&B , neptune, comet? Can we review and summarize how they do this?

AlexandreKempf commented 8 months ago

@dberenbaum

Can we configure these plots to use timestamps as the x-axis? WDYT about putting these into a separate dir like dvclive/plots/system?

Yeah, that is a cool idea. Initially, I put them into dvclive/plots/metrics/system because I use the log_metric function that naively puts them here. And I tried to change the dvc.yaml to have the system subdirectory to take timestamp as X. But this doesn't work if a parent directory is set to another X (in that case, dvclive/plots/metrics was set to step.

I guess it goes beyond the idea of a generic callback system, but seems like a cleaner UX. I'm also unsure whether we need to expose a whole callbacks system to users for this, or whether there's a simpler way for them to enable/disable logging system metrics (even if we use callbacks internally).

I agree with the UX, and I also find it more modular and easy to extend (with GPU, TPU, pricing for cloud ...). But I agree with you and Ivan, the user doesn't need to know how it works under the hood. And the callback in a callback is not great either. I'll change it to only take a monitor_system boolean.

Sorry, I don't follow. How is it different from the previous point?

I was thinking of something like this:

AlexandreKempf commented 8 months ago

@dberenbaum Revert all the changes about timestamp so that we use step instead. We might change it in the future, but for now, it produces this result when we compare two experiments: Which is not satisfying. I tried to find a way to set the x-axis as relative to the first timestamp on vega-lite but I couldn't find the right config to make it work.

Also I fixed the lightning issue and made sure it works with the vanilla one (with Live() as live).

dberenbaum commented 8 months ago

I tried to find a way to set the x-axis as relative to the first timestamp on vega-lite but I couldn't find the right config to make it work.

I think we could log the relative timestamp to the tsv file, but I'm fine to make this a follow-up issue and proceed with step for now.

AlexandreKempf commented 8 months ago

I think we could log the relative timestamp to the tsv file, but I'm fine to make this a follow-up issue and proceed with step for now.

It is sad that is the only way of dealing with it, and that Vega-lite doesn't seem to support that (even with all the transformations they provide). I guess we'll leave it as it is for now. I would love to continue working on the timestamp x-axis on another PR when we have time, but it also mean that we will need to have:

[ ] handle subdirectory specification in dvc.yaml plots configuration (which is an expected behavior from a naive user I would say). In this feature I mean that if the x-axis is specified for dvclive/plots/metrics, we should be able to add a configuration below that says dvclive/plots/metrics/system with another x-axis (timestamp in this case).

[ ] save the initial timestamp of the experiment to save it to the vega-lite configuration. I saw that with some transformation it is possible to subtract values to the list of datapoints in vega-lite. If we have this value accessible by the config, we might have the relative timestamp directly
- [ ] create a DVC plot template for relative timeseries (that subtract the initial timestamp of the experiment).
I find this roadmap cleaner than trying to hack our way into the solution. What do you think?

dberenbaum commented 8 months ago

I find this roadmap cleaner than trying to hack our way into the solution. What do you think?

Let's create a separate issue and discuss there.

Do we also need a follow-up issue for gpu metrics? Any other items that you consider out of scope but want to open an issue for?

Is it otherwise ready for review?

AlexandreKempf commented 8 months ago

Yeah, I was waiting for this PR to be reviewed before I started working on GPU monitoring. No need to add the issue for the GPU, It'll be done by tomorrow on another PR. I'll create any relevant issues I can think of and that are missing here.

codecov-commenter commented 8 months ago

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (170be6f) 95.51% compared to head (c4485d7) 95.32%.

Files	Patch %	Lines
src/dvclive/monitor_system.py	84.61%	10 Missing and 4 partials :warning:
src/dvclive/live.py	87.50%	1 Missing and 1 partial :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #773 +/- ## ========================================== - Coverage 95.51% 95.32% -0.19% ========================================== Files 55 57 +2 Lines 3542 3742 +200 Branches 317 337 +20 ========================================== + Hits 3383 3567 +184 - Misses 111 122 +11 - Partials 48 53 +5 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

dberenbaum commented 8 months ago

Note there are errors on windows

dberenbaum commented 8 months ago

@AlexandreKempf You can just ping here or request a review again once you feel it's ready. 🙏

AlexandreKempf commented 8 months ago

@shcheklein @dberenbaum Please consider this PR also, because it will probably be merge in this branch before we merge to main.

AlexandreKempf commented 8 months ago

@shcheklein @dberenbaum

The following PR concerning the GPU metrics changes more code than I usually planned. It is not a simple addition to this PR. It needs to change a lot of this PR code as well. So, to stop wasting your time on this PR, we will move the discussion to the final version of the code.

The next PR is here.

I'll close this PR, let's continue the discussion on #785

iterative / dvclive

Monitors CPU, RAM, and disk usage #773

New feature: monitoring for CPU

Monitoring CPU hardware

Files generated

Plot display

Codecov Report