Closed — @AlexandreKempf closed this 8 months ago
@AlexandreKempf Sorry for the delay here. For any PR like this where you introduce a new feature, it would be great if you could include a demo of the feature in the PR. Some useful ways to demo it can be:
Once it's merged, you can reuse it by posting it to https://iterativeai.slack.com/archives/C064XEWG2BD
Also note that Windows tests are failing.
@dberenbaum I updated the description with a video, as you asked, and added my Studio dashboard. I'm working on tests now.
Sorry if I forget this from your previous work on this, but is GPU usage a separate callback you plan to add later?
- plots use steps as `x` by default when we log metrics. Using timestamps makes more sense if available (`timestamp=True`). It is weird for the monitoring resources to be plotted with steps.

Can we configure these plots to use timestamps as the x-axis? WDYT about putting these into a separate dir like `dvclive/plots/system`?
I guess it goes beyond the idea of a generic callback system, but seems like a cleaner UX. I'm also unsure whether we need to expose a whole callbacks system to users for this, or whether there's a simpler way for them to enable/disable logging system metrics (even if we use callbacks internally).
- it can be nice to have a time-flag system to know where we are in training (steps are not very informative). This is very useful to align the hardware resources with specific parts of the algorithm.
Sorry, I don't follow. How is it different from the previous point?
Thanks for the description and video @AlexandreKempf . It's way easier to review it now and understand.
A few questions, before reviewing the code itself in detail:
@dberenbaum
Can we configure these plots to use timestamps as the x-axis? WDYT about putting these into a separate dir like `dvclive/plots/system`?

Yeah, that is a cool idea. Initially, I put them into `dvclive/plots/metrics/system` because I use the `log_metric` function, which naively puts them there. And I tried to change the dvc.yaml so that the `system` subdirectory takes `timestamp` as x. But this doesn't work if a parent directory is set to another x (in that case, `dvclive/plots/metrics` was set to `step`).
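For illustration, the kind of dvc.yaml override described above would look roughly like this (paths illustrative); the nested override on the `system` subdirectory is exactly what doesn't take effect when the parent directory already sets an x:

```yaml
plots:
  - dvclive/plots/metrics:
      x: step
  - dvclive/plots/metrics/system:
      x: timestamp
```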
I guess it goes beyond the idea of a generic callback system, but seems like a cleaner UX. I'm also unsure whether we need to expose a whole callbacks system to users for this, or whether there's a simpler way for them to enable/disable logging system metrics (even if we use callbacks internally).
I agree with the UX, and I also find it more modular and easy to extend (with GPU, TPU, pricing for cloud ...). But I agree with you and Ivan: the user doesn't need to know how it works under the hood, and a callback inside a callback is not great either. I'll change it to only take a `monitor_system` boolean.
Sorry, I don't follow. How is it different from the previous point?
I was thinking of something like this:
@dberenbaum I reverted all the changes about `timestamp` so that we use `step` instead. We might change it in the future, but for now, it produces this result when we compare two experiments:

Which is not satisfying.
I tried to find a way to set the x-axis as relative to the first timestamp on vega-lite but I couldn't find the right config to make it work.
Also, I fixed the lightning issue and made sure it works with the vanilla one (`with Live() as live`).
I tried to find a way to set the x-axis as relative to the first timestamp on vega-lite but I couldn't find the right config to make it work.
I think we could log the relative timestamp to the tsv file, but I'm fine to make this a follow-up issue and proceed with step for now.
I think we could log the relative timestamp to the tsv file, but I'm fine to make this a follow-up issue and proceed with step for now.
It is sad that this is the only way of dealing with it, and that Vega-Lite doesn't seem to support that (even with all the transformations they provide). I guess we'll leave it as it is for now. I would love to continue working on the timestamp x-axis in another PR when we have time, but it also means that we will need to have:

- [ ] even if an `x-axis` is specified for `dvclive/plots/metrics`, we should be able to add a configuration below that says `dvclive/plots/metrics/system` with another `x-axis` (timestamp in this case).
- [ ] save the initial timestamp of the experiment so it can be added to the Vega-Lite configuration. I saw that with some transformation it is possible to subtract values from the list of datapoints in Vega-Lite. If we have this value accessible from the config, we might get the relative timestamp directly.
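For reference, the kind of Vega-Lite transform mentioned above could be sketched like this (field names are assumptions; whether this composes with DVC's plot templates across experiments is exactly the open question):

```json
{
  "transform": [
    {"joinaggregate": [{"op": "min", "field": "timestamp", "as": "start"}]},
    {"calculate": "datum.timestamp - datum.start", "as": "relative_time"}
  ],
  "encoding": {"x": {"field": "relative_time", "type": "quantitative"}}
}
```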
I find this roadmap cleaner than trying to hack our way into the solution. What do you think?
I find this roadmap cleaner than trying to hack our way into the solution. What do you think?
Let's create a separate issue and discuss there.
Do we also need a follow-up issue for gpu metrics? Any other items that you consider out of scope but want to open an issue for?
Is it otherwise ready for review?
Yeah, I was waiting for this PR to be reviewed before I started working on GPU monitoring. No need to add the issue for the GPU; it'll be done by tomorrow in another PR. I'll create any relevant issues I can think of that are missing here.
Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (170be6f) 95.51% compared to head (c4485d7) 95.32%.
Files | Patch % | Lines |
---|---|---|
src/dvclive/monitor_system.py | 84.61% | 10 Missing and 4 partials :warning: |
src/dvclive/live.py | 87.50% | 1 Missing and 1 partial :warning: |
Note there are errors on Windows.
@AlexandreKempf You can just ping here or request a review again once you feel it's ready. 🙏
@shcheklein @dberenbaum
Please consider this PR as well, because it will probably be merged into this branch before we merge to `main`.
@shcheklein @dberenbaum
The following PR concerning the GPU metrics changes more code than I initially planned. It is not a simple addition to this PR; it needs to change a lot of this PR's code as well. So, to stop wasting your time on this PR, we will move the discussion to the final version of the code.
The next PR is here.
I'll close this PR, let's continue the discussion on #785
New feature: monitoring for CPU
Link to GPU monitoring. Link to fix Studio
Monitoring CPU hardware
In this PR we add the possibility for the user to monitor the CPU, RAM, and disk during one experiment.
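dvclive's actual API is not reproduced here; as a rough, stdlib-only sketch of what such monitoring does conceptually (the `monitor` function and the sampled fields are illustrative, not this PR's code), a background thread samples the system at a fixed interval and records timestamped rows alongside the training loop:

```python
import os
import threading
import time

def monitor(samples, stop, interval=0.05):
    # Illustrative sampler: append one timestamped row per interval.
    # A real implementation would also read CPU/RAM/disk usage here.
    while not stop.is_set():
        samples.append({"timestamp": time.time(), "system/cpu/count": os.cpu_count()})
        stop.wait(interval)

samples, stop = [], threading.Event()
thread = threading.Thread(target=monitor, args=(samples, stop))
thread.start()
time.sleep(0.2)   # stand-in for the training loop
stop.set()
thread.join()
```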
To use this feature, you can enable it with a simple argument:
If you want to use advanced features, you can specify each parameter this way:
And this is what the editor help should look like:
If you allow the monitoring of your system, it will track:

- `system/cpu/count` -> number of CPU cores.
- `system/cpu/usage (%)` -> the average usage of the CPUs.
- `system/cpu/parallelization (%)` -> how many CPU cores use more than 20% of their capacity. It is useful when you're looking to parallelize your code to train your model or process your data faster.
- `system/ram/usage (%)` -> percentage of the RAM used. Useful to increase batch size or the amount of data processed at the same time in RAM.
- `system/ram/usage (GB)` -> RAM used. Useful to increase batch size or the amount of data processed at the same time.
- `system/ram/total (GB)` -> total RAM in your system.
- `system/disk/usage (%)` -> amount of disk used at a given path, in %. By default it uses ".", meaning where the Python executor was launched. You can specify the paths you want to monitor. For instance, the code example above monitors `/data` and `/home`. Data and code often live in very different paths/volumes, so it is useful for the user to be able to specify their own paths. Note that as there can be several paths specified, the full metric name is `system/disk/usage (%)/<user defined name>`. For instance, it would be `system/disk/usage (%)/data` for `/path/to/data/disk` and `system/disk/usage (%)/home` for `/home`.
- `system/disk/usage (GB)` -> amount of disk used at a given path.
- `system/disk/total (GB)` -> amount of total disk storage at a given path.

All the values that can change during an experiment can be saved as plots. Timestamps are automatically recorded with the metrics. Other metrics that don't change, such as CPU count, RAM total, and disk total, are saved as metrics but cannot be saved as plots.
I decided to split the usage into % and GB. First, because it is more consistent with the other loggers out there. Second, both are extremely relevant depending on which cloud instance you run your experiment on. If you only run your experiments on the same hardware, the distinction is not really interesting.
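The %/GB split is just the same measurement reported two ways; a minimal sketch (the `ram_metrics` helper is illustrative, not the PR's code):

```python
def ram_metrics(used_bytes, total_bytes):
    # Same measurement, reported both relative (%) and absolute (GB).
    return {
        "system/ram/usage (%)": 100 * used_bytes / total_bytes,
        "system/ram/usage (GB)": used_bytes / 1024**3,
        "system/ram/total (GB)": total_bytes / 1024**3,
    }

# 12 GiB used out of 16 GiB -> 75.0 %, 12.0 GB, 16.0 GB total
print(ram_metrics(12 * 1024**3, 16 * 1024**3))
```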
Files generated
The metrics about the CPU are stored with the `log_metric` function. It means that the `.tsv` files are stored in the `dvclive/plots` folder. A specific folder, `system`, contains all the metrics about the CPU to distinguish them from the user-defined metrics. The metrics are also saved in the `dvclive/metrics.json` file.

Plot display
Here is what the VS Code extension looks like:
Here is what Studio looks like:
Note that the Studio update is a little buggy, but it is fixed in this PR