Closed AlexandreKempf closed 5 months ago
Attention: 21 lines in your changes are missing coverage. Please review.

Comparison is base (`170be6f`) 95.51% compared to head (`506dd52`) 95.31%.
| Files | Patch % | Lines |
|---|---|---|
| src/dvclive/monitor_system.py | 84.67% | 15 Missing and 4 partials :warning: |
| src/dvclive/live.py | 87.50% | 1 Missing and 1 partial :warning: |
Hey @AlexandreKempf, take a look at my remaining questions/suggestions, but if @shcheklein is satisfied from a technical perspective, feel free to merge when you feel it's ready.
I've approved it already!
I added the doc here fyi.
Added changes based on recommendations in the documentation PR:

- `psutil` and `pynvml` are now installed by default
- removed the `plot` argument, since we always want these metrics plotted
- `monitor_system` is now an argument to `Live` instead of a property

Should this have closed #81?
New feature: monitoring of hardware metrics
For an earlier discussion of this PR's content, you can look here. It was an outdated version of this PR that only contained CPU, RAM, and disk metrics.
## Monitoring hardware
In this PR we add the possibility for the user to monitor the GPU, CPU, RAM, and disk during an experiment. The GPU metrics are only collected if at least one GPU is detected.
To use this feature, you enable it with a single argument:
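A minimal sketch of the simple form (the flag name `monitor_system` comes from this PR; the training loop is illustrative):

```python
from dvclive import Live

# Hardware monitoring is switched on with a single constructor argument.
with Live(monitor_system=True) as live:
    for epoch in range(10):
        ...  # train / evaluate
        live.next_step()
```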
If you want advanced control, you can specify each parameter explicitly:
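A sketch of the advanced form. The parameter names below (`interval`, `num_samples`, `directories_to_monitor`) are assumptions based on the behavior described in this PR (sampling-based metrics, user-chosen disk paths), not confirmed API; check the merged code before copying:

```python
from dvclive import Live

# Hypothetical advanced configuration -- keys are illustrative assumptions.
with Live(
    monitor_system={
        "interval": 0.5,        # seconds between samples (assumed)
        "num_samples": 10,      # samples averaged per logged point (assumed)
        "directories_to_monitor": {"data": "/data", "home": "/home"},
    }
) as live:
    ...
```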
If you enable monitoring of your system, it will track the following:
- `system/cpu/count` -> number of CPU cores
- `system/cpu/usage (%)` -> the average usage of the CPUs
- `system/cpu/parallelization (%)` -> the share of CPU cores using more than 20% of their capacity. Useful when you're trying to parallelize your code to train your model or process your data faster.
- `system/ram/usage (%)` -> percentage of the RAM used. Useful for increasing the batch size or the amount of data processed at the same time in RAM.
- `system/ram/usage (GB)` -> RAM used, in GB. Useful for increasing the batch size or the amount of data processed at the same time.
- `system/ram/total (GB)` -> total RAM in your system
- `system/disk/usage (%)` -> amount of disk used by the partition that contains the given path, in %. By default it uses "/". You can specify the paths to the partitions you want to monitor; for instance, the code example above monitors `/data` and `/home`. Data and code often live in very different paths/volumes, so it is useful for the user to be able to specify their own paths.
- `system/disk/usage (GB)` -> amount of disk used at a given path
- `system/disk/total (GB)` -> total disk storage at a given path
- `system/gpu/count` -> number of GPUs detected
- `system/gpu/usage (%)` -> usage of each GPU, in %
- `system/vram/usage (%)` -> usage of each GPU's virtual memory, in %
- `system/vram/usage (GB)` -> usage of each GPU's virtual memory, in GB
- `system/vram/total (GB)` -> total amount of GPU virtual memory, in GB, for each GPU

Note that as several paths can be specified, the full disk metric name is `system/disk/usage (%)/<user defined name>`. For instance, it would be `system/disk/usage (%)/data` for `/path/to/data/disk` and `system/disk/usage (%)/home` for `/home`.

Note that as several GPUs can be detected, the full metric name for GPU metrics (except `count`) is suffixed with `/<idx>`, which indicates the index of the GPU. Example: `system/gpu/usage (%)/0`.
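To make the naming scheme concrete, here is a small stdlib-only sketch. The `metric_full_name` helper is hypothetical, not dvclive's internal API, and the disk values come from Python's standard `shutil.disk_usage`:

```python
import shutil

def metric_full_name(base, suffix=None):
    """Append a user-defined partition name or a GPU index to a base metric."""
    return base if suffix is None else f"{base}/{suffix}"

# Per-partition disk metrics get the user-defined name as a suffix.
print(metric_full_name("system/disk/usage (%)", "data"))  # system/disk/usage (%)/data
# Per-GPU metrics get the GPU index as a suffix.
print(metric_full_name("system/gpu/usage (%)", 0))        # system/gpu/usage (%)/0
# Metrics that exist once, like the GPU count, keep the bare name.
print(metric_full_name("system/gpu/count"))               # system/gpu/count

# The disk values themselves can be computed with stdlib shutil.disk_usage.
total, used, _free = shutil.disk_usage("/")
print(f"usage (%): {100 * used / total:.1f}, total (GB): {total / 1e9:.2f}")
```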
All the values that can change during an experiment can be saved as plots, and timestamps are automatically recorded with the metrics. Metrics that don't change (GPU count, total VRAM, CPU count, total RAM, and total disk) are saved as metrics but cannot be saved as plots.
I decided to split the usage into % and GB. First, because it is more consistent with the other loggers out there. Second, both are extremely relevant depending on which cloud instance you run your experiment on. If you always run your experiments on the same hardware, the distinction is not really interesting.
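As a concrete illustration of the `cpu/parallelization (%)` metric described above: it is the share of cores whose utilization exceeds the 20% threshold. dvclive samples per-core usage via psutil; in this sketch the samples are passed in so it stays stdlib-only, and `cpu_parallelization` is a hypothetical helper:

```python
def cpu_parallelization(per_core_usage, threshold=20.0):
    """Return the percentage of cores busier than `threshold` percent."""
    if not per_core_usage:
        return 0.0
    busy = sum(1 for usage in per_core_usage if usage > threshold)
    return 100.0 * busy / len(per_core_usage)

# 2 of 4 cores are above 20% -> 50% parallelization.
print(cpu_parallelization([95.0, 3.0, 88.0, 10.0]))  # 50.0
```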
## Files generated

The hardware metrics are stored with the `log_metric` function. This means the `.tsv` files are stored in the `dvclive/plots` folder. A specific subfolder, `system`, contains all the hardware metrics, to distinguish them from the user-defined metrics. The metrics are also saved in the `dvclive/metrics.json` file.

## Plot display
Here is what the VS Code extension looks like:

![image](https://github.com/iterative/dvclive/assets/14785566/1295eca0-b533-49a2-ac51-c91f662ba3c5)
Here is what Studio looks like: https://studio.iterative.ai/user/AlexandreKempf/projects/image_classification-imzssxc5ew
Note that the Studio live update is a little buggy, but it is fixed in this PR
Note that we're calling `nvmlShutdown` and `nvmlInit` at each fetch of the GPU metrics, as in the labML code
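The per-fetch init/shutdown pattern can be sketched as follows. It uses the real pynvml binding names, but requires an NVIDIA GPU and the `pynvml` package, so treat it as illustrative rather than as the PR's exact code:

```python
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUtilizationRates,
    nvmlDeviceGetMemoryInfo,
)

def fetch_gpu_metrics():
    """Sample usage/VRAM for every GPU, re-initializing NVML on each call."""
    nvmlInit()  # fresh NVML session per fetch, as in the labML approach
    try:
        metrics = {}
        for idx in range(nvmlDeviceGetCount()):
            handle = nvmlDeviceGetHandleByIndex(idx)
            util = nvmlDeviceGetUtilizationRates(handle)
            mem = nvmlDeviceGetMemoryInfo(handle)
            metrics[f"system/gpu/usage (%)/{idx}"] = util.gpu
            metrics[f"system/vram/usage (GB)/{idx}"] = mem.used / 1e9
            metrics[f"system/vram/total (GB)/{idx}"] = mem.total / 1e9
        return metrics
    finally:
        nvmlShutdown()  # always torn down, even if a query fails
```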