Closed ImahnShekhzadeh closed 1 month ago
Hi @ImahnShekhzadeh! There are a couple ways.
If you're using multiple GPUs on one node, you can have the rank 0 process measure all the GPUs. That is, pass a list of all GPU indices as `gpu_indices`. In this case, the measurement window is synchronized across all GPUs (i.e., the window begins and ends at the same time for every GPU in `gpu_indices`). For workloads like data parallel training, this should be what users want.
```python
if torch.distributed.get_rank() == 0:
    monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

if torch.distributed.get_rank() == 0:
    monitor.begin_window("entire_training")

if torch.distributed.get_rank() == 0:
    measurement = monitor.end_window("entire_training")  # This object has all four GPU measurements.
```
Another way is to instantiate a monitor for each process that manages one GPU. For this one, you will need a way to aggregate measurements across all GPUs.
```python
monitor = ZeusMonitor(gpu_indices=[torch.cuda.current_device()])

# Obtained `measurement` object in this rank via `monitor.end_window(...)`.
measurements = [None for _ in range(4)]
torch.distributed.all_gather_object(measurements, obj=measurement)  # One measurement object per GPU.
```
I see, thanks!
Hi,
Thanks for this repo! In the documentation, you have the following usage example:
When using distributed data parallel with four GPUs, the training loop is actually executed four times, once per process, with each process spawned on a different GPU. In this case, should the line

be kept, or should the monitor be

and then the final energy in joules be multiplied by 4?