Closed ImahnShekhzadeh closed 1 month ago
Hi @ImahnShekhzadeh! There are a couple ways.
If you're using multiple GPUs on one node, you can have the rank 0 process measure all the GPUs. That is, pass a list of all GPU indices as `gpu_indices`. In this case, the measurement window is synchronized across all GPUs (i.e., the window begins and ends at the same time for every GPU in `gpu_indices`). For workloads like data parallel training, this should be what users want.
```python
if torch.distributed.get_rank() == 0:
    monitor = ZeusMonitor(gpu_indices=[0, 1, 2, 3])

if torch.distributed.get_rank() == 0:
    monitor.begin_window("entire_training")

if torch.distributed.get_rank() == 0:
    measurement = monitor.end_window("entire_training")  # This object has all four GPU measurements.
```
Another way is to instantiate a monitor for each process that manages one GPU. For this one, you will need a way to aggregate measurements across all GPUs.
```python
monitor = ZeusMonitor(gpu_indices=[torch.cuda.current_device()])

# Obtained `measurement` object in this rank via `monitor.end_window(...)`.
measurements = [None for _ in range(4)]
torch.distributed.all_gather_object(measurements, obj=measurement)  # One measurement object per GPU.
```
I see, thanks!
Hi,
Thanks for this repo! In the documentation, you have the following usage example:
When using distributed data parallel with four GPUs, the training loop is actually executed four times, once per process, with each process spawned on a different GPU. In this case, should the line

be kept, or should the monitor be

and then the final energy in joules be multiplied by 4?