mlco2 / codecarbon

Track emissions from Compute and recommend ways to reduce their impact on the environment.
https://mlco2.github.io/codecarbon
MIT License
1k stars 157 forks

Multi GPU tracking: how to interpret? #544

Open Nkluge-correa opened 1 month ago

Nkluge-correa commented 1 month ago

This is not an issue, but I have a question about interpreting the emissions.csv file from the tracker.

I am tracking a training run on 8 GPUs. I set every process (GPU) to track its consumption and flush the data every time I make a checkpoint for my model. In the end, I got 8 emissions files (emissions_0, ..., emissions_7), one tracked and flushed by each GPU process I used.

Now, I don't know whether I should add up the results of all the files (e.g., emissions and total energy consumption) or whether CodeCarbon already tracked and aggregated them for me.

I have 8 emission files like this:

timestamp,project_name,run_id,duration,emissions,emissions_rate,cpu_power,gpu_power,ram_power,cpu_energy,gpu_energy,ram_energy,energy_consumed,country_name,country_iso_code,region,cloud_provider,cloud_region,os,python_version,codecarbon_version,cpu_count,cpu_model,gpu_count,gpu_model,longitude,latitude,ram_total_size,tracking_mode,on_cloud,pue
2024-05-05T04:04:43,Test,7e24a2c5-c7db-4065-94b2-edb51c0ac712,495066.1067533493,110.51787583060145,0.0002232386227273349,112.5,3893.2497087381207,188.89137411117554,15.470743219003149,261.0323906035237,25.890384321165783,302.3935181436902,,,,,,Linux-5.14.0-284.30.1.el9_2.x86_64-x86_64-with-glibc2.34,3.10.12,2.3.3,128,AMD EPYC 7713 64-Core Processor,8,8 x NVIDIA A40,7.0412,50.7088,503.71033096313477,machine,N,1.0

There is one of these files for each process (GPU). To get something like total emissions, do I have to sum the emissions across all 8 files, or is the value in a single file already this sum?

inimaz commented 1 month ago

Hello @Nkluge-correa, thanks for using codecarbon!

Could you provide a snippet of how you are initializing codecarbon? If I understand correctly, you are running one instance of codecarbon per GPU, all at the same time and all sharing the same CPU and RAM.

If that is the case (1 CPU + 8 GPUs + 1 RAM), each instance of codecarbon may be treating the system as 1 CPU + 1 GPU + 1 RAM, so CPU and RAM are measured 8 times and counted in the emissions, which overstates the real emissions.

Instead, I would suggest running codecarbon in the main process from start to finish. codecarbon uses pynvml under the hood, so it can detect all of your running GPUs and measure them.

For instance:

from codecarbon import EmissionsTracker

with EmissionsTracker() as tracker:
    # Your multi-process code here
Nkluge-correa commented 1 month ago

Hello @inimaz! CodeCarbon is awesome, and I am happy to use it.

That makes total sense. Here is a snippet of the code I am running:

# - Python version: 3.10.12
# - transformers==4.38.0.dev0
# - torch==2.1.0+cu121
# - pyyaml==6.0.1
# - datasets==2.16.1
# - wandb==0.16.2
# - codecarbon==2.3.3
# - huggingface_hub==0.20.2
# - accelerate==0.26.1
# - sentencepiece==0.1.99
# - flash-attn==2.5.0
# - deepspeed==0.14.0

from codecarbon import EmissionsTracker

# .... Before this, we are loading models and datasets, initializing optimizers, and all of that good old deep learning boilerplate.

# Preparing everything with `accelerator`. The `prepare` method will handle the device 
# placement and distributed training.
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

# Using `EmissionsTracker` to track the energy consumption of the training process.
tracker = EmissionsTracker(
    project_name='test',
    log_level="critical", # Set to "critical" to silence codecarbon.
    output_dir="ckpt",
    output_file=f"emissions_{accelerator.process_index}_{slurm_job_id}.csv",
    tracking_mode='machine', # We are tracking the energy consumption of the whole machine.
)

# Start codecarbon tracking before we start the training loop.
tracker.start()

for epoch in range(starting_epoch, training_args.num_train_epochs):

    # Iterate over the batches of data in the current epoch.
    for step, batch in enumerate(active_dataloader, start=1):

        with accelerator.accumulate(model):

            # Forward pass the batch through the model and get the loss.
            outputs = model(**batch)
            loss = outputs.loss
            total_loss += loss.detach().float()
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        if accelerator.sync_gradients:
            progress_bar.update(1)
            completed_steps += 1

        # Here, we are going to save the model checkpoint every `checkpointing_steps`.
        accelerator.wait_for_everyone()

        # Check if `checkpointing_steps` is an integer
        if isinstance(extra_args.checkpointing_steps, int):

            if completed_steps % extra_args.checkpointing_steps == 0 and completed_steps > 0:

                checkpoint_dir = f"step_{completed_steps}"
                # Join the output directory with the current checkpoint directory.
                checkpoint_dir = os.path.join(training_args.output_dir, checkpoint_dir)

                # Save accelerator state.
                accelerator.save_state(checkpoint_dir)

                # Save the model checkpoint to the current checkpoint directory.
                unwrapped_model = accelerator.unwrap_model(model)
                unwrapped_model.save_pretrained(
                    checkpoint_dir, is_main_process=accelerator.is_main_process, 
                    save_function=accelerator.save, state_dict=accelerator.get_state_dict(model)
                )

                # Flush tracker (all instances) at the checkpoint 
                tracker.flush()

        # If we have reached the `max_steps`, break the step loop.
        if training_args.max_steps > 0 and completed_steps >= training_args.max_steps:
            break

    # If we have reached the `max_steps`, break the epoch loop
    if training_args.max_steps > 0 and completed_steps >= training_args.max_steps:
        break

# Stop codecarbon tracking.
tracker.stop()

Should I move the initialization of EmissionsTracker and its flushing to inside an if accelerator.is_main_process: block? Then, only process 0 would track and flush. But would this give me the correct values for GPU, CPU, and RAM power and energy consumption?
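
For reference, a minimal, untested sketch of that rank-0-only pattern (reusing the accelerator and slurm_job_id objects from the script above) would look something like this:

from codecarbon import EmissionsTracker

# Sketch only: `accelerator` and `slurm_job_id` come from the training script above.
tracker = None
if accelerator.is_main_process:
    tracker = EmissionsTracker(
        project_name="test",
        log_level="critical",
        output_dir="ckpt",
        output_file=f"emissions_{slurm_job_id}.csv",
        tracking_mode="machine",
    )
    tracker.start()

# ... training loop ...

# At every checkpoint, only rank 0 flushes:
if accelerator.is_main_process:
    tracker.flush()

# At the end of training, only rank 0 stops the tracker:
if accelerator.is_main_process:
    tracker.stop()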

I hope my question is clearer now. And thanks for the super quick reply, @inimaz!

Nkluge-correa commented 1 month ago

To make my question clearer, let me lay out my doubts step by step:

  1. The code below generates 8 emissions.csv files. One per process (GPU). We flush the tracker at every checkpoint.

tracker = EmissionsTracker(
    tracking_mode='machine', # We are tracking the energy consumption of the whole machine.
)

tracker.start()

for epoch in range(starting_epoch, training_args.num_train_epochs):
    for step, batch in enumerate(active_dataloader, start=1):

        with accelerator.accumulate(model):

            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        if accelerator.sync_gradients:
            progress_bar.update(1)
            completed_steps += 1

        # Here, we are going to save the model checkpoint every `checkpointing_steps`.
        accelerator.wait_for_everyone()

        if completed_steps % extra_args.checkpointing_steps == 0 and completed_steps > 0:

            # Flush tracker (all instances) at the checkpoint 
            tracker.flush()

tracker.stop()
  2. Let us consider only one of these files (emissions_0.csv) and only the first flush (the first row of the CSV file). According to the tracker, 36,483 seconds (~10 hours, up to the first checkpoint) generated 8.15 kgCO2eq.

  3. CPU power is 112.5 W, GPU power is 711 W, and RAM power is 188 W. Given that I am using 8 GPUs, every GPU would be running at about 88 W, right? This seems wrong. If this were a measure of a single GPU, 711 W is way above the 300 W limit for an A40. Meanwhile, if this is the shared wattage, it is also wrong: according to W&B (and watching nvidia-smi), the GPUs run at about 80% capacity (~240 W) during this experiment. This also leads me to question the results related to CPU and RAM.

[W&B chart of GPU power during the run, 5/7/2024]

  4. According to the current documentation, cpu_energy, gpu_energy, and ram_energy are measured per device. The values for 10 hours of measurements are 1.14 kWh, 19.25 kWh, and 1.90 kWh, respectively. This also seems weird. If every single GPU is running at 88 W (according to CodeCarbon) for 10 hours and $E = \frac{88 \times 10}{1000}$, my GPU energy should be 0.88 kWh. If every single GPU is running at 230 W (according to W&B and nvidia-smi) for 10 hours and $E = \frac{230 \times 10}{1000}$, my GPU energy should be 2.3 kWh. How am I getting 19.25 kWh?

  5. Finally, let us suppose the values from the tracker are correct. Summing cpu_energy, gpu_energy, and ram_energy gives a total energy consumption of 22.3 kWh. Based on CodeCarbon's methodology, emissions equal $C \times E$. According to the data CodeCarbon uses, Germany (the locality of the experiments) has a $C$ (carbon intensity of the energy grid) of 0.37 kgCO2eq/kWh (please correct me if I am wrong). Hence, 22.3 kWh × 0.37 kgCO2eq/kWh gives me roughly the measured 8.15 kgCO2eq. However, if cpu_energy, gpu_energy, and ram_energy are measured per process, I should count CPU and RAM just once (because CPU and RAM are shared) but multiply gpu_energy (measured per device) by 8. Hence, to get the total emissions, I would need to do something like:

$$(\text{GPU}(19.25 \times 8) + \text{CPU}(1.14) + \text{RAM}(1.90)) \times 0.37$$

Right?
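
Just to make both readings concrete, a quick sanity check in plain Python, using only the numbers quoted above:

carbon_intensity = 0.37                        # kgCO2eq/kWh, value quoted above for Germany
cpu_kwh, gpu_kwh, ram_kwh = 1.14, 19.25, 1.90  # first ~10 hours, from emissions_0.csv

# Reading A: gpu_energy already aggregates all 8 GPUs.
total_a = cpu_kwh + gpu_kwh + ram_kwh          # 22.29 kWh
print(total_a * carbon_intensity)              # ~8.25 kgCO2eq, close to the reported 8.15

# Reading B: gpu_energy is per device, so multiply it by 8.
total_b = cpu_kwh + 8 * gpu_kwh + ram_kwh      # 157.04 kWh
print(total_b * carbon_intensity)              # ~58.1 kgCO2eq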

Sorry for asking so many questions, but while double- and triple-checking the results, many things stopped making sense to me, and I stopped trusting the numbers I was getting. The values mentioned are in this csv file. Right now, I feel that taking the values tracked by W&B and doing the calculation by hand gives more accurate results than what I am getting from the EmissionsTracker.

Could someone please shed some light on this? 🤔

inimaz commented 1 month ago

I haven't used accelerator, to be honest, but it looks like it creates the multiple processes from

...
with accelerator.accumulate(model):
...

if I am not mistaken? If that is the case, then change the output file of the emissions tracker to output_file=f"emissions_{slurm_job_id}.csv" and you might be good to go.

Nkluge-correa commented 1 month ago

@inimaz, could you take a look at my last comment? Do you think all of these weird measurements are the result of having too many processes running? It doesn't seem like it to me.

inimaz commented 1 month ago

I didn't see your last comment there. I haven't seen the codecarbon logs, so I cannot be sure, but to me it looks like you have 8 instances of codecarbon, and all of them are finding all 8 GPUs :D

> my GPU energy should be 2.3 kWh. How am I getting 19.25 kWh?

So by that calculation, 2.3 × 8 = 18.4 kWh ≈ 19?

Try using the EmissionsTracker as a singleton instead: define it once and share it between all your processes.

Nkluge-correa commented 1 month ago

Thanks, @inimaz; I will test with a singleton instead. But if GPU energy is measured across all GPUs and aggregated, maybe someone should consider making this more explicit in the documentation. For example, the wording of this table can be a little misleading ("gpu_energy | Energy used per GPU (kWh)"). It is apparently not per GPU...

inimaz commented 1 month ago

True, thanks for pointing it out!

Side note: There is a way to filter by gpu_id if you still want to have the split per GPU.

EmissionsTracker(gpu_ids=[0,1]) # Will only measure the emissions of GPUs 0 and 1

But you will still get the CPU power and RAM power counted in your emissions, so you will have to subtract them.
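
A rough, untested sketch of that post-hoc split, assuming pandas and the column names shown in your CSV above (adjust the path to one of your emissions files):

import pandas as pd

# Read one of the emissions files written by codecarbon.
df = pd.read_csv("ckpt/emissions_0.csv")

# Take the last row (values appear to be cumulative totals at each flush).
last = df.iloc[-1]

# GPU share of the total measured energy, then scale the reported emissions by it.
gpu_share = last["gpu_energy"] / last["energy_consumed"]
gpu_only_emissions = last["emissions"] * gpu_share

print(f"GPU-only energy:    {last['gpu_energy']:.2f} kWh")
print(f"GPU-only emissions: {gpu_only_emissions:.2f} kgCO2eq (approx.)")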

Nkluge-correa commented 1 month ago

I didn't know this. Thank you for pointing it out! I think that with a singleton tracker I will get my measurements per GPU (as intended), and since the CPU and RAM are shared by all processes, if I want just the GPU energy I will, as you said, need to subtract them.