mlco2 / codecarbon

Track emissions from Compute and recommend ways to reduce their impact on the environment.
https://mlco2.github.io/codecarbon
MIT License

Could not find mem= after running `scontrol show job $SLURM_JOB_ID` to count SLURM-available RAM. Using the machine's total RAM. #569

Closed VincenzoPar closed 1 week ago

VincenzoPar commented 3 weeks ago

Description

I'm trying to evaluate how consumption varies during LLM inference when changing some generation hyperparameters (temperature, top_k, etc.). For each hyperparameter configuration, I run a cycle of 10 generations with 10 different prompts saved in the 'instructions' list, and for each inference I create and start a new instance of OfflineEmissionsTracker. I'm running the script on an HPC cluster that uses SLURM as the workload scheduler. The problem, as mentioned in the title, is that codecarbon cannot read the memory requested in the .sh file and uses the node's total memory as a reference instead. How can I solve this? Below are the script for requesting resources and the part of the code where the tracker is instantiated.

[Two screenshots: the SLURM resource request script and the tracker instantiation code]

benoit-cty commented 2 weeks ago

Thanks for reporting this. Could you send us the output of `scontrol show job $SLURM_JOB_ID`? I ask because I made an update last year to solve this error in https://github.com/mlco2/codecarbon/pull/473

VincenzoPar commented 2 weeks ago

Yes! This is a simple script I made just to test the command and its output.

Script:

[image]

Output:

[image]

VincenzoPar commented 2 weeks ago

The SLURM version is 22.05

benoit-cty commented 1 week ago

Thanks, I think the problem comes from `mem_matches = re.findall(r"AllocTRES=.*?,mem=(\d+[A-Z])", scontrol_str)`. As you can see in your capture, there is no AllocTRES, only TRES.

Here is what I get on a SLURM 23.02.6 cluster:

   ReqTRES=cpu=10,mem=40000M,node=1,billing=10,gres/gpu=1
   AllocTRES=cpu=20,mem=40000M,node=1,billing=10,gres/gpu=1

So there is both the capacity the user asked for and the allocation they actually got.

We could update the code to look only for `mem=` if we don't find AllocTRES.

What do you think?
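To make the proposed fallback concrete, here is a minimal sketch of the idea. It is not the actual codecarbon code: the helper names `find_slurm_mem` and `mem_to_gb` are hypothetical, the TRES-only sample line uses made-up values, and the unit conversion assumes the suffixes are binary multiples.

```python
import re
from typing import Optional

# Assumption: SLURM's K/M/G/T suffixes are treated as binary multiples.
_UNIT_TO_GB = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def find_slurm_mem(scontrol_str: str) -> Optional[str]:
    """Return the memory string (e.g. '40000M') found in scontrol output, or None.

    Hypothetical helper illustrating the fallback discussed in this thread.
    """
    # Preferred: the memory actually allocated to the job.
    matches = re.findall(r"AllocTRES=.*?,mem=(\d+[A-Z])", scontrol_str)
    if not matches:
        # Fallback: no AllocTRES field, so accept the first 'mem=' entry.
        matches = re.findall(r"mem=(\d+[A-Z])", scontrol_str)
    return matches[0] if matches else None


def mem_to_gb(mem: str) -> float:
    """Convert a string such as '40000M' to gigabytes (hypothetical helper)."""
    value, unit = int(mem[:-1]), mem[-1]
    return value * _UNIT_TO_GB[unit]


# The two lines quoted above from the SLURM 23.02.6 cluster.
NEWER_OUTPUT = (
    "   ReqTRES=cpu=10,mem=40000M,node=1,billing=10,gres/gpu=1\n"
    "   AllocTRES=cpu=20,mem=40000M,node=1,billing=10,gres/gpu=1\n"
)
# Made-up example of output that has only a TRES field.
TRES_ONLY_OUTPUT = "   TRES=cpu=4,mem=8G,node=1,billing=4\n"

for sample in (NEWER_OUTPUT, TRES_ONLY_OUTPUT):
    mem = find_slurm_mem(sample)
    print(mem, mem_to_gb(mem) if mem else None)
```

One caveat with this approach: if the output contains ReqTRES but no AllocTRES, the fallback picks up the requested memory rather than the allocated one, which may or may not be what you want.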

benoit-cty commented 1 week ago

I opened a PR for that: https://github.com/mlco2/codecarbon/pull/584/files