librenms / librenms-agent

LibreNMS Agent & Scripts
GNU General Public License v2.0

fix(nvidia): fix for systems with more than 4 GPU and recent nvidia-smi version #506

Closed fbouynot closed 4 months ago

fbouynot commented 4 months ago

On a DGX-A100 with 8 GPUs and nvidia-smi 535.154.05, this script does not behave as expected. The script expects nvidia-smi to print information for only 5 GPUs, but that is no longer the case:

$ nvidia-smi dmon -c 1 -s pucvmet | grep -v ^# | sed 's/^ *//' | sed 's/  */,/g' | sed 's/-/0/g'
0,65,32,48,0,0,0,0,0,0,1593,210,0,0,0,1,0,0,0,0,0,6,
1,64,31,44,0,0,0,0,0,0,1593,210,0,0,0,1,0,0,0,0,0,0,
2,65,32,46,0,0,0,0,0,0,1593,210,0,0,0,1,0,0,0,0,2,3,
3,62,31,47,0,0,0,0,0,0,1593,210,0,0,0,1,0,0,0,0,0,0,
4,65,36,51,0,0,0,0,0,0,1593,210,0,0,0,1,0,0,0,0,0,5,
5,66,35,49,0,0,0,0,0,0,1593,210,0,0,0,1,0,0,0,0,0,0,
6,67,36,49,0,0,0,0,0,0,1593,210,0,0,0,1,0,0,0,0,0,0,
7,67,35,49,0,0,0,0,0,0,1593,210,0,0,0,1,0,0,0,0,0,0,

The issue is that the script then starts its loop at index 5 (the 6th GPU), which produces duplicate rows:

$ /etc/snmp/nvidia 
0,396,59,64,100,38,0,0,0,0,1593,1230,0,0,62766,4,0,0,0,0,16,6,
1,408,60,61,100,40,0,0,0,0,1593,1305,0,0,62900,4,0,0,0,0,18,16,
2,394,62,63,100,43,0,0,0,0,1593,1230,0,0,62900,4,0,0,0,0,17,8,
3,412,61,63,100,41,0,0,0,0,1593,1260,0,0,62900,4,0,0,0,0,18,15,
4,403,74,74,100,38,0,0,0,0,1593,1260,0,0,62900,4,0,0,0,0,9,3,
5,396,73,72,100,38,0,0,0,0,1593,1215,0,0,62900,4,0,0,0,0,10,3,
6,403,73,74,100,40,0,0,0,0,1593,1155,0,0,62900,4,0,0,0,0,9,4,
7,390,71,71,100,36,0,0,0,0,1593,1215,0,0,62756,4,0,0,0,0,9,3,
5,398,73,72,100,42,0,0,0,0,1593,1215,0,0,62900,4,0,0,0,0,10,3,
6,342,72,73,100,42,0,0,0,0,1593,1230,0,0,62900,4,0,0,0,0,8,3,
7,405,71,72,100,38,0,0,0,0,1593,1200,0,0,62756,4,0,0,0,0,9,8,
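
One quick way to confirm the duplication is to list every GPU index (the first comma-separated field) that appears more than once. This is only a sketch: `duplicate_gpu_indices` is a hypothetical helper name and is not part of the nvidia script.

```shell
# Hypothetical helper: reads comma-separated rows on stdin and prints
# every GPU index (first field) that occurs more than once.
duplicate_gpu_indices() {
    cut -d, -f1 | sort -n | uniq -d
}

# Usage, with the script path from the report above:
#   /etc/snmp/nvidia | duplicate_gpu_indices
```

For the output shown above, this would report indices 5, 6 and 7.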

My fix addresses this by starting the loop at a variable index instead of a fixed one: the number of lines printed by the first command.
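
The idea can be sketched roughly as follows. This is not the actual patch: `query_gpus` and `collect_all` are hypothetical names, the cap of 64 is an arbitrary safety limit, and the sketch assumes that rows for a valid GPU always begin with that GPU's index, so anything else (such as an error message for a nonexistent index) ends the loop.

```shell
# Sketch only: helper names and the safety cap are assumptions,
# not taken from the real /etc/snmp/nvidia script.

# Run dmon once and normalise its output into comma-separated rows.
# With no argument all GPUs are queried; with an index, only that GPU.
query_gpus() {
    if [ -n "$1" ]; then
        nvidia-smi dmon -i "$1" -c 1 -s pucvmet
    else
        nvidia-smi dmon -c 1 -s pucvmet
    fi | grep -v '^#' | sed 's/^ *//' | sed 's/  */,/g' | sed 's/-/0/g'
}

collect_all() {
    local first next gpu

    # First pass: whatever nvidia-smi prints without -i.
    first=$(query_gpus | grep '^[0-9]')
    [ -n "$first" ] || return 0
    printf '%s\n' "$first"

    # GPU indices are zero-based, so the number of rows already printed
    # is also the index of the first GPU not yet reported.
    gpu=$(printf '%s\n' "$first" | grep -c '^[0-9]')

    # Query any remaining GPUs one at a time.
    while [ "$gpu" -lt 64 ]; do
        next=$(query_gpus "$gpu" | grep "^${gpu},")
        [ -n "$next" ] || break
        printf '%s\n' "$next"
        gpu=$((gpu + 1))
    done
}
```

With a recent nvidia-smi that already prints all 8 GPUs, the loop starts at index 8, finds nothing, and exits without adding rows. With a version that prints only the first 5, it starts at index 5 and appends the rest.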