Open rahadi23 opened 1 week ago
Upon further inspection, we found that this happens because the program is running on a busy GPU. The solution would be to select the least busy (least utilized) GPU, which is possible since we are working on a multi-GPU cluster. As a first step, we can find the GPU with the most free memory (a rough proxy for the least busy one) with this command:
echo $(nvidia-smi --query-gpu=memory.free,index --format=csv,nounits,noheader | sort -nr | head -1 | awk '{ print $NF }')
However, it would be nice if the program could select the GPU on the fly, so we don't need to check which GPU to use each time.
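As a rough illustration of what on-the-fly selection could look like, here is a minimal Python sketch. It assumes the program can run Python before initializing CUDA, and it reuses the same `nvidia-smi` query as the command above. The function names `parse_freest_gpu` and `select_freest_gpu` are hypothetical, not part of this project.

```python
import os
import subprocess


def parse_freest_gpu(csv_text: str) -> int:
    """Return the index of the GPU with the most free memory.

    Expects one line per GPU in the form "<memory.free in MiB>, <index>",
    as printed by the nvidia-smi query used in select_freest_gpu().
    """
    best_index, best_free = None, -1
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        free_str, index_str = (field.strip() for field in line.split(","))
        free = int(free_str)
        if free > best_free:
            best_index, best_free = int(index_str), free
    if best_index is None:
        raise RuntimeError("nvidia-smi reported no GPUs")
    return best_index


def select_freest_gpu() -> int:
    """Query nvidia-smi and expose only the freest GPU to this process."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.free,index",
            "--format=csv,nounits,noheader",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    index = parse_freest_gpu(result.stdout)
    # Assumption: this runs before any CUDA context is created; the
    # variable has no effect on a context that already exists.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return index
```

Calling `select_freest_gpu()` at the very start of the program would make the chosen GPU appear as device 0 to CUDA. Like the shell command, this ranks GPUs by free memory only, so it may not pick the GPU with the lowest compute utilization.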
In some cases, the GPU run time is much longer than the CPU run time. This is unexpected because the GPU should easily outperform the CPU, and it leads to inaccurate analysis and interpretation.