Dear all,

I was wondering whether it would be possible to provide a Dockerfile for building a GraphNeT Docker image so that GraphNeT can be run inside a container. The idea behind this is that, when running in a container, we have full control over an isolated environment: if we run into issues during training/inference, we can stop the container without disturbing other processes running on the cluster outside of it.
It happened to me that I stopped a training with `Ctrl+C` in the terminal where it was running, but somehow the GPUs were left frozen with ghost processes after the training script had stopped. I tried to kill them manually with `kill` commands, but the corresponding PIDs then showed up as `N/A` in `nvtop`, and `nvidia-smi` did not list any processes on the GPUs even though they were still occupied. The next thing I tried was to shut down the processes holding the GPUs with the following two commands:
fuser -v /dev/nvidia*
kill $(lsof -t /dev/nvidia*)
The first of them didn't always work completely, whereas the second one did. In the cases where it didn't work, the cluster on which I was running GraphNeT had to be rebooted, making it unusable for other co-workers in the meantime...
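With the training running inside a container instead, recovering from this situation would just be a matter of killing the container on the host. As an illustration of that workflow (the image name `graphnet:latest`, the mount path and the `train.py` entry point are placeholders, and `--gpus all` assumes the NVIDIA Container Toolkit is installed):

```bash
# Start training in an isolated container; image name, mount path and
# entry point are placeholders, not actual GraphNeT artifacts.
docker run --gpus all --name graphnet-train --rm \
    -v /path/to/data:/data \
    graphnet:latest \
    python train.py

# If the training hangs or leaves ghost processes on the GPUs, killing
# the container releases everything it held without touching other
# jobs running on the host.
docker kill graphnet-train
```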
Therefore, I would like to encourage the GraphNeT developers to reconsider providing a Dockerfile, as was done in the past. I will be very happy to help with this, but I am not sure I have all the required knowledge to do it on my own.
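To get the discussion started, here is a very rough sketch of what such a Dockerfile could look like. The base image, CUDA version and installation command are assumptions on my side and would need to be adjusted by people who know the package's dependencies better:

```dockerfile
# Rough sketch only -- base image, CUDA version and install step are assumptions.
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime

# git is needed to install GraphNeT directly from the repository.
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

# Install GraphNeT from GitHub (exact extras and version pins to be decided).
RUN pip install --no-cache-dir "git+https://github.com/graphnet-team/graphnet.git"

WORKDIR /workspace
CMD ["/bin/bash"]
```

The image could then be built with `docker build -t graphnet:latest .` and used as in the `docker run` example above.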
Thank you very much!