graphnet-team / graphnet

A Deep learning library for neutrino telescopes
https://graphnet-team.github.io/graphnet/
Apache License 2.0

Docker image request #741

Open IvanMM27 opened 3 months ago

IvanMM27 commented 3 months ago

Dear all,

I was wondering whether it would be possible to provide a Dockerfile for building a GraphNeT Docker image that can be run inside a container. The idea is that a container gives us full control over an isolated environment: if we run into problems during training or inference, we can stop the container without disturbing other processes running on the cluster outside it.

I once stopped a training run with Ctrl+C in the terminal where it was running, but the GPUs remained frozen by ghost processes after the training script had stopped. I tried to stop them manually with kill commands, but nvtop then showed the processes with PID N/A, and nvidia-smi listed no processes on the GPUs at all even though they were still in use. Next, I tried to manually shut down the processes using the GPUs with the following two commands:

The first of them did not always work, whereas the second did. When this did not work, the cluster on which I was running GraphNeT had to be rebooted, leaving it unusable for my co-workers in the meantime.
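For illustration, one way to reduce the chance of leftover worker processes after Ctrl+C, without a container, is to launch the training script in its own process group and signal the whole group on interrupt. The following is only a minimal sketch under stated assumptions: it is POSIX-only, the helper name `run_in_process_group` and the `train.py` script are hypothetical, and none of this is part of GraphNeT.

```python
import os
import signal
import subprocess
import sys


def run_in_process_group(cmd, grace_seconds=10.0):
    """Run cmd in its own session; on Ctrl+C, signal its whole process group.

    Hypothetical helper for illustration only (POSIX-only).
    """
    # start_new_session=True calls setsid() in the child, so the child becomes
    # the leader of a new process group whose ID equals its PID. Worker
    # processes it spawns stay in that group unless they change it themselves.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait()
    except KeyboardInterrupt:
        try:
            # Ask every process in the group to terminate.
            os.killpg(proc.pid, signal.SIGTERM)
        except ProcessLookupError:
            # The group has already exited.
            return proc.wait()
        try:
            return proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            # Escalate if the group did not exit within the grace period.
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass
            return proc.wait()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: python run_group.py python train.py
    sys.exit(run_in_process_group(sys.argv[1:]))
```

A process blocked in an uninterruptible state would not respond even to SIGKILL, so a wrapper like this is not a substitute for the isolation a container would provide.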

Therefore, I would encourage the GraphNeT developers to reconsider providing a Dockerfile, as was done in the past. I would be very happy to help with this, but I am not sure I have all the knowledge required to do it alone.

Thank you very much!