Process on GPU killed after long run and/or restart

Qcellaris commented 2 years ago

Dear developers,

I was using PySPH on my old GPU (GeForce GTX 680) for a while and I recently started using it on a newer GPU (NVIDIA TITAN V) as well. Unfortunately, there are some strange issues showing up, so hopefully someone can help me out here:

When I run a simulation for a longer while, e.g. 8 hours or so, it suddenly gets killed. The resulting error log can be found in the attachment.
When I try to restart the simulation from any of the previous restart files, it gets killed again right away with the same error messages.
The issue also shows up when I want to restart a simulation that ran for a short time, say 30 minutes, and finished correctly.

This problem doesn't appear on my GTX 680, but on that machine I am using an older version of PyOpenCL. I tried using this same older version of PyOpenCL on the TITAN V, but it didn't solve the issue.

Best,

Stephan

gpu_err.log

Qcellaris commented 2 years ago

I have also tested the code on the HPC cluster where we are using TESLA V100s and I run into the same errors. I have attached the corresponding error log here as well but I don't think it really gives us any new information.

slurm-76649.txt

inducer commented 2 years ago

prabhuramachandran commented 2 years ago

Hi, sorry about the slow response. Is it possible that there is a blow up of the particles and a large increase of the domain size? As regards the restart it is possible that there is an issue with restarting on the GPU because some necessary state has not been saved. Do you have a small reproducible example? That would help debug this issue better.

Qcellaris commented 2 years ago

I don't see any blow up or large increase in the domain size when I look at the results of the last data frame. The crash happens after hours of simulation on the GPU (approximately 70k iterations). If I restart from the last output file on a CPU it just continues fine without blow up or large increase in domain size. I have attached the files of the simulations that gave us these errors (I changed the .py extension to .txt, otherwise I couldn't include them in this message). I will check if I also encounter this issue with a smaller example.
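For reference, the kind of check meant by "no blow up or large increase in the domain size" can be sketched as follows. This is a minimal sketch in plain NumPy, assuming the particle coordinates of a data frame are already available as arrays; the function name `domain_report` is hypothetical and is not part of PySPH.

```python
import numpy as np

def domain_report(x, y, z):
    """Count non-finite particles and measure the bounding-box extent.

    Hypothetical helper: x, y, z are 1D coordinate arrays taken from
    one saved data frame.
    """
    pos = np.column_stack([x, y, z]).astype(np.float64)
    ok = np.isfinite(pos).all(axis=1)   # particles without NaN/inf coordinates
    good = pos[ok]
    if good.size:
        extent = good.max(axis=0) - good.min(axis=0)
    else:
        extent = np.zeros(3)
    return {"n_bad": int((~ok).sum()), "extent": extent}

# Compare the extent of the last frame against that of an early frame.
report = domain_report([0.0, 1.0, 2.0], [0.0, 0.0, 0.0], [0.0, 0.5, 1.0])
print(report["n_bad"], report["extent"])
```

A nonzero `n_bad`, or an extent that is much larger than in earlier frames, would point to a blow-up; neither shows up in the last data frame here.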

We have been looking into the issue ourselves for a while as well. Inducer mentioned in the comment above: "An unsigned integer underflow comes to mind as a possible reason." The only unsigned integers I found are particle indices, which are used in, e.g., neighbor lists. Since the code runs fine on a CPU and this error only occurs on a GPU, we thought it might be related to the specific implementation of neighbor lists on the GPU. Could it be that neighbor list memory on the GPU is not dynamic and that the length of the neighbor lists has a fixed maximum? Say the maximum number of neighbors in the neighbor list is 30: if at some moment during the simulation the number of neighbors exceeds this, we might end up with this kind of unsigned integer underflow problem. If there is such a hard cap on the number of neighbors, I could try changing it to a larger number and see if that solves the issue, but I couldn't find anything on that matter.
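To illustrate the kind of underflow meant here, below is a minimal NumPy sketch. It assumes a fixed neighbor cap purely for the sake of the example: `MAX_NBRS` and `slots_left` are hypothetical names and are not taken from PySPH, whose GPU neighbor search may work quite differently.

```python
import numpy as np

MAX_NBRS = 30  # hypothetical hard cap on neighbors per particle

def slots_left(count):
    """Remaining slots in a fixed-size neighbor list, in uint32 arithmetic."""
    cap = np.array([MAX_NBRS], dtype=np.uint32)
    cnt = np.array([count], dtype=np.uint32)
    # Unsigned array subtraction wraps modulo 2**32 instead of going negative.
    return int((cap - cnt)[0])

print(slots_left(29))  # one slot left
print(slots_left(31))  # wraps around to 2**32 - 1 instead of -1
```

If a wrapped value like this were later used as a loop bound or an array offset on the device, it would read or write far out of range, which could plausibly kill the process on a GPU while going unnoticed elsewhere.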

collision.txt surface_tension_adami.txt

Qcellaris commented 1 year ago

Dear Prabhu,

The simulations continue perfectly fine on a CPU, but when I continue them on a GPU they crash. If it were a blow-up of particles, it would crash on the CPU as well as on the GPU, right?

I also don't think it is related to some necessary state not being saved for the restart. When we run the simulation from the start, it also crashes at the same point every time, and restarting from there gives the same error log as the run straight from the start.

Maybe we can look into this issue in more detail together?

Best,

Stephan

pypr / pysph

Process on GPU killed after long run and/or restart #348