It seems the issue goes away if I use double the VRAM (this is on a flexible NVIDIA GRID setup). That said, I monitored it closely and I never exceeded 75% VRAM usage on the smaller GPU. I don't know what could be causing this issue, but I would like to use the smallest GPU possible for cost reasons, and I don't understand why 75% VRAM usage would be a problem.
You are not using a memory pool in this case. Do you get the same error if you set a memory pool size close to the VRAM size you request (using the smaller one)?
Also, the error is happening pretty late into the calculation: does it always happen at exactly the same spot if you repeat it? Your errors with the smaller VRAM size may be because you are sharing the GPU with other jobs and that's leading to bad interactions between jobs. When you request a higher VRAM size, you may be getting the GPU exclusively.
Hmm, I thought I was setting the memory pool environment variable. Not sure why it didn't take; I think the way I set up my sbatch script, it must not be inheriting environment variables correctly. I will fix this, but so that I can check that it worked, how did you determine that I was not using it?
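For concreteness, something along these lines is what I plan to put directly in the sbatch script so it no longer depends on environment inheritance (the JDFTX_MEMPOOL_SIZE value is in MB per the JDFTx GPU docs; the job options, pool size, and paths here are just placeholders, not my exact setup):

#!/bin/bash
#SBATCH --job-name=jdftx-gpu
#SBATCH --gres=gpu:1
# Export inside the script itself so the pool size is set regardless of
# what the submitting shell passed along.
export JDFTX_MEMPOOL_SIZE=8192   # MB, a bit below the VRAM share of the node
srun /home/exouser/jdftx/build/jdftx_gpu -i in -o out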
Regarding the place in the calculation where the error happens: on every single calculation, it always happens right after the Nonlinear screening printout at the start of the electronic minimization. My guess is there is a memory spike there that is tipping it over some threshold.
-------- Electronic minimization -----------
Nonlinear fluid (bulk dielectric constant: 78.4) occupying 0.524134 of unit cell
Nonlinear screening (bulk screening length: 5.74355 bohrs) occupying 0.524134 of unit cell
Regarding sharing the GPU, there are 4 sizes available on Jetstream2, which slice up the 40 GB VRAM A100. The small nodes (8 GB VRAM) are too small to be used consistently without going OOM, except on certain smaller jobs. The medium nodes (10 GB VRAM, 1/4 of the GPU) never exceed 8.5 GB VRAM usage according to my logs, so in theory they should be good to go. I am tracking this by dumping nvidia-smi to a file every 1 second while the job runs. However, despite nominally appearing to be sufficient, these medium nodes consistently produce the issue I reported here. In contrast, the large nodes (20 GB VRAM, 1/2 of the GPU) work well and the issue does not appear, despite the fact that they are also shared GPUs. I assume there is simply enough VRAM clearance on them to keep whatever effect is happening from becoming an issue.
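For reference, the logging I describe is nothing fancy; roughly something like the following (the exact query fields, interval, and file name here are just illustrative, not my literal command):

# sample memory and compute utilization once per second alongside the job
nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu \
           --format=csv -l 1 > gpu_usage.csv &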
Since originally posting this, I have additionally been informed by the Jetstream2 team that the small and medium nodes will be decommissioned, so I think this might be a non-issue now since I have to use the large nodes anyway.
Also, I have been trying to understand a strange behavior. It would seem that in cases where the job can run on all 4 node sizes, it always shows exactly the same utilization % (according to nvidia-smi) and takes the same total wall time for the calculation regardless of which node it runs on. In these test cases the utilization was between 40-60% while running. This baffles me and I cannot make heads or tails of it. Have you ever seen this kind of behavior?
For the memory pool size, there will be a line in the header of the log file that reports the pool size, if set. (Just after the number of CPUs and GPUs.)
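For example, something like this should show whether it was picked up (assuming the output file is named "out"; the exact wording of the header line may differ between versions):

grep -i "pool" out | head -n 1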
In the example you sent, it did not happen the first time it went into an electronic minimization, but after many ionic steps / Wolfe line-minimization attempts. I meant to ask if it's always happening after a fixed number of ionic steps.
I'm not surprised about the removal of the smaller shares. I've seen numerous issues with shared-GPU runs, even when jobs use single GPUs entirely and only share a node, due to nvidia driver interaction issues. I was surprised to hear that this resource was deployed with sharing within GPUs. I suspect the errors may stem from how the setup forces the GPU to be split, and that splitting it into a larger number of smaller shares makes them more likely.
Finally, the utilization refers to GPU compute utilization, not memory. Your current job may not be able to saturate more than 40-60% of the A100. You can reach 100% utilization for large calculations that are dominated by wavefunction orthonormalization (BLAS3) operations, but will likely saturate lower for FFT-dominated smaller calculations.
Right, so my confusion about the utilization is not about memory vs. cores; it's that on a shared resource, wouldn't the compute also be proportionately sliced? Otherwise there would be a serious issue with resource contention if different fractional users exceeded their fractional share. I have noticed, for example, that doubling up JDFTx jobs on a single GPU and saturating it is slower than running them serially. Wouldn't such a setup be seriously prone to this problem?
Also, the issue I posted here does not seem to happen after a fixed number of ionic steps. Some jobs hit it right away, some run for many steps before crashing. That said, I do not know whether the same job run fresh would crash at the same point both times. I suspect it would not, because restarting the job from the wavefunctions after a crash does not immediately crash. My theory is that it's basically a memoryless/Poisson event.
Wow the utilization is markedly higher with the mempool pre-allocated. That was a big win, thank you!
Every 0.5s: nvidia-smi                                          gpu-lg11: Fri Sep 27 02:08:28 2024

Fri Sep 27 02:08:28 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100X-20C                 On  | 00000000:00:06.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |  18547MiB / 20480MiB |     70%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    614348      C   /home/exouser/jdftx/build/jdftx_gpu       18511MiB |
+---------------------------------------------------------------------------------------+
Good to see that: the low utilization without the pool is most likely due to time wasted by the CUDA driver on memory allocations. As for the previous question on fractional usage when the GPU is shared, I'm not familiar with how the resource is being split or what exactly nvidia-smi reports for a split GPU: is it your share only, or still a percentage of the full GPU?
I did some digging, and apparently it's the whole GPU. 😱 So they have no way of preventing contention, it would seem.
That's kind of what I was expecting. I suspect there's not much to do here other than avoid GPU sharing when possible.
(Brief rant: the AI boom is starting to ruin GPUs for HPC now, after a big boost to their capabilities.)
Could you elaborate on that rant more for your curious audience?
My gripes that I can remember for now:
Ah that makes a lot of sense. Yeah AI bubbles definitely seem to come and go with a unique spin each time, and GPUs seem to be the flavor of the month for this one.
While running a calculation that has subsequently worked on CPUs, the GPU version of JDFTx crashed. Files and output: dft_files.zip
[gpu-sm00:26622] *** Process received signal ***
[gpu-sm00:26622] Signal: Floating point exception (8)
[gpu-sm00:26622] Signal code: Integer divide-by-zero (1)
[gpu-sm00:26622] Failing at address: 0x7c7062a23b4c
[gpu-sm00:26622] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7c7061642520]
[gpu-sm00:26622] [ 1] /home/exouser/jdftx/build/libjdftx_gpu.so(_Z7ceildivIiET_S0_S0_+0x1a)[0x7c7062a23b4c]
[gpu-sm00:26622] [ 2] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZN17GpuLaunchConfig1DC2IFviii7vector3IdEPKS1_IiE7matrix3IdEPKS2_15RadialFunctionGP7complexEEEPT_i+0x73)[0x7c70631fd53b]
[gpu-sm00:26622] [ 3] /home/exouser/jdftx/build/libjdftx_gpu.so(_Z7Vnl_gpuILi0ELi0EEviii7vector3IdEPKS0_IiE7matrix3IdEPKS1_RK15RadialFunctionGP7complexS8_i+0x37b)[0x7c70631e8692]
[gpu-sm00:26622] [ 4] /home/exouser/jdftx/build/libjdftx_gpu.so(_Z7Vnl_gpuiiiii7vector3IdEPKS_IiE7matrix3IdEPKS0_RK15RadialFunctionGP7complexS7_i+0xf2)[0x7c70631c0f3b]
[gpu-sm00:26622] [ 5] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZNK11SpeciesInfo4getVERK12ColumnBundlePK7vector3IdEi+0x3e5)[0x7c7062e50b4d]
[gpu-sm00:26622] [ 6] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZNK11SpeciesInfo14augmentOverlapERK12ColumnBundleRS0_P6matrix+0x17d)[0x7c7062e33a1b]
[gpu-sm00:26622] [ 7] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZNK7IonInfo14augmentOverlapERK12ColumnBundleRS0_PSt6vectorI6matrixSaIS5_EE+0x9b)[0x7c7062d82bdb]
[gpu-sm00:26622] [ 8] /home/exouser/jdftx/build/libjdftx_gpu.so(_Z1ORK12ColumnBundlePSt6vectorI6matrixSaIS3_EE+0x8e)[0x7c7062b5fb77]
[gpu-sm00:26622] [ 9] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZN13ElecMinimizer9constrainER12ElecGradient+0xbb)[0x7c7062ca22f3]
[gpu-sm00:26622] [10] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZN11MinimizableI12ElecGradientE8minimizeERK14MinimizeParams+0x1c9)[0x7c7062c76aa1]
[gpu-sm00:26622] [11] /home/exouser/jdftx/build/libjdftx_gpu.so(_Z12elecMinimizeR10Everything+0x14c)[0x7c7062ca30ad]
[gpu-sm00:26622] [12] /home/exouser/jdftx/build/libjdftx_gpu.so(_Z17elecFluidMinimizeR10Everything+0x64d)[0x7c7062ca37ba]
[gpu-sm00:26622] [13] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZN14IonicMinimizer7computeEP13IonicGradientS1_+0xee)[0x7c7062da9898]
[gpu-sm00:26622] [14] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZN14MinimizeLinmin16linminCubicWolfeI13IonicGradientEEbR11MinimizableIT_ERK14MinimizeParamsRKS3_dRdSB_RS3_SC_+0x209)[0x7c7062dafd32]
[gpu-sm00:26622] [15] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZN11MinimizableI13IonicGradientE5lBFGSERK14MinimizeParams+0xbbf)[0x7c7062dae599]
[gpu-sm00:26622] [16] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZN11MinimizableI13IonicGradientE8minimizeERK14MinimizeParams+0xc6)[0x7c7062dab5b0]
[gpu-sm00:26622] [17] /home/exouser/jdftx/build/libjdftx_gpu.so(_ZN14IonicMinimizer8minimizeERK14MinimizeParams+0x27)[0x7c7062daa955]
[gpu-sm00:26622] [18] /home/exouser/jdftx/build/jdftx_gpu(+0x50969)[0x649b20291969]
[gpu-sm00:26622] [19] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7c7061629d90]
[gpu-sm00:26622] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7c7061629e40]
[gpu-sm00:26622] [21] /home/exouser/jdftx/build/jdftx_gpu(+0x47e95)[0x649b20288e95]
[gpu-sm00:26622] *** End of error message ***