quokka-astro / quokka

Two-moment AMR radiation hydrodynamics (with self-gravity, particles, and chemistry) on CPUs/GPUs for astrophysics
https://quokka-astro.github.io/quokka/
MIT License

AMD GPUs produce incorrect results #394

Open · psharda opened this issue 12 months ago

psharda commented 12 months ago

For the exact same setup, a Pop III simulation on Setonix GPUs shows a very different evolution compared to CPUs. The initial density projections from both are identical:

[image: plt00000_Projection_z_gasDensity_gasDensity]

However, after a few hundred steps, the GPU projections look weird whereas the CPU projections look more reasonable. GPU projection:

[image: plt00900_Projection_z_gasDensity_gasDensity]

CPU projection:

[image: plt00900_Projection_z_gasDensity_gasDensity]

The GPU run is much faster, but it needs flux corrections in several cells at each timestep almost from t = 0, whereas the CPU run doesn't need any. At some point, the GPU run aborts because rho = 0 or NaN during regridding.

There seems to be some issue with Setonix GPUs.

BenWibking commented 12 months ago

cc @markkrumholz @aditivijayan

markkrumholz commented 12 months ago

Can you try this on Gadi GPUs and see what happens?

markkrumholz commented 12 months ago

Also to test: run one time step only on both Setonix CPU and GPU. The results should be bit-wise identical. Are they, and if not, are they different in all cells or just some?

psharda commented 12 months ago

@markkrumholz - the last time I tried on Gadi GPUs, they gave a memory error. Lately, runs on gpuvolta are put on hold or do not start for hours. Is there an express GPU queue for testing and debugging on Gadi?

BenWibking commented 12 months ago

There is no express GPU queue. Can you run on the RSAA GPU nodes at NCI?

psharda commented 12 months ago

After one timestep, the min/max of the GPU-to-CPU density ratio is:

In [17]: np.min(dens_gpu / dens_cpu), np.max(dens_gpu / dens_cpu)
Out[17]:
(unyt_quantity(0.00999067, '(dimensionless)'),
 unyt_quantity(100.00105867, '(dimensionless)'))

But the minimum and maximum densities in both runs are similar. So it looks like the ratio ranges from 0.01 to 100 because some core cells have the background density in one run but not the other, and vice versa.

psharda commented 12 months ago

There is no express GPU queue. Can you run on the RSAA GPU nodes at NCI?

Is there a queue name for it? Or, is this the project with Yuan Sen as the PI? I don't think I'm a part of it.

BenWibking commented 12 months ago

fcompare might be a better tool to do a direct comparison between two plotfiles: https://github.com/AMReX-Codes/amrex/blob/development/Tools/Plotfile/fcompare.cpp

You can compile it by just typing make inside extern/amrex/Tools/Plotfile.
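
For example (a sketch; the executable suffix depends on compiler and architecture, e.g. fcompare.gnu.x86-milan.ex or fcompare.gnu.ex, and the plotfile paths here are illustrative):

# build the AMReX plotfile tools
cd extern/amrex/Tools/Plotfile
make
# compare CPU and GPU plotfiles from the same step
./fcompare.gnu.x86-milan.ex /path/to/cpu/plt00001 /path/to/gpu/plt00001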

BenWibking commented 12 months ago

There is no express GPU queue. Can you run on the RSAA GPU nodes at NCI?

Is there a queue name for it? Or, is this the project with Yuan Sen as the PI? I don't think I'm a part of it.

The latter. Does this work:

#!/bin/bash

#PBS -N quokka_run
#PBS -P dg97
#PBS -q gpursaa
#PBS -l ncpus=14
#PBS -l ngpus=1
#PBS -l mem=50GB
#PBS -m aeb
#PBS -l wd
#PBS -l walltime=1:00:00
#PBS -l storage=scratch/jh2+gdata/jh2
#PBS -l jobfs=50GB

# --- initialize Quokka ---
MPI_OPTIONS="-np $PBS_NGPUS --map-by numa:SPAN --bind-to numa --mca pml ucx"
echo "Using MPI_OPTIONS = $MPI_OPTIONS"

mpirun $MPI_OPTIONS ./popiii popii.in

psharda commented 12 months ago

Yeah, I'm not a part of dg97

psharda commented 12 months ago

[image]

psharda commented 12 months ago

fcompare might be a better tool to do a direct comparison between two plotfiles: https://github.com/AMReX-Codes/amrex/blob/development/Tools/Plotfile/fcompare.cpp

You can compile it by just typing make inside extern/amrex/Tools/Plotfile.

fgradient.cpp:9:10: fatal error: filesystem: No such file or directory
 #include <filesystem>
          ^~~~~~~~~~~~
compilation terminated.
make: *** [../../Tools/GNUMake/Make.rules:89: fgradient.gnu.x86-milan.ex] Error 1

BenWibking commented 12 months ago

oh weird, I just checked my email, and I was also removed from dg97...

psharda commented 12 months ago

Maybe because someone realized we are not at RSAA anymore? ;)

BenWibking commented 12 months ago

fcompare might be a better tool to do a direct comparison between two plotfiles: https://github.com/AMReX-Codes/amrex/blob/development/Tools/Plotfile/fcompare.cpp You can compile it by just typing make inside extern/amrex/Tools/Plotfile.

fgradient.cpp:9:10: fatal error: filesystem: No such file or directory
 #include <filesystem>
          ^~~~~~~~~~~~
compilation terminated.
make: *** [../../Tools/GNUMake/Make.rules:89: fgradient.gnu.x86-milan.ex] Error 1

Can you replace that line with:

#if __has_include(<filesystem>)
#include <filesystem>
#elif __has_include(<experimental/filesystem>)
#include <experimental/filesystem>
namespace std
{
namespace filesystem = experimental::filesystem;
}
#endif

markkrumholz commented 12 months ago

I’ll check with Yuan-Sen about dg97 when I get to the office in an hour.

psharda commented 12 months ago

psharda@setonix-02:/scratch/pawsey0807/psharda/quokka/cpubuild> ./fcompare.gnu.x86-milan.ex plt00001 ../gpubuild/plt00001

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------------
amrex::Abort::0::ERROR: grids do not match !!!
SIGABRT
See Backtrace.0 file for details

I get the same error for the 0th plt files too. Why won't the grids match?

markkrumholz commented 12 months ago

Grid generation is not independent of the number of MPI ranks, so if the number of ranks you've used differs between the CPU and GPU tests, you won't necessarily get the same grid layout. For the purposes of this test, you might also try turning off AMR and seeing whether the differences between CPU and GPU persist on a completely flat grid structure.
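
One minimal way to run that test (a sketch, assuming the standard AMReX amr.max_level parameter; the executable and input-file names are illustrative) is to override the inputs on the command line:

# force a single-level (uniform) grid for the CPU/GPU comparison
# (equivalently, set amr.max_level = 0 in the input file)
srun ./popiii popiii.in amr.max_level=0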

I have emailed Yuan-Sen and asked for you two to be added back to dg97.

BenWibking commented 12 months ago

You can allow different grids:

https://github.com/AMReX-Codes/amrex/blob/b2052f2d2e5ce44317450ba13de705a3e01ef0ea/Tools/Plotfile/fcompare.cpp#L35
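
For example (a sketch; -a is the short form of that allow-different-grids option, and the plotfile paths just follow the earlier attempt):

./fcompare.gnu.x86-milan.ex -a plt00001 ../gpubuild/plt00001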

psharda commented 11 months ago

I re-joined dg97

scratch on Setonix is down. Once it's back up, I'll be able to fcompare again

psharda commented 11 months ago

@BenWibking @markkrumholz something is definitely up with Setonix GPUs. Here are the density projections at the same timestep from Gadi CPU and GPU:

  1. Gadi CPU: [image: plt00750_Projection_z_gasDensity_gasDensity]

  2. Gadi GPU: [image: plt00750_Projection_z_gasDensity_gasDensity]

BenWibking commented 11 months ago

Can you fcompare between these two?

psharda commented 11 months ago

ps3459@gadi-login-08:/scratch/jh2/ps3459/quokka/build>./fcompare.gnu.ex -a plt00750 ../pold/plt00750

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------------
 level = 0
 gasDensity                         1.438428927e-33           3.088436355e-14
 x-GasMomentum                      1.932709218e-28           2.786481368e-14
 y-GasMomentum                      1.522501547e-28           2.262997499e-14
 z-GasMomentum                      2.358694107e-28           3.252484972e-14
 gasEnergy                          3.019209236e-23           1.028886349e-14
 gasInternalEnergy                  2.543580384e-23           9.592771651e-15
 scalar_0                           5.437038042e-43           3.174728929e-14
 scalar_1                           9.987110207e-40           3.176040866e-14
 scalar_2                           1.086344859e-33           3.057775805e-14
 scalar_3                           6.260684057e-47           3.216521655e-14
 scalar_4                           5.149771856e-44           3.300415905e-14
 scalar_5                           6.612155723e-38           3.109016504e-14
 scalar_6                           1.357764473e-48           3.276051438e-14
 scalar_7                           5.136719677e-51           3.265735903e-14
 scalar_8                           8.963144425e-37           3.237877791e-14
 scalar_9                           1.298535589e-52           3.312037584e-14
 scalar_10                          2.712016996e-40             3.1961632e-14
 scalar_11                          1.795210709e-83           3.185581266e-14
 scalar_12                           1.09031628e-76            3.17798777e-14
 scalar_13                          3.535887007e-34           3.209328795e-14
 temperature                        6.959453458e-09           2.321962785e-13
 velx                                1.21886842e-07           2.708626196e-13
 pressure                           1.695720256e-23           9.595083403e-15
 sound_speed                        2.512242645e-07           1.359843815e-13

BenWibking commented 11 months ago

Excellent, machine precision differences, as expected.

We should not do any production calculations on Setonix GPUs until this is fixed.

psharda commented 11 months ago

Do we discuss/report this to Setonix people?

BenWibking commented 11 months ago

Do we discuss/report this to Setonix people?

Yes, please open a help ticket and include a link to this GitHub issue.

BenWibking commented 11 months ago

If anyone still feels like debugging this, it might be worth checking whether this problem is fixed by disabling GPU-aware MPI by setting amrex.use_gpu_aware_mpi=0 (https://amrex-codes.github.io/amrex/docs_html/GPU.html#inputs-parameters).

BenWibking commented 11 months ago

If anyone still feels like debugging this, it might be worth checking whether this problem is fixed by disabling GPU-aware MPI by setting amrex.use_gpu_aware_mpi=0 (https://amrex-codes.github.io/amrex/docs_html/GPU.html#inputs-parameters).

In the .in file?

Yes, this can be set there, or on the command line (e.g. srun ./popiii input_file.in amrex.use_gpu_aware_mpi=0)

psharda commented 11 months ago

Ah, so if we do this, we shouldn't do export MPICH_GPU_SUPPORT_ENABLED=1 in the setonix-1node.submit script.

psharda commented 11 months ago

I used export MPICH_GPU_SUPPORT_ENABLED=0 in addition to amrex.use_gpu_aware_mpi=0. The run on Setonix GPUs still crashed after 60 steps, with the same flux-correction and rho = NaN errors.

BenWibking commented 11 months ago

I used export MPICH_GPU_SUPPORT_ENABLED=0 in addition to amrex.use_gpu_aware_mpi=0. The run on Setonix GPUs still crashed after 60 steps, with the same flux-correction and rho = NaN errors.

Well, I guess that's not it, then...

We might have to wait until the AMD GPU node is set up at RSAA and test this on an independent system with an up-to-date version of the AMD software stack.

BenWibking commented 11 months ago

@psharda Would it be possible to shrink this problem to fit on a single Setonix GPU and test that? Then we would also have a test case ready to run on the Avatar AMDGPU node.

This would also help isolate whether it's an MPI bug or something wrong with the GPU computation itself.

psharda commented 11 months ago

I mean, I only used 1 Setonix node to run these. So I guess that was 8 GPUs? I could try to run it on 1 GPU only.

psharda commented 11 months ago

Setonix doesn't like this, do you know why?

#!/bin/bash

#SBATCH -A pawsey0807-gpu
#SBATCH -J quokka_benchmark
#SBATCH -o 1node_%x-%j.out
#SBATCH -t 03:10:00
#SBATCH -p gpu-dev
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
##SBATCH --core-spec=8
#SBATCH -N 1

srun: error: nid003002: task 0: Segmentation fault (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=4670973.0

BenWibking commented 11 months ago

Hmm, no. Can you try the "Example 1: One process with a single GPU using shared node access" script here: https://support.pawsey.org.au/documentation/display/US/Setonix+GPU+Partition+Quick+Start#SetonixGPUPartitionQuickStart-Compilingsoftware

psharda commented 11 months ago

Nope, didn't work. I feel like I'm missing something trivial. Can you give it a try?

BenWibking commented 11 months ago

I am able to run the HydroBlast3D problem with this script:

#!/bin/bash --login
#SBATCH --account=pawsey0807-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 nodes in this example
#SBATCH --gpus-per-node=1      #1 GPUs per node (1 "allocation packs" in total for the job)
#SBATCH --time=00:20:00

# load modules
module load craype-accel-amd-gfx90a
module load rocm/5.2.3

srun -N 1 -n 1 -c 8 --gpus-per-node=1 ./build/src/HydroBlast3D/test_hydro3d_blast tests/blast_unigrid_256.in

BenWibking commented 11 months ago

All of the tests pass, except for the ChannelFlow test, which fails due to a compiler bug.

psharda commented 11 months ago

Tried the script

I am able to run the HydroBlast3D problem with this script:

#!/bin/bash --login
#SBATCH --account=pawsey0807-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 nodes in this example
#SBATCH --gpus-per-node=1      #1 GPUs per node (1 "allocation packs" in total for the job)
#SBATCH --time=00:20:00

# load modules
module load craype-accel-amd-gfx90a
module load rocm/5.2.3

srun -N 1 -n 1 -c 8 --gpus-per-node=1 ./build/src/HydroBlast3D/test_hydro3d_blast tests/blast_unigrid_256.in

Tried the script. PopIII sim still fails after 67 steps.

BenWibking commented 11 months ago

That's unfortunate. This has all the hallmarks of a compiler/driver bug... hopefully the AMDGPU testing node will be up and running soon.

BenWibking commented 10 months ago

It would be useful to re-test on Setonix with export HSA_ENABLE_SDMA=0 set. There are known bugs in the GPU driver that can cause incorrect results when this is set to 1 (which is the default).
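
For example, in the Setonix submit script this is just an extra export before the srun line (a sketch):

# work around known GPU driver bugs by disabling SDMA transfers
export HSA_ENABLE_SDMA=0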

BenWibking commented 10 months ago

@psharda for reference, can you post here the error you get when running on Moth?

BenWibking commented 10 months ago

Possibly related to: https://github.com/AMReX-Codes/amrex/issues/3623

BenWibking commented 7 months ago

@psharda Weiqun suggested adding amrex::Gpu::streamSynchronize() immediately after the chemistry ParallelFor here: https://github.com/quokka-astro/quokka/blob/2d863d655ec1e1f684f55ea745021b9c97a03c60/src/Chemistry.hpp#L144

That fixed the memory crashes with Castro on AMD GPUs when doing nuclear reactions.
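
For illustration, a generic sketch of the suggested pattern (not quokka's actual Chemistry.hpp; the function and variable names here are placeholders):

// Sketch: synchronize the GPU stream right after the chemistry kernel, so any
// device-side fault surfaces here rather than in a later, unrelated call.
#include <AMReX_Box.H>
#include <AMReX_Gpu.H>

void burnChemistryCells(amrex::Box const &indexRange)
{
    amrex::ParallelFor(indexRange, [=] AMREX_GPU_DEVICE(int i, int j, int k) noexcept {
        // ... per-cell primordial chemistry / ODE burn for cell (i, j, k) ...
    });
    amrex::Gpu::streamSynchronize(); // suggested addition
}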

BenWibking commented 6 months ago

The memory errors are fixed by adding -mllvm -amdgpu-function-calls=true to the compiler flags.

This works around AMDGPU compiler bugs when building very large kernels (the primordial chemistry network, the larger nuclear networks).
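
A sketch of one way to pass it at configure time (assuming the flag goes through CMAKE_CXX_FLAGS; the exact variable may differ depending on how the HIP build is driven):

cmake .. -DCMAKE_CXX_FLAGS="-mllvm -amdgpu-function-calls=true"   # plus the usual Quokka/AMReX GPU options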

There is still a real bug (at least, it also appears in CPU runs) where, at timestep 58, FOFC happens and then VODE fails to integrate the network (from @psharda):

Coarse STEP 58 at t = 2277615624 (2.277615624e-05%) starts ...
    >> Using global timestep on this coarse step (estimated work ratio: 1).
[Level 0 step 58] ADVANCE with time = 2277615624 dt = 228761562.4
[FOFC-1] flux correcting 1 cells on level 0
Coordinates: (-1.215046875e+18, -2.719390625e+18, 8.67890625e+17):  At cell (21,8,39) in Box ((0,0,32) (31,31,63) (0,0,0)): -7.6231090543352448e-24, -1.069268058649304e-17, 7.410320892898785e-18, 3.7609666350373797e-18, 5.3526763892812115e-10, 5.2902546443125309e-10, -2.8030653010662632e-33, -5.1467865827886286e-30, -5.8149295047562081e-24, -3.1857928751784568e-37, -2.5538870986621986e-34, -3.4809871986548917e-28, -6.7835317008872414e-39, -2.5744653822924568e-41, -4.5308774871340948e-27, -6.4171286356493599e-43, -1.3888196456542485e-30, -9.2237860360589853e-74, -5.6154994834351863e-67, -1.8032940347070311e-24
[FOFC-2] flux correcting 1 cells on level 0
Coordinates: (-1.215046875e+18, -2.719390625e+18, 8.67890625e+17):  At cell (21,8,39) in Box ((0,0,32) (31,31,63) (0,0,0)): -4.4439055516774191e-24, -9.7063852770399433e-18, 9.2891283062859141e-18, 3.8792162629719635e-18, 5.3612761960843476e-10, 5.2929401031211604e-10, -1.6338050081804726e-33, -2.9998038185290005e-30, -3.3898238659059865e-24, -1.8648566265595356e-37, -1.4894234044896535e-34, -2.0292483481208408e-28, -3.98069827049991e-39, -1.5061288633265628e-41, -2.6413500902039217e-27, -3.7645391218040064e-43, -8.0955352015117408e-31, -5.3722206263769706e-74, -3.102252662899324e-67, -1.0512335997061365e-24
DVODE: corrector convergence failed repeatedly or with abs(H) = HMIN
[ERROR] integration failed in net
istate = -6
zone = (0, 0, 0)
time = 0
dt = 114380781.1951238
temp start = 84.62999999999991
xn start = 0.007969664777026481 0.007969467582489294 8999.891176964589 4.929760939497877e-10 1.977453278654315e-07 0.269463080239167 5.254014144099686e-12 1.992932720226845e-14 3.506310230602529 3.313755276663732e-16 0.0007166261480667435 3.569520310420554e-47 2.119713584959603e-40 697.9645364035052
dens current = 1.974508903110628e-20
temp current = -nan
xn current = 0.007969664777026481 0.007969467582489294 8999.891176964589 4.929760939497877e-10 1.977453278654315e-07 0.269463080239167 5.254014144099686e-12 1.992932720226845e-14 3.506310230602529 3.313755276663732e-16 0.0007166261480667435 3.569520310420554e-47 2.119713584959603e-40 697.9645364035052
energy generated = 8676380294.558048
amrex::Abort::4::VODE integration was unsuccessful! !!!

BenWibking commented 1 month ago

For future reference, the underlying problem is that the AMDGPU compiler is fundamentally broken in how it spills registers to memory: https://discourse.llvm.org/t/the-current-state-of-spilling-function-calls-and-related-problems/2863

I think we just can't use AMD GPUs for chemistry until they rewrite how their compiler does register allocation.