blk1m all NaN after first timestep

claresinger commented 1 year ago

I was trying to run the vanilla blk1m dycoms-rf02 case (to help debug blk2m, actually), but the output is all NaN. The initial condition is not NaN, so the first file, timestep0000.h5, looks fine (see plot), but in every later file all fields are NaN. Below is the exact command I'm using to run the model. I don't think this is a problem with the singularity image I'm using or with my cluster, because the blk2m cases I'm running work fine. No error messages are printed.

/home/csinger/microphys/UWLCM/build/uwlcm --outdir=/home/csinger/microphys/UWLCM_output/dycoms_blk1m/dycoms_default/exp1 --case=dycoms_rf02 --nx=129 --ny=0 --nz=301 --dt=1 --spinup=300 --nt=900 --outfreq=100 --rng_seed=0 --serial=1 --backend=serial --sgs=0 --micro=blk_1m  --rc_src=1 --rr_src=1

pdziekan commented 1 year ago

@claresinger I will try to run and debug it

pdziekan commented 1 year ago

NaNs appear only in Release mode, not in Debug or RelWithDebInfo. Release uses the -Ofast optimization, which enables fast-math and can make floating-point math inaccurate. I will try to find the exact computation that fails, but for now use RelWithDebInfo (it gives reasonable performance with -O3). When configuring UWLCM, use cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo.

BTW, I recommend using a few threads to make it run faster, e.g. for 4 threads:

OMP_NUM_THREADS=4 /home/csinger/microphys/UWLCM/build/uwlcm --outdir=/home/csinger/microphys/UWLCM_output/dycoms_blk1m/dycoms_default/exp1 --case=dycoms_rf02 --nx=129 --ny=0 --nz=301 --dt=1 --spinup=300 --nt=900 --outfreq=100 --rng_seed=0 --serial=0 --sgs=0 --micro=blk_1m --rc_src=1 --rr_src=1

claresinger commented 1 year ago

I'm not getting NaNs when I compile in debug mode, but the first two runs I tried failed as soon as the spin-up period ended, with this same error:

    299 A negative number -2.18828e-07 detected in: rc after first half of rhs
    300 CHEATING: turning negative values to small positive values
    ...
    901 A not-finite number detected in: RHS of rc after rc_src
    902 (-2,126) x (-2,298)
    903 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6.05611e-16 3.98759e-12 2.27383e-13 3.32171e-10 3.65113e-09 3.13788e-09 2.68132e-09 3.25299e-09 3.26992e-09 4.20334e-09 3.7802e-09 2.28362e-09 -1.69619e-11 3.62539e-09 4.35073e-09 5.16453e-09 3.7283e-09 2.17633e-09 2.88774e-09 1.42543e-09 5.52927e-09 6.50989e-09 3.40651e-09 3.31003e-09 4.20906e-09 4.17311e-09 2.15725e-09 3.81626e-09 3.03368e-09 4.95442e-09 6.0601e-09 2.95281e-09 5.66372e-09 2.78304e-09 6.49154e-09 7.43002e-09 -8.24271e-10 4.52873e-09 7.58061e-09 4.59449e-09 4.00976e-09 6.21262e-09 2.84921e-09 1.04071e-09 7.57597e-09 2.58771e-09 1.51047e-09 6.36718e-09 5.78102e-09 3.3953e-09 4.39388e-09 8.3508e-09 6.11775e-09 1.57571e-09 5.941e-09 -7.44315e-09 6.44774e-09 -2.80763e-08 -2.74393e-08 -2.59706e-08 -4.97819e-08 -6.44901e-08 -6.76267e-08 -8.64436e-08 -8.86416e-08 -9.72918e-08 -1.09204e-07 -1.08617e-07 -1.23585e-07 -1.42612e-07 -1.63702e-07 -1.8782e-07 -2.00461e-07 -2.24821e-07 -2.73575e-07 -3.37978e-07 -4.96896e-07 -nan -1.17185e-07 1.34029e-11 6.57347e-15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    ...
    1032 uwlcm: /home/csinger/microphys/UWLCM/src/solvers/common/../../detail/checknan.cpp:46: void nancheck_hlprs::nancheck_hlpr(const arr_t&, const string&) [with arr_t = blitz::Array<float, 2>; std::string = std::__cxx11::basic_string<char>]: Assertion `0' failed.
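
For readers unfamiliar with this kind of output: the log shows two different checks, one that reports negative values in a field and replaces them with small positive ones, and one that aborts when a non-finite value is found. A minimal sketch of that pattern follows. It is not the code in checknan.cpp; `FieldReport`, `check_field`, and the `floor_value` parameter are hypothetical names chosen for illustration, and the sketch relies on `std::isfinite`, which is only dependable in builds without fast-math (such as the Debug build used here).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of one scan over a field (not a UWLCM type).
struct FieldReport {
    std::size_t negatives = 0;   // finite entries below zero
    std::size_t not_finite = 0;  // NaN or +/-Inf entries
    float most_negative = 0.0f;  // smallest negative value seen, 0 if none
};

// Hypothetical check: counts negative and non-finite entries and, if
// requested, replaces each negative entry with floor_value.
inline FieldReport check_field(std::vector<float>& field,
                               float floor_value,
                               bool clamp_negatives) {
    FieldReport report;
    for (float& v : field) {
        if (!std::isfinite(v)) {
            ++report.not_finite;  // left untouched so the caller can abort
            continue;
        }
        if (v < 0.0f) {
            ++report.negatives;
            if (v < report.most_negative) report.most_negative = v;
            if (clamp_negatives) v = floor_value;
        }
    }
    return report;
}
```

A caller could print a warning when `negatives` is nonzero and stop the run when `not_finite` is nonzero, which mirrors the sequence of messages in the log above.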

pdziekan commented 1 year ago

Fixed by #158

igfuw / UWLCM

blk1m all NaN after first timestep #157