NaN values when running train2D.py

Hello, thank you for your excellent work on ppnn. I’m facing an issue during the training process.

Regarding NaN values: I've been generating 2D Burgers equation training data using 'src/operators.py' with rhs.burgers2Dpu and rhs.burgers2Dpv. When running train2D.py, I encountered NaN values during the evaluation phase (L230–L245 of train2D.py). I observed that both the u and pdeuu values diverge, which leads to NaNs after about 15–20 timesteps (my simulation uses 100 timesteps). To diagnose the issue, I printed out the maximum values for u1, v1, and the results from mcvter.up() within the training loop. Initially, everything looked fine, but I noticed a rapid increase in values starting from the 10th timestep, eventually causing the output to explode into NaN values by the 16th timestep.

To check the maximum values, I added

    print("u1 max value:", u1.max())
    print("v1 max value:", v1.max())

between L68 and L69 of 'train2D.py', and added

    up_1=mcvter.up(padBC_rd(dt*rhsu(u1,v1,mu,dudx,dudy,d2udx2,d2udy2,ux,uy,dx,dy,dx2,dy2)))
    up_2=mcvter.up(padBC_rd(dt*rhsv(u1,v1,mu,dudx,dudy,d2udx2,d2udy2,ux,uy,dx,dy,dx2,dy2)))
    print("mcvter.up() max value of component 1:", up_1.max())
    print("mcvter.up() max value of component 2:", up_2.max())

between L72 and L73 of 'train2D.py'.
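The same checks can be wrapped in a small helper so that the loop stops at the first non-finite value instead of printing NaNs for the remaining steps. This is only a sketch: `MaxTracker` and its method names are hypothetical and not part of ppnn, and it expects plain Python floats (for a PyTorch tensor, something like `u1.max().item()`). The sample values are stand-ins, not output from a real run.

```python
import math


class MaxTracker:
    """Record per-step maxima and remember the first non-finite one.

    Hypothetical helper for illustration; not part of ppnn.
    """

    def __init__(self):
        self.history = []      # list of (step, name, value)
        self.first_bad = None  # (step, name) of the first NaN/inf, if any

    def record(self, step, name, value):
        """Store one value; return False once a non-finite value has been seen."""
        value = float(value)
        self.history.append((step, name, value))
        if self.first_bad is None and not math.isfinite(value):
            self.first_bad = (step, name)
        return self.first_bad is None


tracker = MaxTracker()
# Stand-in values; in the training loop these would come from the tensors.
samples = [(1, "up_2", 0.0555), (2, "up_2", 0.0538), (3, "up_2", float("nan"))]
for step, name, value in samples:
    if not tracker.record(step, name, value):
        print("first non-finite value:", tracker.first_bad)
        break
```

Calling `record` once per quantity inside the timestep loop would give the exact step and variable where the values first become non-finite.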

And the results were

    u max value: tensor(1.1587, device='cuda:0', grad_fn=)
    1
    u1 max value: tensor(1.0819, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.1423, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.0683, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(0.0555, device='cuda:0', grad_fn=)

    u max value: tensor(1.2204, device='cuda:0', grad_fn=)
    2
    u1 max value: tensor(1.0726, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.1979, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.0683, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(0.0538, device='cuda:0', grad_fn=)

    u max value: tensor(1.2837, device='cuda:0', grad_fn=)
    3
    u1 max value: tensor(1.0656, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.2539, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.0674, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(0.0577, device='cuda:0', grad_fn=)

    u max value: tensor(1.3468, device='cuda:0', grad_fn=)
    4
    u1 max value: tensor(1.0517, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.3099, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.0646, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(0.0631, device='cuda:0', grad_fn=)

    u max value: tensor(1.4092, device='cuda:0', grad_fn=)
    5
    u1 max value: tensor(1.0387, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.3693, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.0624, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(0.0704, device='cuda:0', grad_fn=)

    u max value: tensor(1.4632, device='cuda:0', grad_fn=)
    6
    u1 max value: tensor(1.0301, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.4330, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.0635, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(0.0945, device='cuda:0', grad_fn=)

    u max value: tensor(1.5159, device='cuda:0', grad_fn=)
    7
    u1 max value: tensor(1.0216, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.4907, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.0641, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(0.1737, device='cuda:0', grad_fn=)

    u max value: tensor(1.5727, device='cuda:0', grad_fn=)
    8
    u1 max value: tensor(1.0110, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.5385, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.0863, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(0.4557, device='cuda:0', grad_fn=)

    u max value: tensor(1.6174, device='cuda:0', grad_fn=)
    9
    u1 max value: tensor(1.0047, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.5985, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.1642, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(0.9994, device='cuda:0', grad_fn=)

    u max value: tensor(1.9205, device='cuda:0', grad_fn=)
    10
    u1 max value: tensor(0.9978, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.9188, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.3732, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(2.2415, device='cuda:0', grad_fn=)

    u max value: tensor(2.7137, device='cuda:0', grad_fn=)
    11
    u1 max value: tensor(0.9866, device='cuda:0', grad_fn=)
    v1 max value: tensor(2.6529, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(0.8836, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(3.5568, device='cuda:0', grad_fn=)

    u max value: tensor(2.8847, device='cuda:0', grad_fn=)
    12
    u1 max value: tensor(1.3529, device='cuda:0', grad_fn=)
    v1 max value: tensor(2.8229, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(1.7933, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(12.8743, device='cuda:0', grad_fn=)

    u max value: tensor(8.5635, device='cuda:0', grad_fn=)
    13
    u1 max value: tensor(2.1685, device='cuda:0', grad_fn=)
    v1 max value: tensor(8.1768, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(4.2509, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(48.2277, device='cuda:0', grad_fn=)

    u max value: tensor(39.8739, device='cuda:0', grad_fn=)
    14
    u1 max value: tensor(5.2418, device='cuda:0', grad_fn=)
    v1 max value: tensor(38.7819, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(30.9800, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(1223.2072, device='cuda:0', grad_fn=)

    u max value: tensor(1167.5209, device='cuda:0', grad_fn=)
    15
    u1 max value: tensor(33.9285, device='cuda:0', grad_fn=)
    v1 max value: tensor(1129.5868, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(6117.8599, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(405320.7812, device='cuda:0', grad_fn=)

    u max value: tensor(404588.8438, device='cuda:0', grad_fn=)
    16
    u1 max value: tensor(5538.0645, device='cuda:0', grad_fn=)
    v1 max value: tensor(369550.1562, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(1.8042e+09, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(1.2666e+11, device='cuda:0', grad_fn=)

    u max value: tensor(1.2666e+11, device='cuda:0', grad_fn=)
    17
    u1 max value: tensor(1.7757e+09, device='cuda:0', grad_fn=)
    v1 max value: tensor(1.2490e+11, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(3.5780e+19, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(2.4685e+21, device='cuda:0', grad_fn=)

    u max value: tensor(2.4685e+21, device='cuda:0', grad_fn=)
    18
    u1 max value: tensor(2.9261e+19, device='cuda:0', grad_fn=)
    v1 max value: tensor(2.0184e+21, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(nan, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(nan, device='cuda:0', grad_fn=)

    u max value: tensor(nan, device='cuda:0', grad_fn=)
    19
    u1 max value: tensor(nan, device='cuda:0', grad_fn=)
    v1 max value: tensor(nan, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(nan, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(nan, device='cuda:0', grad_fn=)

    u max value: tensor(nan, device='cuda:0', grad_fn=)
    20
    u1 max value: tensor(nan, device='cuda:0', grad_fn=)
    v1 max value: tensor(nan, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 1: tensor(nan, device='cuda:0', grad_fn=)
    mcvter.up() max value of component 2: tensor(nan, device='cuda:0', grad_fn=)

...
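To pin down where the blow-up starts, the step-to-step growth factor of the logged maxima can be computed. The sketch below uses the `mcvter.up()` component-2 maxima for steps 1 to 17, copied from the log above. The function name `first_growth_step` and the threshold of 1.5 are arbitrary choices for illustration.

```python
# Maxima of mcvter.up() component 2 for steps 1..17, copied from the log above.
up2_max = [
    0.0555, 0.0538, 0.0577, 0.0631, 0.0704, 0.0945, 0.1737, 0.4557, 0.9994,
    2.2415, 3.5568, 12.8743, 48.2277, 1223.2072, 405320.7812, 1.2666e11,
    2.4685e21,
]


def first_growth_step(values, threshold):
    """Return the 1-based step whose value exceeds `threshold` times the
    previous step's value, or None if no step grows that fast."""
    for i in range(1, len(values)):
        if values[i] > threshold * values[i - 1]:
            return i + 1
    return None


step = first_growth_step(up2_max, threshold=1.5)
print("growth factor first exceeds 1.5 at step", step)
```

With these values the component-2 increment first grows by more than a factor of 1.5 at step 7, a few steps before the maximum of v1 itself jumps.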

Could you help me identify why this divergence happens, and what might be missing in my approach? How can I resolve the NaN issue?

Also, I'm curious why pdeuu is not multiplied by dt in L238 of train2D.py. I mean, the code uses

    u_tmp = padBC_rd(model((u[:,:,:-1,:-1]-inmean)/instd, (mutest-mumean)/mustd, (pdeuu[:,:,:-1,:-1]-pdemean)/pdestd)*outstd + outmean)\
    - u + pdeuu

instead of

    u_tmp = padBC_rd(model((u[:,:,:-1,:-1]-inmean)/instd, (mutest-mumean)/mustd, (pdeuu[:,:,:-1,:-1]-pdemean)/pdestd)*outstd + outmean)\
    - u + dt * pdeuu

Your paper’s schematic diagram (Fig. 1.b) specifies multiplying pdeuu by dt, but this isn't reflected in the code.
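To make the question concrete, here is a minimal sketch of the two variants on a scalar toy problem. It assumes that pdeuu in L238 is built the same way as up_1 and up_2 in my debugging snippet, i.e. with dt already multiplied in before upsampling; I have not confirmed that this is the case. The names `rhs`, `increment`, `u_as_coded` and `u_with_extra_dt` are hypothetical, and the network correction term is left out.

```python
dt = 0.01
u = 1.0


def rhs(u):
    """Toy right-hand side du/dt = -u, standing in for the Burgers RHS."""
    return -u


# Increment built like up_1/up_2 in the debugging snippet: dt is applied inside.
increment = dt * rhs(u)

# Variant without an extra dt: a plain forward-Euler step.
u_as_coded = u + increment

# Variant with an extra dt: dt enters twice, so the update scales with dt**2.
u_with_extra_dt = u + dt * increment

print(u_as_coded, u_with_extra_dt)
```

If that assumption holds, an extra factor of dt would only be needed if the term in Fig. 1.b denotes the raw right-hand side rather than the increment.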

