jx-wang-s-group / ppnn

PDE Preserved Neural Network
MIT License

Nan values when running train2D.py #2

Open wowonyang0 opened 1 month ago

wowonyang0 commented 1 month ago

Hello, thank you for your excellent work on ppnn. I’m facing an issue during the training process.

  1. Regarding NaN values: I've been generating 2D Burgers equation training data using 'src/operators.py' with rhs.burgers2Dpu and rhs.burgers2Dpv. When running train2D.py, I encountered NaN values during the evaluation phase (L230–L245 of train2D.py). I observed that both the u and pdeuu values diverge, which leads to NaNs after about 15–20 timesteps (my simulation uses 100 timesteps). To diagnose the issue, I printed out the maximum values for u1, v1, and the results from mcvter.up() within the training loop. Initially, everything looked fine, but I noticed a rapid increase in values starting from the 10th timestep, eventually causing the output to explode into NaN values by the 16th timestep.

To check the maximum values, I added

    print("u1 max value:", u1.max())
    print("v1 max value:", v1.max())

between L68 and L69 of 'train2D.py', and added

    up_1 = mcvter.up(padBC_rd(dt*rhsu(u1,v1,mu,dudx,dudy,d2udx2,d2udy2,ux,uy,dx,dy,dx2,dy2)))
    up_2 = mcvter.up(padBC_rd(dt*rhsv(u1,v1,mu,dudx,dudy,d2udx2,d2udy2,ux,uy,dx,dy,dx2,dy2)))
    print("mcvter.up() max value of component 1:", up_1.max())
    print("mcvter.up() max value of component 2:", up_2.max())

between L72 and L73 of 'train2D.py'.

And the results were

(Printed tensor maxima per timestep; all tensors are on device='cuda:0'.)

| timestep | u max value | u1 max value | v1 max value | mcvter.up() max value of component 1 | mcvter.up() max value of component 2 |
|---|---|---|---|---|---|
| 1 | 1.1587 | 1.0819 | 1.1423 | 0.0683 | 0.0555 |
| 2 | 1.2204 | 1.0726 | 1.1979 | 0.0683 | 0.0538 |
| 3 | 1.2837 | 1.0656 | 1.2539 | 0.0674 | 0.0577 |
| 4 | 1.3468 | 1.0517 | 1.3099 | 0.0646 | 0.0631 |
| 5 | 1.4092 | 1.0387 | 1.3693 | 0.0624 | 0.0704 |
| 6 | 1.4632 | 1.0301 | 1.4330 | 0.0635 | 0.0945 |
| 7 | 1.5159 | 1.0216 | 1.4907 | 0.0641 | 0.1737 |
| 8 | 1.5727 | 1.0110 | 1.5385 | 0.0863 | 0.4557 |
| 9 | 1.6174 | 1.0047 | 1.5985 | 0.1642 | 0.9994 |
| 10 | 1.9205 | 0.9978 | 1.9188 | 0.3732 | 2.2415 |
| 11 | 2.7137 | 0.9866 | 2.6529 | 0.8836 | 3.5568 |
| 12 | 2.8847 | 1.3529 | 2.8229 | 1.7933 | 12.8743 |
| 13 | 8.5635 | 2.1685 | 8.1768 | 4.2509 | 48.2277 |
| 14 | 39.8739 | 5.2418 | 38.7819 | 30.9800 | 1223.2072 |
| 15 | 1167.5209 | 33.9285 | 1129.5868 | 6117.8599 | 405320.7812 |
| 16 | 404588.8438 | 5538.0645 | 369550.1562 | 1.8042e+09 | 1.2666e+11 |
| 17 | 1.2666e+11 | 1.7757e+09 | 1.2490e+11 | 3.5780e+19 | 2.4685e+21 |
| 18 | 2.4685e+21 | 2.9261e+19 | 2.0184e+21 | nan | nan |
| 19 | nan | nan | nan | nan | nan |
| 20 | nan | nan | nan | nan | nan |

...
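
The onset of the growth can also be located programmatically. The following is a minimal sketch, assuming the per-step maxima have been collected into a plain Python list; the helper name `first_unstable_step` and the growth threshold of 2.0 are hypothetical choices, not part of the ppnn code. The sample values are the first ten "component 2" maxima reported above.

```python
import math

def first_unstable_step(maxima, growth=2.0):
    """Return the 1-based step at which a value is non-finite or has grown
    by more than `growth` times relative to the previous step, else None."""
    prev = None
    for step, value in enumerate(maxima, start=1):
        if not math.isfinite(value):
            return step  # inf or nan reached
        if prev is not None and prev > 0 and value / prev > growth:
            return step  # step-to-step growth exceeds the threshold
        prev = value
    return None

# First ten mcvter.up() maxima of component 2 from the results above
up_2 = [0.0555, 0.0538, 0.0577, 0.0631, 0.0704,
        0.0945, 0.1737, 0.4557, 0.9994, 2.2415]
print(first_unstable_step(up_2))  # 8
```

Note that `.max()` only reports the largest positive value; tracking the maximum of the absolute value would also catch large negative excursions.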

Could you help me identify why this divergence happens, and what might be missing in my approach? How can I resolve the NaN issue?

  2. Also, I'm curious why pdeuu is not multiplied by dt in L238 of train2D.py. That is, the code reads

`u_tmp = padBC_rd(model((u[:,:,:-1,:-1]-inmean)/instd, (mutest-mumean)/mustd, (pdeuu[:,:,:-1,:-1]-pdemean)/pdestd)*outstd + outmean) + u + pdeuu`

instead of

`u_tmp = padBC_rd(model((u[:,:,:-1,:-1]-inmean)/instd, (mutest-mumean)/mustd, (pdeuu[:,:,:-1,:-1]-pdemean)/pdestd)*outstd + outmean) + u + dt * pdeuu`

Your paper’s schematic diagram (Fig. 1.b) specifies multiplying pdeuu by dt, but this isn't reflected in the code.

Let me know if you'd like any further adjustments!

Xin-yang-Liu commented 2 weeks ago

Thanks for bringing this up. Unfortunately, I don't have enough time to work on this, but I am happy to provide some suggestions.

  1. The blow-up seems to be caused by numerical instability. I would suggest checking the CFL number and then reducing dt or using a coarser mesh. If you were using a configuration file provided in this repo, could you please let me know which yaml file it was?
  2. dt is already multiplied inside the function pde_du: https://github.com/jx-wang-s-group/ppnn/blob/61800c9f9a6f268c7ce17f127eda40a84135a808/src/train2D.py#L75
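
For the first suggestion, here is a minimal sketch of such a check, assuming an explicit time step for the 2D viscous Burgers equation on a uniform grid. The function names, the limits of 1.0 and 0.5, and the example numbers are illustrative assumptions rather than values taken from this repo; the actual stability limit depends on the discretization used in `src/operators.py`.

```python
def stability_numbers(u_max, v_max, mu, dt, dx, dy):
    """Return (advective CFL number, diffusion number) for one explicit step."""
    cfl = dt * (abs(u_max) / dx + abs(v_max) / dy)
    diffusion = mu * dt * (1.0 / dx**2 + 1.0 / dy**2)
    return cfl, diffusion

def max_stable_dt(u_max, v_max, mu, dx, dy, cfl_limit=1.0, diff_limit=0.5):
    """Largest dt that keeps both numbers under the assumed limits."""
    adv_rate = abs(u_max) / dx + abs(v_max) / dy
    diff_rate = mu * (1.0 / dx**2 + 1.0 / dy**2)
    dt_adv = cfl_limit / adv_rate if adv_rate > 0 else float("inf")
    dt_diff = diff_limit / diff_rate if diff_rate > 0 else float("inf")
    return min(dt_adv, dt_diff)

# Hypothetical example values (not from any yaml file in the repo)
cfl, diffusion = stability_numbers(u_max=1.0, v_max=1.0, mu=0.01,
                                   dt=0.01, dx=0.1, dy=0.1)
print(f"CFL = {cfl:.3f}, diffusion number = {diffusion:.3f}")
print(f"max stable dt = {max_stable_dt(1.0, 1.0, 0.01, 0.1, 0.1):.3f}")
```

Both numbers scale with dt, and both decrease when dx and dy increase, which is why reducing dt or coarsening the mesh helps. The velocity maxima should be taken from the solution itself, since they grow as the run proceeds.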