igfuw / UWLCM

University of Warsaw Lagrangian Cloud Model
GNU General Public License v3.0

stuck in pressure solver error #92

Open claresinger opened 4 years ago

claresinger commented 4 years ago

I fairly often get a "stuck in pressure solver" error (message below). When I run the same simulation with a different random seed each time, it happens about once every 20 runs. Do you know why this might happen? Could it be a glitch on the HPC system I'm using rather than a bug in the code?

terminate called after throwing an instance of 'std::runtime_error'
  what():  stuck in pressure solver
SIGABRT: abort
PC=0x473e4b m=0 sigcode=0
pdziekan commented 4 years ago

Does this happen every time for a given random seed?

It seems that under some rare conditions the pressure solver has trouble finding a solution. If it were a bug in, e.g., the boundary conditions, I don't see why it would depend on the seed.

To make it easier for the pressure solver, you can try to:

This runtime_error is thrown when the pressure solver needs more than 10000 iterations. You could try increasing the iteration limit, which is hardcoded in libmpdata++/solvers/detail/mpdata_rhs_vip_prs_common.hpp.
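
As a rough illustration of the mechanism described above (not the libmpdata++ code itself), an iterative solver with a hard iteration cap can be sketched in Python as follows. The function name `solve_with_cap`, its parameters, the tolerance, and the toy fixed-point problem are all hypothetical; only the cap of 10000 and the error message come from this thread.

```python
def solve_with_cap(step, residual, x0, tol=1e-5, max_iter=10000):
    """Apply `step` repeatedly until `residual` drops below `tol`.

    Raises RuntimeError if the cap `max_iter` is reached first, which is
    the situation reported in this issue.
    """
    x = x0
    for it in range(1, max_iter + 1):
        x = step(x)
        if residual(x) < tol:
            return x, it  # converged after `it` iterations
    raise RuntimeError("stuck in pressure solver")


# Toy problem (an assumption for illustration): the fixed-point
# iteration x -> 0.5 * x + 1 converges to x = 2.
x, iters = solve_with_cap(
    lambda v: 0.5 * v + 1.0,   # one iteration of the toy scheme
    lambda v: abs(v - 2.0),    # distance from the known solution
    0.0,
)
print(x, iters)
```

Raising the cap only helps if the iteration is converging slowly; if the residual stalls or grows, a larger limit just delays the same error.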

trontrytel commented 4 years ago

I also just got stuck in the pressure solver. I was running dycoms 2D with rng seed = 42. I'll try again now with the same setup to see whether it's deterministic or random.

trontrytel commented 4 years ago

The command below got stuck 4 times on 2 different GPU nodes.

@pdziekan - could you check whether it also gets stuck on your machine? If so, rng_seed=42 is a good candidate to debug from.

import os

case = "dycoms_rf02"
nx = "129"
ny = "0"
nz = "301"
dt = "1"
nt = "21600"
spinup = "3600"
outfreq = "3600"
backend = "CUDA"

outdir = "out_test_lgrngn"
rng_seed = "42"

micro = "lgrngn"
sd_conc = "40"
sstp_cond = "10"
sstp_coal = "10"

# Assemble the full command line; --ny=0 selects the 2D setup
cmd = "OMP_NUM_THREADS=1 ./src/bicycles --outdir="+outdir+" --case="+case+\
      " --nx="+nx+" --ny=0 --nz="+nz+" --dt="+dt+" --spinup="+spinup+\
      " --nt="+nt+" --micro="+micro+" --outfreq="+outfreq+\
      " --backend="+backend+" --rng_seed="+rng_seed+" --sd_conc="+sd_conc+\
      " --sstp_cond="+sstp_cond+" --sstp_coal="+sstp_coal

print("running " + cmd)
os.system(cmd)
trontrytel commented 4 years ago

The same command with rng_seed = 44 does not get stuck.
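
To check whether the failure is tied to particular seeds, the runs can be scripted. The following is a minimal sketch, assuming the `bicycles` flags shown in the command above and assuming the "stuck in pressure solver" message is written to stderr. The helpers `build_cmd`, `run_once`, and `sweep` and the per-seed output directory name are hypothetical.

```python
import os
import subprocess


def build_cmd(rng_seed, outdir_prefix="out_test_lgrngn_"):
    """Build the argument list for one run; flags mirror the command above."""
    return [
        "./src/bicycles", "--outdir=" + outdir_prefix + str(rng_seed),
        "--case=dycoms_rf02", "--nx=129", "--ny=0", "--nz=301", "--dt=1",
        "--spinup=3600", "--nt=21600", "--micro=lgrngn", "--outfreq=3600",
        "--backend=CUDA", "--rng_seed=" + str(rng_seed), "--sd_conc=40",
        "--sstp_cond=10", "--sstp_coal=10",
    ]


def run_once(cmd):
    """Run one simulation and return (return code, captured stderr)."""
    env = dict(os.environ, OMP_NUM_THREADS="1")
    proc = subprocess.run(cmd, env=env, stdout=subprocess.DEVNULL,
                          stderr=subprocess.PIPE, universal_newlines=True)
    return proc.returncode, proc.stderr


def sweep(seeds, runner=run_once):
    """Map each seed to 'ok', 'pressure solver' or 'other failure'."""
    results = {}
    for seed in seeds:
        code, err = runner(build_cmd(seed))
        if code == 0:
            results[seed] = "ok"
        elif "stuck in pressure solver" in err:
            results[seed] = "pressure solver"
        else:
            results[seed] = "other failure"
    return results


# Example call (needs a built ./src/bicycles): print(sweep([42, 44]))
```

Running the sweep more than once over the same seeds would show whether a given seed fails every time or only occasionally.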

trontrytel commented 4 years ago

Not sure if it's the same issue. This combination gets stuck after time step 9000, but I don't get any error from the pressure solver.

case = "dycoms_rf02"                                                           
nx = "129"                                                                     
ny = "0"                                                                       
nz = "301"                                                                     
dt = "1"                                                                       
nt = "25200"                                                                   
spinup = "3600"                                                                
outfreq = "900"                                                                
backend = "CUDA"                                                               

rng_seed = "48"                                                                
outdir = "out_test_lgrngn_"+rng_seed                                           

micro = "lgrngn"                                                               
sd_conc = "512"                                                                
sstp_cond = "10"                                                               
sstp_coal = "10"                                                               

cmd = "OMP_NUM_THREADS=1 ./src/bicycles --outdir="+outdir+" --case="+case+\    
      " --nx="+nx+" --ny=0 --nz="+nz+" --dt="+dt+" --spinup="+spinup+\         
      " --nt="+nt+" --micro="+micro+" --outfreq="+outfreq+\                    
      " --backend="+backend+" --rng_seed="+rng_seed+" --sd_conc="+sd_conc+\    
      " --sstp_cond="+sstp_cond+" --sstp_coal="+sstp_coal+\                    
      " --gccn=1"
trontrytel commented 4 years ago

The same happens with rng_seed=13.
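
Since these runs hang without raising an error, a wall-clock timeout is one way to separate a hang from a crash when scripting many runs. This is only a sketch: the helper `run_with_timeout` is hypothetical, and the demonstration uses a stand-in process that sleeps rather than the model itself.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Return 'finished', 'failed' or 'hung' for a single run of `cmd`."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it has not exited within timeout_s seconds.
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hung"
    return "finished" if proc.returncode == 0 else "failed"


# Stand-in for a stuck simulation: a child process that sleeps for 5 s.
sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(sleeper, timeout_s=0.5))  # prints "hung"
```

A timeout cannot tell a genuinely stuck run from a slow one, so the limit would need to be set well above the expected run time for the chosen nt and sd_conc.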