lanl / palm_lanl

LANL Contributions to PArallelized Large-eddy simulation Model (PALM)

restart error 129 #9

Closed cbegeman closed 5 years ago

cbegeman commented 5 years ago

I noticed that somewhere along the line we broke the restart capability. The error is generated at https://github.com/xylar/palm_les_lanl/blob/74b332fd5bd95b45efbca99b17b35ee1b8230805/trunk/SOURCE/netcdf_interface_mod.f90#L2462

I went back to @vanroekel 's old "palm_les_updates" version and verified that restart worked there. It did run successfully, but with a different warning: "errors in local file ENVPAR: some variables for steering may not be properly set".

Do any of you know of a version of the code where you had a successful restart? Or do you have ideas about what the source of the issue is? Thanks!

vanroekel commented 5 years ago

@cbegeman thanks for noting this. Do you have an example run path (case and run directory)? I'd be happy to take a look.

cbegeman commented 5 years ago

@vanroekel thanks! check out /lustre/scratch3/turquoise/cbegeman/palm/jobs/test_grizzly I've put the sbatch log file there as well. Let me know if that's enough information for you to go on.

vanroekel commented 5 years ago

Looks like I don't have access to that space, all the way back to your user level (/lustre/scratch3/turquoise/cbegeman).

cbegeman commented 5 years ago

Where should I put the files?


xylar commented 5 years ago

You just need to grant read permission (and execute permission on folders) to either the climate group or the world. For example:

chown -R cbegeman:climate .
chmod -R g+rX .

or

chmod -R go+rX .
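For anyone unfamiliar with the capital `X` mode: it adds execute permission only to directories (and to files that already have an execute bit), so a recursive `go+rX` opens directories for traversal without marking data files executable. A minimal sketch using a hypothetical local `demo` directory (the thread's actual paths are on the Lustre scratch filesystem):

```shell
# Create a sample directory tree, then lock it down and reopen it
# the way suggested above.
mkdir -p demo/jobs
touch demo/jobs/log.txt
chmod -R go-rwx demo        # start with no group/world access
chmod -R go+rX demo         # group/world get read; execute only on dirs
stat -c '%a' demo           # directory becomes 755
stat -c '%a' demo/jobs/log.txt   # file becomes 644, not executable
```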

cbegeman commented 5 years ago

Done. Thanks!


vanroekel commented 5 years ago

@cbegeman an update. I can get restarts to work if I don't do horizontal sections (remove shf*_xy from data_output in the namelist). I'm not yet sure why the slice output is the problem.

cbegeman commented 5 years ago

Thanks @vanroekel. I'll see whether my new dirichlet bc case restarts with that option removed.

vanroekel commented 5 years ago

I think I see what is happening now. When you have a variable like shf*_xy in data_output, it implies a surface variable, but the model still requires a section to be defined; see https://github.com/xylar/palm_les_lanl/blob/master/trunk/SOURCE/netcdf_interface_mod.f90#L1970-L1978. When you don't define a section, the code returns early and the NetCDF header becomes ill-defined. I think there are two solutions:

  1. Add something like

     section_xy = 1,

     to your namelist file in the runtime parameters section. For * variables the value doesn't matter, since the surface is output either way; for other variables it chooses the vertical position of the slice.

  2. We could loop through the requested variables and, if they are only surface-based, skip the vertical coordinate definitions in the referenced section.

I would pretty strongly suggest sticking with option 1 and closing this issue. But please do let me know what you think @cbegeman and @xylar. Also pinging @qingli411 and @lconlon for their thoughts.
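For reference, option 1 amounts to a one-line namelist addition; a minimal hypothetical fragment (parameter names as used elsewhere in this thread, values illustrative only):

```fortran
&runtime_parameters
        ! Define an xy cross-section index so the NetCDF header is fully set up.
        ! For surface (*) variables like shf*_xy the index value is ignored;
        ! for volume variables it selects the vertical level of the slice.
        section_xy = 1,

        data_output = 'shf*_xy', 'pt', /
```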

vanroekel commented 5 years ago

Note: I ran a case with your namelist file using option 1 above and the restart worked fine.

cbegeman commented 5 years ago

Thanks, @vanroekel. Option 1 sounds like the most straightforward solution to me.

cbegeman commented 5 years ago

@vanroekel, I've set section_xy = 1 in that namelist file, leaving shf*_xy, and I get a new error in combine_plot_fields. Have you encountered this? Can you share the namelist file that you used to get a successful run?

 NetCDF output enabled
 XY-section:            64  file(s) found

forrtl: severe (67): input statement requires too much data, unit 110, file /lustre/scratch3/turquoise/cbegeman/palm/jobs/test_restart_1/RUN_ifort.grizzly_hdf5_mpirun_test_oceanml/PLOT2D_XY_000000
Image              PC                Routine            Line     Source
combine_plot_fiel  000000000041D14E  for__io_return     Unknown  Unknown
combine_plot_fiel  000000000043F571  for_read_seq_xmit  Unknown  Unknown
combine_plot_fiel  000000000040A918  Unknown            Unknown  Unknown
combine_plot_fiel  0000000000408FAE  Unknown            Unknown  Unknown
libc-2.17.so       00002AE5586A73D5  __libc_start_main  Unknown  Unknown
combine_plot_fiel  0000000000408EA9  Unknown            Unknown  Unknown

vanroekel commented 5 years ago

I haven't seen that, but it looks like an error in combine_plot_fields, not the model itself. Either way, here is my file. Nothing jumps out at me as different. My only suggestion is to try fewer processors: your domain is 32x32x32 and you are using 64 processors. I've had issues using many processors for a small domain.

&initialization_parameters
        nx = 63, ny = 63, nz=64,
        dx = 2.5, dy = 2.5, dz = 2.5,

        fft_method = 'temperton-algorithm',

        ocean = .T.,
        idealized_diurnal = .T.,

        linear_eqnOfState = .FALSE.
        rho_ref = 1000.0
        fixed_alpha = .TRUE.
        alpha_const = 2.0E-4
        beta_const = 8.0E-4 
        pt_ref = 15.0
        sa_ref = 35.0

        loop_optimization = 'vector',

        initializing_actions = 'read_restart_data'

        latitude = 55.6,

        momentum_advec = 'pw-scheme',
        scalar_advec = 'pw-scheme', 

        ug_surface =0.0, vg_surface = 0.0,
        pt_surface                 = 276.74,
        pt_vertical_gradient       = -54.,-0.5,
        pt_vertical_gradient_level = -44.,-52.,
        sa_surface                 = 7.65,
        sa_vertical_gradient       = -70.0,-18.0,
        sa_vertical_gradient_level = -44.,-53.,

        use_top_fluxes= .T.,
        use_surface_fluxes = .F.,
        constant_flux_layer= .F.,

        top_momentumflux_u = 0.0,
        top_momentumflux_v = 0.0,

        top_heatflux = 0., 
        top_salinityflux = 0.0,

        bc_uv_b = 'neumann', bc_uv_t = 'neumann', 
        bc_pt_b = 'neumann', bc_pt_t = 'neumann',
        bc_p_b  = 'neumann', bc_p_t  = 'neumann',
        bc_s_b  = 'initial_gradient', bc_s_t  = 'neumann',
        bc_sa_t = 'neumann', /

&runtime_parameters
        end_time = 120000.0,
        create_disturbances = .T.,
        disturbance_energy_limit = 1.0e-2,
!        disturbance_level_b = -4.,
        dt_disturb = 150.,
        dt_run_control = 0.0,
        dt_data_output = 600.0,
        dt_dopr = 600.0,
        dt_data_output_av = 600.,
        section_xy = 1,

        netcdf_data_format = 3,

        data_output = 'shf*_xy', 'e', 'pt', 'sa', 'u', 'v', 'w', 'rho_ocean', 'alpha_T', 'solar3d', 

        data_output_pr = 'e','e*', '#pt', '#sa', 'p', 'hyp', 'km', 'kh', 'l', 
              '#u','#v','w','prho','w"u"','w*u*','w"v"','w*v*','w"pt"','w*pt*',
xylar commented 5 years ago

@cbegeman, typically we would leave the issue open until the PR to fix it has been merged. It's also customary to say in the issue that it was fixed by a PR, in this case #11.

cbegeman commented 5 years ago

@xylar got it, thanks.

xylar commented 5 years ago

No problem! I know how satisfying it can be to close an issue as fixed so I'm sorry to take that away from you ;-)

vanroekel commented 5 years ago

addressed by #11