mabarnes / moment_kinetics


I/O Error when running examples/fokker-planck-1D2V/fokker-planck-1D2V-even_nz-shorttest-nstep200.toml on 2 nodes #215

Open · mrhardman opened this issue 1 month ago

mrhardman commented 1 month ago

When running examples/fokker-planck-1D2V/fokker-planck-1D2V-even_nz-shorttest-nstep200.toml on 128 cores across two nodes, with z_nelement_local = 1, I find the following error.

Base.IOError on process 0:
IOError: stat(RawFD(78)): Unknown system error -116 (Unknown system error -116)
Stacktrace:
  [1] uv_error
    @ Base ./libuv.jl:100 [inlined]
  [2] stat(fd::RawFD)
    @ Base.Filesystem ./stat.jl:152
  [3] stat
    @ Base.Filesystem ./filesystem.jl:281 [inlined]
  [4] close(lock::FileWatching.Pidfile.LockMonitor)
    @ FileWatching.Pidfile /work/admin/easybuild/software/Julia/1.10.0-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:336
  [5] mkpidlock(f::Pkg.Types.var"#51#54"{String, String, Dates.DateTime, String}, at::String, pid::Int32; kwopts::@Kwargs{stale_age::Int64})
    @ FileWatching.Pidfile /work/admin/easybuild/software/Julia/1.10.0-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:95
  [6] mkpidlock
    @ FileWatching.Pidfile /work/admin/easybuild/software/Julia/1.10.0-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:90 [inlined]
  [7] mkpidlock
    @ FileWatching.Pidfile /work/admin/easybuild/software/Julia/1.10.0-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:88 [inlined]
  [8] write_env_usage(source_file::String, usage_filepath::String)
    @ Pkg.Types /work/admin/easybuild/software/Julia/1.10.0-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/Types.jl:539
  [9] Pkg.Types.EnvCache(env::Nothing)
    @ Pkg.Types /work/admin/easybuild/software/Julia/1.10.0-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/Types.jl:377
 [10] EnvCache
    @ Pkg.Types /work/admin/easybuild/software/Julia/1.10.0-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/Types.jl:356 [inlined]
 [11] dependencies
    @ Pkg.API /work/admin/easybuild/software/Julia/1.10.0-linux-x86_64/share/julia/stdlib/v1.10/Pkg/src/API.jl:85 [inlined]
 [12] macro expansion
    @ moment_kinetics.file_io /*/excalibur/moment_kinetics_newgeo/moment_kinetics/src/file_io.jl:370 [inlined]
 [13] macro expansion
    @ moment_kinetics.file_io /*/excalibur/moment_kinetics_newgeo/moment_kinetics/src/looping.jl:803 [inlined]
 [14] write_provenance_tracking_info!(fid::HDF5.File, parallel_io::Bool, run_id::String, restart_time_index::Int64, input_dict::Dict{String, Any}, previous_runs_info::Nothing)
    @ moment_kinetics.file_io /*/excalibur/moment_kinetics_newgeo/moment_kinetics/src/file_io.jl:317
 [15] macro expansion
    @ moment_kinetics.file_io /*/excalibur/moment_kinetics_newgeo/moment_kinetics/src/file_io.jl:1273 [inlined]
 [16] macro expansion
    @ moment_kinetics.file_io /*/excalibur/moment_kinetics_newgeo/moment_kinetics/src/looping.jl:803 [inlined]
 [17] setup_dfns_io(prefix::String, binary_format::moment_kinetics.input_structs.binary_format_type, boundary_distributions::moment_kinetics.moment_kinetics_structs.boundary_distributions_struct, r::moment_kinetics.coordinates.coordinate{Vector{Float64}}, z::moment_kinetics.coordinates.coordinate{Vector{Float64}}, vperp::moment_kinetics.coordinates.coordinate{Vector{Float64}}, vpa::moment_kinetics.coordinates.coordinate{Vector{Float64}}, vzeta::moment_kinetics.coordinates.coordinate{Vector{Float64}}, vr::moment_kinetics.coordinates.coordinate{Vector{Float64}}, vz::moment_kinetics.coordinates.coordinate{Vector{Float64}}, composition::moment_kinetics.input_structs.species_composition, collisions::moment_kinetics.input_structs.collisions_input, evolve_density::Bool, evolve_upar::Bool, evolve_ppar::Bool, external_source_settings::@NamedTuple{ion::@NamedTuple{source_T::Float64, active::Bool, sink_strength::Float64, r_relative_minimum::Float64, source_vperp0::Float64, recycling_controller_fraction::Float64, source_strength::Float64, source_vpa0::Float64, PI_density_target_r_relative_minimum::Float64, sink_vth::Float64, PI_density_target_r_profile::String, z_width::Float64, source_type::String, PI_density_target_r_width::Float64, source_v0::Float64, PI_density_controller_I::Float64, r_profile::String, z_profile::String, PI_density_target_z_relative_minimum::Float64, PI_density_controller_P::Float64, PI_density_target_amplitude::Float64, z_relative_minimum::Float64, PI_density_target_z_width::Float64, r_width::Float64, PI_density_target_z_profile::String, source_n::Float64, r_amplitude::Vector{Float64}, z_amplitude::Vector{Float64}, PI_density_target::Nothing, PI_controller_amplitude::Nothing, controller_source_profile::Nothing, PI_density_target_ir::Nothing, PI_density_target_iz::Nothing, PI_density_target_rank::Nothing}, neutral::@NamedTuple{source_T::Float64, active::Bool, sink_strength::Float64, r_relative_minimum::Float64, source_vperp0::Float64, recycling_controller_fraction::Float64, source_strength::Float64, source_vpa0::Float64, PI_density_target_r_relative_minimum::Float64, sink_vth::Float64, PI_density_target_r_profile::String, z_width::Float64, source_type::String, PI_density_target_r_width::Float64, source_v0::Float64, PI_density_controller_I::Float64, r_profile::String, z_profile::String, PI_density_target_z_relative_minimum::Float64, PI_density_controller_P::Float64, PI_density_target_amplitude::Float64, z_relative_minimum::Float64, PI_density_target_z_width::Float64, r_width::Float64, PI_density_target_z_profile::String, source_n::Float64, r_amplitude::Vector{Float64}, z_amplitude::Vector{Float64}, PI_density_target::Nothing, PI_controller_amplitude::Nothing, controller_source_profile::Nothing, PI_density_target_ir::Nothing, PI_density_target_iz::Nothing, PI_density_target_rank::Nothing}}, input_dict::Dict{String, Any}, parallel_io::Bool, io_comm::MPI.Comm, run_id::String, restart_time_index::Int64, previous_runs_info::Nothing, time_for_setup::Float64)
    @ moment_kinetics.file_io /*/excalibur/moment_kinetics_newgeo/moment_kinetics/src/file_io.jl:1257
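
The trace shows the failure originating in Pkg.dependencies(), which write_provenance_tracking_info! reaches via file_io.jl:370 in order to record package versions; inside Pkg, that call takes a pidfile lock on ~/.julia/logs/manifest_usage.toml, and it is the stat() on that lock's file descriptor that raises error -116 (on Linux, errno 116 is ESTALE, "Stale file handle", which would be consistent with the pidfile living on an NFS-mounted home directory). A minimal standalone sketch, assuming MPI.jl is available, that exercises the same Pkg call from several ranks at once (this is not moment_kinetics code, just a way to test whether the fault lies in Pkg/pidfile handling rather than in moment_kinetics itself):

# Launch with the same launcher and node count as the failing run,
# e.g. srun/mpirun with 128 ranks over 2 nodes.
using MPI
using Pkg

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

deps = Pkg.dependencies()   # the same call reached from write_provenance_tracking_info!
println("rank $rank: found $(length(deps)) dependencies")

MPI.Barrier(comm)
MPI.Finalize()
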
mrhardman commented 1 month ago

It might be worth noting that this bug was found using the serial HDF5 I/O, where an HDF5 file is created for each core in the job (I am not using the system HDF5). With the same options on one node (still with 64 cores), my setup appears to use a single HDF5 output file. I suspect that if there is a problem, it lies in some undefined behaviour related to not using parallel HDF5 output. @johnomotani Do you ever test the serial HDF5 behaviour these days?
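
As a quick sanity check of which I/O mode a given run actually used, one can count the HDF5 files in the run directory (the directory name and the ".h5" suffix below are assumptions about the local setup, not moment_kinetics API):

# Serial I/O writes one output file per process, parallel I/O writes a single
# shared file, so the file count distinguishes the two modes.
run_dir = "runs/fokker-planck-1D2V-even_nz-shorttest-nstep200"   # assumed location
h5files = filter(name -> endswith(name, ".h5"), readdir(run_dir))
println(length(h5files), " HDF5 output file(s)")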

mrhardman commented 1 month ago

After confirming that using parallel HDF5 is not trivial on my HPC system (it introduces new I/O errors, suggesting library mismatches), I found some evidence that the problem might have a different cause:

When running in debug mode (level 2), the only intelligible error printed is:

┌ Warning: attempting to remove probably stale pidfile
│   path = "*/.julia/logs/manifest_usage.toml.pid"
└ @ FileWatching.Pidfile /*/Julia/1.10.0-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:273

This looks similar to the errors reported in https://github.com/JuliaLang/julia/issues/51983.
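
For reference, the warning comes from FileWatching.Pidfile's stale-lock handling: mkpidlock removes a pidfile it judges stale before taking the lock, and it is the close of the resulting LockMonitor (frame [4] in the trace above) where the failing stat() happens. A small standalone illustration of the same stdlib API, with an arbitrary path and stale_age rather than the values Pkg uses:

using FileWatching.Pidfile

# Take a pidfile lock, treating any existing lock file older than 10 seconds
# as stale; the do-block runs while the lock is held, and the lock is released
# (and the pidfile removed) when the block returns.
mkpidlock("/tmp/pidfile_demo.pid"; stale_age=10) do
    sleep(1)   # stand-in for the critical section
end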

Further update: I was able to run this test successfully on 2 nodes without errors, but the success was not reproducible: when I tried to confirm the result, I again got the pidfile errors above. This is consistent with the linked issue report, which suggests that the fault is intermittent.
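
A possible mitigation, sketched here as an assumption rather than an existing moment_kinetics change, would be to query Pkg on one rank only and to tolerate a failure of Pkg's usage logging, so that a stale pidfile degrades the provenance information instead of aborting the run (the function name dependency_info_for_provenance is hypothetical):

using MPI
using Pkg

# Hypothetical guard: only the root rank touches Pkg (and hence the
# manifest_usage.toml pidfile); the result is broadcast to the other ranks,
# and any Pkg failure is downgraded to a warning.
function dependency_info_for_provenance(comm::MPI.Comm)
    info = nothing
    if MPI.Comm_rank(comm) == 0
        info = try
            Pkg.dependencies()
        catch err
            @warn "Pkg.dependencies() failed; provenance info will be incomplete" exception=err
            nothing
        end
    end
    return MPI.bcast(info, comm; root=0)
end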