mrhardman opened 1 month ago
It might be worth noting that this bug was found using the serial HDF5 I/O, where an HDF5 file is created for each core in the job (I am not using the system HDF5). With the same options on a single node (still with 64 cores), my setup appears to produce a single HDF5 output file. I suspect that if there is a problem, it lies in some undefined behaviour related to not using parallel HDF5 output. @johnomotani do you ever test the serial HDF5 behaviour these days?
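For anyone unfamiliar with the distinction, here is a minimal sketch of the two output layouts described above. The file-name scheme is purely illustrative, not moment_kinetics's actual naming convention:

```python
# Hypothetical sketch: serial vs parallel HDF5 output layouts.
# File names here are illustrative only, not the real scheme.
def output_files(n_ranks, parallel_io):
    if parallel_io:
        # Parallel HDF5 (MPI-IO): every rank writes into one shared file.
        return ["run.dfns.h5"]
    # Serial HDF5: each rank opens its own file, so n_ranks files appear.
    return [f"run.dfns.{rank}.h5" for rank in range(n_ranks)]

print(len(output_files(64, parallel_io=False)))  # → 64 files, one per core
print(len(output_files(64, parallel_io=True)))   # → 1 shared file
```

With the serial layout, any step that merges or locks shared state across ranks is a place where behaviour can differ between one node and two.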
After confirming that using parallel HDF5 is not trivial on my HPC system (it introduces new I/O errors, suggesting library mismatches), I found some evidence that the problem might have other causes:
When running in debug mode (level 2), the only intelligible error printed is:
```
┌ Warning: attempting to remove probably stale pidfile
│ path = "*/.julia/logs/manifest_usage.toml.pid"
└ @ FileWatching.Pidfile /*/Julia/1.10.0-linux-x86_64/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:273
```
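For context, the warning comes from a standard stale-pidfile check: the lock file records the PID of its owner, and a later process treats the lock as stale if that PID is no longer alive. A minimal sketch of the general mechanism (not Julia's actual `FileWatching.Pidfile` implementation, which also checks hostnames and file age) looks like this:

```python
import os
import tempfile

def pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

def remove_if_stale(path):
    """Remove the pidfile if its recorded PID is no longer running."""
    try:
        with open(path) as f:
            pid = int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return False
    if not pid_alive(pid):
        os.remove(path)  # the owner is gone: the lock is stale
        return True
    return False

# Demo: a pidfile left behind by a process that no longer exists.
path = os.path.join(tempfile.mkdtemp(), "manifest_usage.toml.pid")
with open(path, "w") as f:
    f.write("999999999\n")  # far above any plausible live PID
print(remove_if_stale(path))  # → True: the stale file is removed
```

The failure mode on a shared filesystem is that this check can misfire across nodes: a PID recorded by node A may look dead (or, worse, coincide with an unrelated process) when node B inspects the same file.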
This looks similar to the errors reported in https://github.com/JuliaLang/julia/issues/51983.
Further update: I was able to run this test on two nodes successfully without errors, but the result was not reproducible: when I tried to confirm it, I got the pidfile errors above again. This is consistent with the linked issue report, which also suggests that the fault is intermittent.
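Since the `manifest_usage.toml.pid` file lives in the Julia depot, one mitigation worth trying (untested here, and the variable layout below is just an assumption about a typical Slurm setup) is to give each job its own writable depot, stacked in front of the shared one, so concurrent jobs do not contend on the same lock file:

```shell
# Hypothetical mitigation: per-job primary depot, stacked ahead of the
# shared read-only depot. JULIA_DEPOT_PATH entries are colon-separated
# and the first entry is the one Julia writes logs/locks into.
export JULIA_DEPOT_PATH="${SCRATCH:-/tmp}/julia-depot-${SLURM_JOB_ID:-$$}:$HOME/.julia"
mkdir -p "${JULIA_DEPOT_PATH%%:*}"
echo "primary depot: ${JULIA_DEPOT_PATH%%:*}"
```

Whether this actually avoids the intermittent pidfile failures on this system would need testing; it only removes the cross-job contention on the shared `logs/` directory.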
When running `examples/fokker-planck-1D2V/fokker-planck-1D2V-even_nz-shorttest-nstep200.toml` on 128 cores across two nodes, with `z_nelement_local = 1`, I find the following error.