trixi-framework / Trixi.jl

Trixi.jl: Adaptive high-order numerical simulations of conservation laws in Julia
https://trixi-framework.github.io/Trixi.jl
MIT License

HDF5 issue with parallel execution on clusters #1109

Open peyvanahmad opened 2 years ago

peyvanahmad commented 2 years ago

Hello,

I am trying to run the Mach 3 step test problem on a cluster using MPI. The program raises an HDF5-related error when the mesh file or solution file is being written to the "out" folder. Sometimes, when I change the number of cores, the simulation runs and the solution files are written just fine, but some time into the simulation the HDF5 error is raised again. Here is a sample of the error that I get:

ERROR: ERROR: LoadError: LoadError: HDF5.API.H5Error: Error getting attribute name libhdf5 Stacktrace: [1] H5Aget_name: Invalid arguments to routine/Inappropriate type not an attribute Stacktrace: [1] macro expansion @ ~/.julia/packages/HDF5/wWr4z/src/api/HDF5.API.H5Error: Error getting attribute name libhdf5 Stacktrace:error.jl:18 [inlined] [2] [1] h5a_get_name(H5Aget_name: Invalid arguments to routine/Inappropriate typeattr_id:: not an attribute

The operating system on the cluster is Red Hat. I am using the openmpi_4.0.0_gcc version of MPI, and the file system is GPFS.

sloede commented 2 years ago

Thanks for reaching out, @peyvanahmad! Could you let me know a little more about the circumstances of when the error occurs:

Finally, it would be helpful if you could post the full error message, or at least the full stacktrace to figure out in which routine the error occurs.

peyvanahmad commented 2 years ago

It happens sporadically when running in parallel. Sometimes the simulation runs through to the end, and sometimes the HDF5 error occurs. When I run on a single compute node, the error appears less frequently. I used 32 ranks on a single compute node (I have attached the complete error file here). The error happened at time step 88,000. I use the HDF5 package that is installed automatically by Julia. I tried running the code with a different MPI library, but openmpi_4.0.0_gcc is the one that is very robust.

-- Ahmad Peyvan, Ph.D. Candidate, Department of Mechanical and Industrial Engineering, University of Illinois at Chicago (UIC)

sloede commented 2 years ago

> I have attached the complete error file here

Unfortunately, I cannot find any attachment. Can you please try again and add it directly to the GitHub issue on the website? Also, it would be great if you could include the elixir you're using and the exact Julia command you used to start it.