ornladios / ADIOS2

Next generation of ADIOS developed in the Exascale Computing Program
https://adios2.readthedocs.io/en/latest/index.html
Apache License 2.0

Writer peer has failed, failing any pending requests #3827

Open halehawk opened 11 months ago

halehawk commented 11 months ago

Recently I got "Writer peer has failed, failing any pending requests" from an adios2 SST put/get operation on one variable on the Derecho cluster. The log is attached. I checked that BP file mode works. In SST mode the other variables work; for this variable the first and second steps work and it fails on the third step. I am sure it is not a memory issue. Could you please tell me whether this is a problem with adios2 or with my application, and how I can fix it? Thanks! muram_io.o1412591.txt

eisenhauer commented 11 months ago

Thanks for the detailed logs, but we're probably going to need a bit more to sort this out. In this case it looks like the reader is behaving as expected, but the writer is exiting early due to some problem. From the log, it looks like the reader has processed 11 steps and has received the metadata for timestep 12, but the writer dies before it can start processing. On the writer side, we're in EndStep for timestep 13 (the "DP Writer ... ProvideTimestep" message is the first verbose message inside SST EndStep on the writer side), but we don't get as far as "Writer ... Sending TimestepMetadata", which happens a bit further in. You say that you are sure it's not a memory issue. Can I ask specifically what you've seen that eliminates that possibility? My first guess here would be that we're running out of memory on the writer side.

The other thing I see, starting at timestep 3, is that some writers are providing no data to the SST runtime. (This is the "ProvideTimestep, registering timestep 3, data (nil), fprint 0" log entry.) Just what's going on there I'm not clear on. It's allowable for an application to only do BeginStep/EndStep with no Put()s in between, on some or even all ranks. If that's what your application is doing, then that would explain those odd entries. If not, it would mean that something is going wrong much earlier than the time when the application is dying (which may be what you mean by "failing on the third step").

BTW, as a side note: SST can increase your application's memory utilization by quite a bit more than a file engine. ADIOS semantics require the data for timestep T to be queued on the writer side until the reader has finished processing that timestep. On the writer side there are a number of engine parameters that control how big that queue is allowed to grow, what happens when we reach the queue limit (blocking or discarding the data), etc. By default there are no limits on queue size, so if the reader is slow and the writer produces a lot of data, the queue can get big, which increases application memory pressure. (This differs from the file engine, where data is always written to disk in EndStep and never remains to take up application memory.) If you suspect that running out of memory is a possibility, then you might try setting the SST Engine parameter "QueueLimit" to 1.
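[Editor's sketch] The queueing behaviour described above can be illustrated with a toy model. This is not ADIOS2 code and does not call the library: `peak_queue_depth`, its tick-based timing and its parameter names are all hypothetical, chosen only to show how an unlimited writer-side queue grows when the reader is slower than the writer, and how a blocking limit caps it.

```python
from collections import deque


def peak_queue_depth(n_steps, writer_period, reader_period, queue_limit=0):
    """Peak number of timesteps held in a toy writer-side queue.

    Hypothetical model: the writer has a new step ready `writer_period`
    ticks after its last successful enqueue; the reader needs
    `reader_period` ticks per step and releases it when done.
    queue_limit=0 means "no limit"; otherwise the writer blocks
    while the queue is full.
    """
    queue = deque()
    produced = 0
    peak = 0
    tick = 0
    next_write = writer_period
    reader_done_at = None
    while produced < n_steps or queue:
        tick += 1
        # Reader side: release the oldest step once it has been processed.
        if reader_done_at is not None and tick >= reader_done_at:
            queue.popleft()
            reader_done_at = None
        # Writer side: enqueue a step if one is ready and there is room.
        if produced < n_steps and tick >= next_write:
            if queue_limit == 0 or len(queue) < queue_limit:
                queue.append(produced)
                produced += 1
                next_write = tick + writer_period
        # An idle reader starts working on the oldest queued step.
        if queue and reader_done_at is None:
            reader_done_at = tick + reader_period
        peak = max(peak, len(queue))
    return peak


unlimited = peak_queue_depth(20, writer_period=1, reader_period=4)
limited = peak_queue_depth(20, writer_period=1, reader_period=4, queue_limit=1)
print(unlimited, limited)
```

In this model a reader four times slower than the writer leaves most of the 20 steps queued at once when there is no limit, while a limit of 1 keeps a single step queued at the cost of stalling the writer.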

halehawk commented 11 months ago

Thank you for helping me with this issue. I already set QueueLimit to 1; you can see that in the log file. Also, there are four routines, including the failing one, that use one third of the ranks to put/get data; BeginStep/EndStep are still called on all ranks. Three of the routines work without problems, but this corona emission routine outputs four 27x1024x1024 variables on one third of the ranks. To rule out memory concerns, I ran only the corona routine, outputting every 100 steps and every 200 steps, and it still gave me the writer peer failure on the 400th step.
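[Editor's sketch] For a sense of scale, the size of four 27x1024x1024 variables can be worked out directly. The element type is not stated in the thread, so both 4-byte and 8-byte elements are shown as assumptions, and it is also not stated whether 27x1024x1024 is a per-rank or a global shape. The helper name `step_bytes` is hypothetical.

```python
def step_bytes(shape, n_vars, itemsize):
    """Bytes needed to hold n_vars arrays of the given shape for one timestep."""
    elements = 1
    for extent in shape:
        elements *= extent
    return elements * n_vars * itemsize


SHAPE = (27, 1024, 1024)  # from the comment above
MIB = 1024 * 1024

for itemsize in (4, 8):  # assumed: single or double precision
    size = step_bytes(SHAPE, n_vars=4, itemsize=itemsize)
    print(f"{itemsize}-byte elements: {size // MIB} MiB per timestep")
```

Under these assumptions one timestep of this output is 432 MiB (4-byte elements) or 864 MiB (8-byte elements), and every additional timestep held in memory adds the same amount again.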


eisenhauer commented 11 months ago

No error messages on the writer side? Seg faults messages? Anything?

halehawk commented 11 months ago

The log includes all the output and error messages already.


eisenhauer commented 11 months ago

I will dig through it when I can.

halehawk commented 11 months ago

Thank you for your help. The problem went away when I ran the job on more nodes (24 instead of 16). So I think you are right: the writers need more memory per node for the queue. But the writers do not report a memory error before they stop working. Is there any way to find out the memory limit for the queue, or can I set it to something?

I have another problem compiling adios2 on derecho. Here is my cmake command:

 cmake -DCMAKE_INSTALL_PREFIX=/glade/derecho/scratch/haiyingx/ADIOS2_derecho/install \
       -DADIOS2_USE_Python=ON \
       -DPython_EXECUTABLE=/glade/work/haiyingx/conda-envs/mpich_sperr_de/bin/python \
       -DFLEX_EXECUTABLE=/usr/bin/flex \
       -DADIOS2_USE_MPI=ON ..

But I always get this kind of error when compiling the Fortran code in adios2:

 nvlink error : Undefined reference to '_adios2_parameters_mod_21' in 'CMakeFiles/TestCommonWrite_f.dir/TestCommonWriteF.F90.o'
 pgacclnk: child process exit status 2: /glade/u/apps/common/23.04/spack/opt/spack/nvhpc/23.1/Linux_x86_64/23.1/compilers/bin/tools/nvdd
 make[2]: *** [testing/adios2/engine/staging-common/CMakeFiles/TestCommonWrite_f.dir/build.make:122: bin/TestCommonWrite_f] Error 2
 make[1]: *** [CMakeFiles/Makefile2:10897: testing/adios2/engine/staging-common/CMakeFiles/TestCommonWrite_f.dir/all] Error 2
 make: *** [Makefile:146: all] Error 2

Do you know how I can solve it?


eisenhauer commented 11 months ago

Glad you have a workaround. Something else that might be useful that we haven't done yet: Currently if you specify QueueLimit=1 and you try to submit another timestep, we block in EndStep. However, if we block in EndStep it happens after all the data buffering has already happened and merely prevents the marshalled data from being added to the queue. That means that we already have two timesteps in memory, even if we've only allowed one of them to be queued. A better solution would be to block in BeginStep() so that we don't proceed to marshal data until we know there's a place to put it. The difficulty with that is that BeginStep() has not traditionally been a collective operation in ADIOS, but this feature would need it to be collective so that every rank comes to the same decision about blocking. So, this is on our to-do-if-we-can-figure-it-out list.
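[Editor's sketch] The difference between blocking in EndStep and blocking in BeginStep can be shown with a small counting model. It is purely illustrative and not ADIOS2 code: `peak_resident_steps`, its `block_in` argument and the tick-based reader are hypothetical, and the BeginStep variant models the proposed behaviour, which the comment above says is not implemented.

```python
def peak_resident_steps(n_steps, reader_period, queue_limit, block_in):
    """Peak number of timesteps held in writer memory in a toy model.

    block_in="EndStep":   data is marshalled first, then waits for queue room.
    block_in="BeginStep": marshalling starts only once the queue has room
                          (the proposed, not implemented, behaviour).
    """
    queued = 0      # steps sitting in the queue
    marshalled = 0  # 0 or 1: step buffered but not yet accepted by the queue
    produced = 0
    peak = 0
    tick = 0
    while produced < n_steps or queued or marshalled:
        tick += 1
        # The reader releases one queued step every `reader_period` ticks.
        if queued and tick % reader_period == 0:
            queued -= 1
        room = queued < queue_limit
        # Start marshalling the next step, unless BeginStep-style blocking
        # holds it back until there is room in the queue.
        if not marshalled and produced < n_steps:
            if block_in == "EndStep" or room:
                marshalled = 1
                produced += 1
        # Hand the marshalled step over to the queue when there is room.
        if marshalled and room:
            queued += 1
            marshalled = 0
        peak = max(peak, queued + marshalled)
    return peak


print(peak_resident_steps(5, 3, 1, "EndStep"))    # two steps resident
print(peak_resident_steps(5, 3, 1, "BeginStep"))  # one step resident
```

With a queue limit of 1 the EndStep variant peaks at two resident timesteps, matching the description above, while the BeginStep variant never holds more than one.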

WRT the fortran problem. I've seen this sort of thing when somehow different fortran compilers were used for different pieces of the code. It looks like you're using NVIDIA compilers, do you specify them explicitly with the FC environment variable? Tagging @anagainaru as possibly being able to help more with the NVIDIA/fortran side.

anagainaru commented 11 months ago

Hmm, I haven't played too much with fortran enabled while compiling with nvcc. I can give it a try. What is your CXX and C compiler?

halehawk commented 11 months ago

I tried both setting CC and not setting CC; both gave me the errors. I want to instruct our scientists to install adios2 and use it, so I really hope you can solve this problem for me. Here is the cmake log. CXX is nvc++ and the Fortran compiler is nvfortran. If I set CC=mpicc, it points to /glade/.../derecho/23.06/spack/opt/spack/ncarcompilers/1.0.0/nvhpc/23.1/rqst/bin/mpi/mpicc

adios.cmake.log

anagainaru commented 11 months ago

Oki thanks, I will try to reproduce this and get back to you.

eisenhauer commented 11 months ago

That there is "mod" in the link error and it's trying to link a fortran executable means that this is fundamentally a fortran problem. The cmake output looks OK. You might try cleaning your build environment and starting with an empty directory just to make sure that this isn't a problem with an object file left over from a previous build (when maybe the environment was incompatible). (As a last resort, if your scientists won't need fortran, you can disable it in the ADIOS build. But obviously better to solve the problem.)

halehawk commented 11 months ago

I did remove the build directory and started with a clean build, but the error still exists. I thought it might be related to the 21 in "_adios2_parameters_mod_21".

eisenhauer commented 11 months ago

No idea where the 21 comes from, but adios2_parameters_mod.F90 is fortran module source. That there's an undefined reference to it probably means something has gone seriously wrong on the fortran side.

halehawk commented 11 months ago

I looked at the gnu-compiled adios2 fortran; the adios2_parameters_mod is built as follows:

 __adios2_parameters_mod_MOD_adios2_null_dims
 000000000001d3f0 T __adios2_parameters_mod_MOD___copy_adios2_parameters_mod_Adios2_adios
 000000000001d3d0 T __adios2_parameters_mod_MOD___copy_adios2_parameters_mod_Adios2_attribute
 000000000001d390 T __adios2_parameters_mod_MOD___copy_adios2_parameters_mod_Adios2_engine
 000000000001d370 T __adios2_parameters_mod_MOD___copy_adios2_parameters_mod_Adios2_io
 000000000001d360 T __adios2_parameters_mod_MOD___copy_adios2_parameters_mod_Adios2_namestruct
 000000000001d300 T __adios2_parameters_mod_MOD___copy_adios2_parameters_mod_Adios2_operator
 000000000001d2e0 T __adios2_parameters_mod_MOD___copy_adios2_parameters_mod_Adios2_variable
 0000000000071800 R __adios2_parameters_mod_MOD___def_init_adios2_parameters_mod_Adios2_adios
 00000000000707e0 R __adios2_parameters_mod_MOD___def_init_adios2_parameters_mod_Adios2_attribute
 0000000000070780 R __adios2_parameters_mod_MOD___def_init_adios2_parameters_mod_Adios2_engine
 0000000000070760 R __adios2_parameters_mod_MOD___def_init_adios2_parameters_mod_Adios2_io
 0000000000070730 R __adios2_parameters_mod_MOD___def_init_adios2_parameters_mod_Adios2_namestruct
 00000000000706a0 R __adios2_parameters_mod_MOD___def_init_adios2_parameters_mod_Adios2_operator
 000000000006f680 R __adios2_parameters_mod_MOD___def_init_adios2_parameters_mod_Adios2_variable
 0000000000282ce0 D __adios2_parameters_mod_MOD___vtab_adios2_parameters_mod_Adios2_adios
 0000000000282ca0 D __adios2_parameters_mod_MOD___vtab_adios2_parameters_mod_Adios2_attribute
 0000000000282c60 D __adios2_parameters_mod_MOD___vtab_adios2_parameters_mod_Adios2_engine
 0000000000282c20 D __adios2_parameters_mod_MOD___vtab_adios2_parameters_mod_Adios2_io
 0000000000282be0 D __adios2_parameters_mod_MOD___vtab_adios2_parameters_mod_Adios2_namestruct
 0000000000282ba0 D __adios2_parameters_mod_MOD___vtab_adios2_parameters_mod_Adios2_operator
 0000000000282b60 D __adios2_parameters_mod_MOD___vtab_adios2_parameters_mod_Adios2_variable

But the nvfortran compiled adios2 fortran is as follows:

 adios2_parametersmod
 0000000000292a40 D _adios2_parameters_mod_21
 00000000002908c0 D _adios2_parameters_mod_8
 0000000000292e10 D adios2_parameters_mod_adios2_adiostd_
 0000000000292c30 D adios2_parameters_mod_adios2_attributetd_
 0000000000292b90 D adios2_parameters_mod_adios2_enginetd_
 0000000000292d70 D adios2_parameters_mod_adios2_iotd_
 0000000000292a50 D adios2_parameters_mod_adios2_namestructtd_
 0000000000292af0 D adios2_parameters_mod_adios2_operatortd_
 0000000000292cd0 D adios2_parameters_mod_adios2_variabletd_

So some definitions are missing in nvfortran compiled adios2 fortran.
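[Editor's sketch] Comparing symbol listings like these by eye is error-prone, so one option is to parse the `nm` text and look up the symbol named in the link error. The helper `defined_symbols` is hypothetical. The sample lines are adapted from the listing above, and the underscore before `21` and `8` is an assumption taken from the spelling in the link error.

```python
def defined_symbols(nm_text):
    """Map symbol name -> nm type letter for every defined symbol.

    Defined symbols are printed as "<address> <type> <name>"; undefined
    ones have no address column, so they split into two fields and are
    skipped here.
    """
    symbols = {}
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] != "U":
            symbols[parts[2]] = parts[1]
    return symbols


# Sample adapted from the nvfortran listing in this thread (assumed spelling).
NVFORTRAN_NM = """\
0000000000292a40 D _adios2_parameters_mod_21
00000000002908c0 D _adios2_parameters_mod_8
0000000000292e10 D adios2_parameters_mod_adios2_adiostd_
                 U some_undefined_placeholder_
"""

syms = defined_symbols(NVFORTRAN_NM)
wanted = "_adios2_parameters_mod_21"  # symbol named in the link error
print(wanted, "->", syms.get(wanted, "not defined in this listing"))
```

Running the same lookup against the full `nm` output of each library and object file on the link line would show which of them, if any, defines the symbol. The last sample line is a made-up placeholder, included only to show that undefined entries are ignored.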

anagainaru commented 11 months ago

I was unable to reproduce this problem. I am using nvhpc 21.9.

 cmake -DADIOS2_USE_Python=ON \
       -DADIOS2_USE_MPI=ON \
       -DADIOS2_USE_Fortran=ON  \
       -DCMAKE_CXX_COMPILER=nvc++  ..

This compiles fine. Could it be the nvhpc version?

halehawk commented 11 months ago

I used nvhpc 23.1 at least. Could you please try this version, since our Fortran code has a lot of problems with nvhpc 21 or 22?

anagainaru commented 11 months ago

It works with nvhpc 22.11, which is the newest version on Summit. I will try to install the latest nvhpc and try again. In the meantime, could you try not building the testing and examples, just to figure out if the problem is with one of the Testing files (add -D BUILD_TESTING=OFF -D ADIOS2_BUILD_EXAMPLES=OFF to cmake), and let me know if you still see the error?

halehawk commented 11 months ago

Testing and examples are only two of several directories that need the adios2_parameters_mod module. The adios2 Fortran MPI build needs this module as well, so it gave me the same error.