Open jhgoebbert opened 1 year ago
Thanks. Probably we need more detail from the variable declarations, any calls to SetShape, etc. If the variable is classified as a GlobalArray, the Shape value presumably was set at some point. The question is what happened to it? Was it reset somehow? Was the variable destroyed? Maybe you can point me to the source somewhere?
Hi @eisenhauer,
thank you for your reply. I have created a small demo -> adios2_segfaultExample.tar.gz showing the segfault based on this SENSEI-miniapp from HERE
You can find
- sensei_oscillator.segfault/srun-7939151.sim, which shows the segfault
- sensei_oscillator.segfault/oscillator.slurm: 2 nodes are running the simulation and 1 node the SENSEI endpoint
- configs/: I tried to reduce the number of "special" settings
Let me know if this is helpful and if any further information would be useful.
Hmm. I haven't looked at Sensei source before. C++ code using the ADIOS C interfaces. That complicates things a bit, and at least from my initial look at Sensei, I don't have a good guess as to what might be going on. The ADIOS change that seems to be implicated here, grabbing the Shape of a variable at EndStep rather than at Put so that it can be changed after the Put, shouldn't have an impact on anything I see, but the logic is complex enough that I can't be completely sure of what is going on just by inspection. Unfortunately that probably means trying to reproduce this somewhere I can either examine a core dump or add additional diagnostics, which may not happen right away. Will do what I can though.
I tried to find the commit which introduced this issue and went back to the 14th of March for now.
BP5Serializer::CollectFinalShapeValues() had not yet been introduced at that point, but it still segfaults, at a different position:
[jwb0021:11225:0:11225] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1892a660)
[jwb0021:11222:0:11222] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x92f74a0)
[jwb0021:11223:0:11223] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe08c370)
[jwb0021:11224:0:11224] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x148c1910)
[jwb0033:23160:0:23160] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14ef6fd0)
[jwb0033:23158:0:23158] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11461ab0)
[jwb0033:23161:0:23161] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1ca7b920)
[jwb0033:23159:0:23159] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x118dcee0)
==== backtrace (tid: 23161) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x00000000000d01f8 __memmove_avx_unaligned_erms() :0
2 0x000000000000e2b4 copy_data_to_tmp() ???:0
3 0x0000000000010bd5 handle_subfield() ffs.c:0
4 0x000000000001067d handle_subfield() ffs.c:0
5 0x00000000000112eb FFSencode_internal() ffs.c:0
6 0x0000000000599b37 adios2::format::BP5Serializer::CloseTimestep() ???:0
7 0x00000000006629cf adios2::core::engine::SstWriter::EndStep() ???:0
8 0x00000000000281ba adios2_end_step() ???:0
9 0x00000000001257fc sensei::ADIOS2AnalysisAdaptor::WriteTimestep() ???:0
10 0x000000000012bab6 sensei::ADIOS2AnalysisAdaptor::Execute() ???:0
11 0x000000000000c6c3 sensei::ConfigurableAnalysis::Execute() ???:0
12 0x000000000044138f bridge::execute() ???:0
13 0x0000000000410042 main() ???:0
14 0x000000000003ad85 __libc_start_main() ???:0
15 0x0000000000410ebe _start() ???:0
I will go back in time a bit further to older commits tomorrow ...
OK, so that puts a different spin on things. I don't recall exactly when we made BP5 the default serializer for SST, but I suspect that if this code worked on a prior version of ADIOS then perhaps it was using the older "bp" marshalling method. You can see if that works by setting the engine parameter "MarshalMethod" to a value of "bp" (even with the newest ADIOS). If it does, that may narrow down the problem.
You are right! With MarshalMethod = BP the segfault is gone.
So you have a workaround for the moment. By and large, BP4 operates on metadata provided to it (shape, start, count arrays) at the moment of Put(), but BP5 gains efficiency through bulk processing in EndStep. In looking at Sensei code, it appears that the metadata arrays are often stack-allocated at the time of Put() and EndStep doesn't appear in the same subroutine, so it's a pretty good guess that somehow the deallocation of those arrays is tied to what's going on. We still have to sort out whether or not this is something that happens only when going through the C bindings or not, and how best to fix it. When you call things like adios_set_selection and adios_set_shape, you pass in the address of metadata arrays in application space, but I'm not sure we're clear on the requirements for how long that metadata should persist, if ADIOS commits to copying it when provided, etc. I think I can work from the Sensei code in ADIOS2Schema.cpp to replicate the issue in some test code to sort out exactly what's going on and where to go from here. It'll likely be a few days though.
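The lifetime concern described above can be illustrated with a toy model. This is a minimal sketch and not ADIOS2 code: the class names `CopyAtPutWriter` and `DeferredWriter` and their `put`/`end_step` methods are hypothetical, invented only to contrast "metadata consumed at Put" with "metadata read at EndStep". The sketch assumes the caller keeps the shape array alive until `end_step()` returns; if the array were a stack variable in a function that had already returned, the deferred read would touch freed memory, which is the kind of failure suspected here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Shape = std::vector<std::size_t>;

// Hypothetical writer that deep-copies the shape at put() time.
// The caller's array may go out of scope right after put() returns.
class CopyAtPutWriter {
public:
    void put(const std::size_t *shape, std::size_t ndims) {
        shapes_.emplace_back(shape, shape + ndims); // copy now
    }
    std::vector<Shape> end_step() {
        std::vector<Shape> out = std::move(shapes_);
        shapes_.clear(); // leave the writer ready for the next step
        return out;
    }

private:
    std::vector<Shape> shapes_;
};

// Hypothetical writer that only remembers the pointer at put() and reads
// the values at end_step(). The caller's array MUST outlive end_step().
class DeferredWriter {
public:
    void put(const std::size_t *shape, std::size_t ndims) {
        pending_.push_back({shape, ndims}); // no copy, pointer only
    }
    std::vector<Shape> end_step() {
        std::vector<Shape> out;
        for (const Pending &p : pending_) {
            out.emplace_back(p.shape, p.shape + p.ndims); // read happens here
        }
        pending_.clear();
        return out;
    }

private:
    struct Pending {
        const std::size_t *shape;
        std::size_t ndims;
    };
    std::vector<Pending> pending_;
};

// Safe usage: the shape array lives in the same scope as end_step().
inline void demo() {
    std::size_t shape[2] = {4, 8};
    CopyAtPutWriter eager;
    DeferredWriter deferred;
    eager.put(shape, 2);
    deferred.put(shape, 2);
    shape[0] = 16; // modified after put(), before end_step()
    assert(eager.end_step()[0][0] == 4);     // sees the value at put()
    assert(deferred.end_step()[0][0] == 16); // sees the value at end_step()
}
```

In this toy model the deferred writer picks up a shape change made after `put()`, which is the benefit of reading late, but it also silently depends on the caller's array still being valid at `end_step()`. Whether and where the real library copies such metadata is exactly the open question in this thread.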
If it were easy to get a core dump file of the original failure in CollectFinalShapeValues and print VB->m_Name, that might help narrow down exactly which usage was problematic...
I have run the example with debug flags and core dumps enabled. Here it segfaults with
==== backtrace (tid: 4005) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x00000000000d0057 __memmove_avx_unaligned_erms() :0
2 0x0000000000ce057b adios2::format::BP5Serializer::CollectFinalShapeValues() /dev/shm/goebbert1/juwelsbooster/ADIOS2/20230620/foss-2022a-debug/ADIOS2-53acb22f0ed88b43a6bd6ca841aa6e1672a1d995/source/adios2/toolkit/format/bp5/BP5Serializer.cpp:1153
3 0x0000000000ce0fb5 adios2::format::BP5Serializer::CloseTimestep() /dev/shm/goebbert1/juwelsbooster/ADIOS2/20230620/foss-2022a-debug/ADIOS2-53acb22f0ed88b43a6bd6ca841aa6e1672a1d995/source/adios2/toolkit/format/bp5/BP5Serializer.cpp:1270
4 0x0000000000dd74ed adios2::core::engine::SstWriter::EndStep() /dev/shm/goebbert1/juwelsbooster/ADIOS2/20230620/foss-2022a-debug/ADIOS2-53acb22f0ed88b43a6bd6ca841aa6e1672a1d995/source/adios2/engine/sst/SstWriter.cpp:308
5 0x00000000000614b4 adios2_end_step() /dev/shm/goebbert1/juwelsbooster/ADIOS2/20230620/foss-2022a-debug/ADIOS2-53acb22f0ed88b43a6bd6ca841aa6e1672a1d995/bindings/C/adios2/c/adios2_c_engine.cpp:563
6 0x00000000002792f8 sensei::ADIOS2AnalysisAdaptor::WriteTimestep() /dev/shm/goebbert1/juwelsbooster/sensei/20230619/foss-2022a-adios2-20230620-catalyst-5.10.1-debug/SENSEI-8f71e07faa43f792ec473fa20c9cb4b183ad3d47/sensei/ADIOS2AnalysisAdaptor.cxx:522
7 0x000000000027591a sensei::ADIOS2AnalysisAdaptor::Execute() /dev/shm/goebbert1/juwelsbooster/sensei/20230619/foss-2022a-adios2-20230620-catalyst-5.10.1-debug/SENSEI-8f71e07faa43f792ec473fa20c9cb4b183ad3d47/sensei/ADIOS2AnalysisAdaptor.cxx:238
8 0x000000000001c987 sensei::ConfigurableAnalysis::Execute() /dev/shm/goebbert1/juwelsbooster/sensei/20230619/foss-2022a-adios2-20230620-catalyst-5.10.1-debug/SENSEI-8f71e07faa43f792ec473fa20c9cb4b183ad3d47/sensei/ConfigurableAnalysis.cxx:1555
9 0x000000000048f949 bridge::execute() /dev/shm/goebbert1/juwelsbooster/sensei/20230619/foss-2022a-adios2-20230620-catalyst-5.10.1-debug/SENSEI-8f71e07faa43f792ec473fa20c9cb4b183ad3d47/miniapps/oscillators/bridge.cpp:70
10 0x0000000000442f31 main() /dev/shm/goebbert1/juwelsbooster/sensei/20230619/foss-2022a-adios2-20230620-catalyst-5.10.1-debug/SENSEI-8f71e07faa43f792ec473fa20c9cb4b183ad3d47/miniapps/oscillators/main.cpp:302
11 0x000000000003ad85 __libc_start_main() ???:0
12 0x0000000000436d2e _start() ???:0
=================================
You can find the whole job-output including the core files here
Well, I thought I could easily recreate what I thought was happening in a simple test and debug the problem. I tried that and so far I've failed. I expect that I'm going to have to build SENSEI and try your examples in order to reproduce, but it might be a few weeks before I'm able to do that. Just FYI...
Thank you for looking into this.
Hi @eisenhauer, were you able to reproduce the problem with SENSEI?
Sorry, got as far as downloading SENSEI last week and then got distracted by a critical demo (and the need to wipe and reinstall my laptop because of an ongoing problem). This is on my list for this week, possibly later today.
Cool, I keep my fingers crossed :)
OK, I've spent enough time on this that I've got it running, but I'm not able to reproduce the problem. Some things to note: I built with SENSEI github master and ADIOS2 github master (which is close enough to 2.9.0 that it shouldn't matter). The first thing I found is that SENSEI had compilation failures with ADIOS 2.9.0 because of the changes in the ADIOS API (elimination of DebugMode in adios2_init()). I edited sensei/ADIOS2AnalysisAdaptor.cxx and sensei/ADIOS2Schema.cxx to eliminate the debug mode parameter and things compiled fine. I skipped the slurm script and instead ran the two clients using MPI on my laptop. I get no segfaults, but I do see some weird behaviour, some of which I can trace to the sensei-transport.xml file. For example, RendezvousReaderCount=0 means that the oscillator can and will produce data that is dropped on the floor until the sensei process shows up. Then QueueFullPolicy=discard also means that even once connected, if the producer is producing data faster than it can be sent or consumed, that data will be discarded. (None of this means that the code where you seemed to be seeing the segfault wouldn't be executed; it would. It's just that the data and metadata block it produced would be discarded.)
I guess the upshot is that I'm at a dead-end. I've tried to reproduce the issue both with and without Sensei without having any luck. I'm wondering a bit about what version of Sensei you might be using since I had to do source-level tweaks just to get it to compile with post-2.9.0 ADIOS. I've seen some anomalies, but nothing that should result in the symptoms that you are seeing. Not quite sure where to go from here...
Hi Greg, this is very surprising. I will go through this again based on your information and come back to you in the next few days.
Thank you for looking into this!
Sorry, it is taking longer than expected to find the time to continue. But I am on it ...
For now I am running the production runs with MarshalMethod = BP.
ADIOS2 (2.9.0 and latest) segfaults in BP5Serializer::CollectFinalShapeValues() when using SST in combination with SENSEI (latest).
I assume this happens at THIS code line.