ornladios / ADIOS2

Next generation of ADIOS developed in the Exascale Computing Program
https://adios2.readthedocs.io/en/latest/index.html
Apache License 2.0

SST Engine Hanging #2315

Closed AaronV77 closed 4 years ago

AaronV77 commented 4 years ago

Describe the bug: We are using the SST engine to communicate between two applications on our Cray HPC system. During setup of the SST reader or writer, one of them hangs and trips the timeout exception in our automation script. I have not yet been able to get an error or a return value from ADIOS2.

To Reproduce: I have not been able to reproduce the error consistently.

Expected behavior: I expect that when a reader in one application and a writer in another application are set up, ADIOS2 will succeed and move on to the next instruction.

Desktop (please complete the following information):

Additional context: Sorry for the lack of information about the bug, but it seems to have started when COVID-19 hit and our HPC system got very busy. I don't know whether network traffic on Onyx is so congested that the connection between a reader and a writer stalls, which would explain the hangs. Lastly, our applications are controlled by an automated script that runs them in 96 separate tests. ADIOS2 is closed between tests, but the same channel and IO names are reused for each test. I don't know whether the way we run our applications with ADIOS2 is causing confusion at a low level and making it hang.

eisenhauer commented 4 years ago

Hi Aaron. We just released ADIOS 2.6.0. A number of race conditions, some of which might result in deadlock, were discovered and fixed since the 2.4.0 release. Can I suggest that you upgrade to the most recent version of ADIOS and try your luck there? Also, let's make sure you are using the ADIOS/SST RDMA transport. This requires making sure that ADIOS finds a recent version of libfabric during its build process.

eisenhauer commented 4 years ago

One more thing, just in case you still are getting unpredictable failures with 2.6.0: If you set the SstVerbose environment variable, SST will dump out a lot of progress information that's invaluable for us in debugging such things...

AaronV77 commented 4 years ago

@eisenhauer is the SstVerbose environment variable something on the CMake end or is it something that I set? A race condition sounds like what is happening, given that the issue is not consistent.

Question-1: What subsystem are you guys using for the SST engine?

Question-2: What is the buffer amount that the SST engine will send across the network? We've tested various amounts of data being sent using the SST engine and have noticed a flattened curve.

Thanks for your time!

eisenhauer commented 4 years ago

SstVerbose is a run-time shell environment variable that turns on verbosity in the job.

Q1: SST consists of a "control plane" and a "data plane". The control plane manages metadata delivery, opening, closing, stepping, etc. It's implemented using MPI (within each job) and a messaging layer from GaTech called EVPath (between jobs). For the data plane, we have a libfabric-based RDMA data plane that should be used when both reader and writer are on the same cluster, and an EVPath-based data plane that uses TCP sockets (intended for WAN use and debugging).

Q2: That's a hard question to answer because it really depends upon the circumstances, and different optimizations come into play in different use cases. Generally, ADIOS read selection semantics mean that SST doesn't know where written data should go until particular readers request particular bits. This implies a request/response protocol that can limit performance for the TCP data plane, but tends not to be a problem for the RDMA data plane. The TCP data plane does proactively "push" data when it thinks it can, such as when there is only one reader rank (likely to consume all written data), or when LockReaderSelections()/LockWriterDefinitions() have been called to indicate that the data distribution pattern will not change. Similar things haven't been implemented yet for RDMA...

AaronV77 commented 4 years ago

So I assume that if you are using a data plane that uses TCP, the data will be chunked to fit within TCP packets until all of it has been transferred to a reader.

eisenhauer commented 4 years ago

Yeah, we may do a write() on a large block, but TCP is going to packetize. The only thing we really affect is whether or not it happens in EndStep (synchronous push model) or if it's queued until requested (EndStep() is just queueing, write() happens in a network handler thread).

AaronV77 commented 4 years ago

Alright, this information will be great in our upcoming papers! Someone in our HPC office is trying to find someone to install the new software. I will update this issue once I have everything running and see what happens! Talk to you in a bit @eisenhauer.

AaronV77 commented 4 years ago

@eisenhauer they have finally installed ADIOS2 2.6.0 on our Cray supercomputer, and the libraries that CMake creates are different from the 2.4.0 version. Originally I was using the basic -ladios2 in the linking portion of my compile, but ADIOS2 2.6.0 does not include that library anymore. Or is this an error from our HPC staff? I couldn't find anything in the documentation describing the new libraries or whether I should be linking to something different now. For some context, I am using C and MPI.

Side question: in the 2.6.0 libraries listed below, there are none for SST. Is this because the SST engine was not enabled in the CMake build, or did you guys change the name or wrap it into another library? I need to know so that, if they messed something up, I can also tell them to enable the SST feature. Thanks!

Here is a list of the different objects varying between the two versions. 2.4.0:

cmake               
libadios2_atl.so.2.2.1    
libadios2_cmselect.so   
libadios2_dill.so        
libadios2_enet.so         
libadios2_evpath.so  
libadios2_ffs.so.1.6.0  
libadios2.so.2      
libadios2_sst.so.2
libadios2_atl.so    
libadios2_cmenet.so       
libadios2_cmsockets.so  
libadios2_dill.so.2      
libadios2_enet.so.1       
libadios2_ffs.so     
libadios2_f.so          
libadios2.so.2.4.0  
libadios2_sst.so.2.4.0
libadios2_atl.so.2  
libadios2_cmmulticast.so  
libadios2_cmudp.so      
libadios2_dill.so.2.4.0  
libadios2_enet.so.1.3.14  
libadios2_ffs.so.1   
libadios2.so           
libadios2_sst.so    
libtaustubs.so

2.6.0:

cmake                     
libadios2_c_mpi.so        
libadios2_core_mpi.so        
libadios2_c.so.2              
libadios2_cxx11.so.2.6.0  
libadios2_evpath.so             
libadios2_fortran.so
libadios2_atl.so          
libadios2_c_mpi.so.2      
libadios2_core_mpi.so.2     
libadios2_c.so.2.6.0
libadios2_dill.so
libadios2_ffs.so
libadios2_fortran.so.2
libadios2_atl.so.2        
libadios2_c_mpi.so.2.6.0  
libadios2_core_mpi.so.2.6.0  
libadios2_cxx11_mpi.so        
libadios2_dill.so.2       
libadios2_ffs.so.1             
libadios2_fortran.so.2.6.0
libadios2_atl.so.2.2.1    
libadios2_cmselect.so     
libadios2_core.so            
libadios2_cxx11_mpi.so.2      
libadios2_dill.so.2.4.1   
libadios2_ffs.so.1.6.0          
libadios2_taustubs.so
libadios2_cmenet.so       
libadios2_cmsockets.so    
libadios2_core.so.2          
libadios2_cxx11_mpi.so.2.6.0  
libadios2_enet.so         
libadios2_fortran_mpi.so        
python3.4
libadios2_cmepoll.so      
libadios2_cmudp.so        
libadios2_core.so.2.6.0      
libadios2_cxx11.so            
libadios2_enet.so.1       
libadios2_fortran_mpi.so.2
libadios2_cmmulticast.so  
libadios2_cmzplenet.so    
libadios2_c.so               
libadios2_cxx11.so.2          
libadios2_enet.so.1.3.14  
libadios2_fortran_mpi.so.2.6.0

eisenhauer commented 4 years ago

Lots of the library stuff got restructured recently to add MPI-related features (allow MPI and non-MPI linking on one build, etc.). So probably this library arrangement is OK. Just doing -ladios2 might work, but the canonical approach is to use CMake, or if you're not using CMake, to use the output of adios2-config to supply flags for compilation and linking. Info is here: https://adios2.readthedocs.io/en/latest/setting_up/setting_up.html#linking-adios-2

pnorbert commented 4 years ago

Oops, the document is a bit outdated. '--cxx-flags' not 'cxxflags'

$ adios2-config -h
adios2-config [OPTION]
  -h, --help       Display help information
  -v, --version    Display version information
  -c               Both compile and link flags for the C bindings
  --c-flags        Preprocessor and compile flags for the C bindings
  --c-libs         Linker flags for the C bindings
  -x, --cxx        Both compile and link flags for the C++ bindings
  --cxx-flags      Preprocessor and compile flags for the C++ bindings
  --cxx-libs       Linker flags for the C++ bindings
  -f, --fortran    Both compile and link flags for the F90 bindings
  --fortran-flags  Preprocessor and compile flags for the F90 bindings
  --fortran-libs   Linker flags for the F90 bindings
  -s, --serial     Select flags for serial applications
  -m, --mpi        Select flags for mpi applications

AaronV77 commented 4 years ago

Sorry guys that it took a whole weekend to get back to you but I could not get into our HPC last week. I had to dig to find the command that was mentioned above and got an undefined reference. Do you guys know where the adios2_init function might live?


(base) valoroso@onyx11:/p/work/valoroso/test_runner/rt-sensor/src/sensor_system> make onyx
cc -pedantic -g3 -Wall -std=c99 -fpic -o main plugin_manager.o sensor_setup.o queue.o common.o parser.o main.o adios2.o timer.o raster.o -ldl -W -export-dynamic -dynamic -lm /usr/lib64/libpng16.so.16 -L/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64 /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c_mpi.so.2.6.0 /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c.so.2.6.0
/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: sensor_setup.o: in function `compute_sensor_array':
/p/work/valoroso/test_runner/rt-sensor/src/sensor_system/sensor_setup.c:537: undefined reference to `adios2_init'
/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: main.o: in function `master':
/p/work/valoroso/test_runner/rt-sensor/src/sensor_system/main.c:445: undefined reference to `adios2_init'
/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: main.o: in function `workers':
/p/work/valoroso/test_runner/rt-sensor/src/sensor_system/main.c:788: undefined reference to `adios2_init'
/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: /p/work/valoroso/test_runner/rt-sensor/src/sensor_system/main.c:756: undefined reference to `adios2_init'
/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: adios2.o: in function `adios_writer_sst':
/p/work/valoroso/test_runner/rt-sensor/src/sensor_system/adios2.c:172: undefined reference to `adios2_init'
/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: adios2.o:/p/work/valoroso/test_runner/rt-sensor/src/sensor_system/adios2.c:241: more undefined references to `adios2_init' follow
Makefile:54: recipe for target 'main' failed
make: *** [main] Error 1

eisenhauer commented 4 years ago

Hmm. Something is odd here. Another side effect of the MPI sorting I mentioned above is that the old adios2_init() call is broken out into adios2_init_mpi() and adios2_init_serial(). adios2_init() is maintained for backwards compatibility, but it's a macro in the adios2_c_adios.h file (where it is defined as either adios2_init_mpi() or adios2_init_serial(), depending upon how ADIOS was built). Is it possible that your compilation is finding a pre-2.6.0 adios2_c.h include file?

AaronV77 commented 4 years ago

Okay, so I checked both files and got the linkage for each shared object. I don't see any older version of ADIOS2 among the libraries that each shared object points to. I'm also not really good at this, so if there is a better way of investigating the shared objects, please let me know. Also, since I am linking directly to the shared objects, there is no middleman here that might be causing anything.

libadios2_c_mpi.so.2.6.0:

(base) valoroso@onyx01:/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64> ldd libadios2_c_mpi.so.2.6.0 
    linux-vdso.so.1 (0x00007ffee9bfc000)
    libadios2_c.so.2 => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c.so.2 (0x00007f15859d1000)
    libadios2_core_mpi.so.2 => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_core_mpi.so.2 (0x00007f1565800000)
    libadios2_core.so.2 => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_core.so.2 (0x00007f1544d20000)
    libintlc.so.5 => /opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin/libintlc.so.5 (0x00007f1544aae000)
    libcuda.so.1 => /opt/cray/nvidia/default/lib64/libcuda.so.1 (0x00007f1543b33000)
    librca.so.0 => /opt/cray/rca/2.2.18-6.0.7.1_5.48__g2aa4f39.ari/lib64/librca.so.0 (0x00007f154392f000)
    libhugetlbfs.so => /usr/lib64/libhugetlbfs.so (0x00007f15436f9000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f154336f000)
    libimf.so => /opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin/libimf.so (0x00007f1542dcf000)
    libsvml.so => /opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin/libsvml.so (0x00007f154142c000)
    libirng.so => /opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin/libirng.so (0x00007f15410ba000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f1540dbd000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f1540ba5000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f1540800000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f15405fc000)
    libadios2_taustubs.so => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_taustubs.so (0x00007f15205f7000)
    libmpich_intel.so.3 => /opt/cray/pe/mpt/7.6.3/gni/mpich-intel/16.0/lib/libmpich_intel.so.3 (0x00007f152003c000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f151fe1f000)
    libbz2.so.1 => /usr/lib64/libbz2.so.1 (0x00007f151fc10000)
    libpng16.so.16 => /usr/lib64/libpng16.so.16 (0x00007f151f9d3000)
    libadios2_evpath.so => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_evpath.so (0x00007f14ff92b000)
    libadios2_ffs.so.1 => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_ffs.so.1 (0x00007f14df8af000)
    libadios2_atl.so.2 => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_atl.so.2 (0x00007f14bf89f000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f14bf689000)
    libnetcdf_c++4_intel.so.1 => /opt/cray/pe/netcdf/4.6.3.0/intel/18.0/lib/libnetcdf_c++4_intel.so.1 (0x00007f14bf429000)
    libhdf5_intel.so.103 => /opt/cray/pe/hdf5/1.10.5.0/intel/18.0/lib/libhdf5_intel.so.103 (0x00007f14bed86000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f15a5a15000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f14beb7e000)
    libnvidia-fatbinaryloader.so.396.44 => /opt/cray/nvidia/default/lib64/libnvidia-fatbinaryloader.so.396.44 (0x00007f14be932000)
    libxpmem.so.0 => /opt/cray/xpmem/default/lib64/libxpmem.so.0 (0x00007f14be72f000)
    libugni.so.0 => /opt/cray/ugni/default/lib64/libugni.so.0 (0x00007f14be4b2000)
    libudreg.so.0 => /opt/cray/udreg/default/lib64/libudreg.so.0 (0x00007f14be2a8000)
    libpmi.so.0 => /opt/cray/pe/lib64/libpmi.so.0 (0x00007f14be062000)
    libifport.so.5 => /opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin/libifport.so.5 (0x00007f14bde34000)
    libifcore.so.5 => /opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin/libifcore.so.5 (0x00007f14bdad7000)
    libadios2_dill.so.2 => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_dill.so.2 (0x00007f149daa0000)
    libnetcdf_intel.so.15 => /opt/cray/pe/lib64/libnetcdf_intel.so.15 (0x00007f149d759000)
    libhdf5_hl_intel.so.100 => /opt/cray/pe/lib64/libhdf5_hl_intel.so.100 (0x00007f149d530000)
    libhdf5_intel.so.101 => /opt/cray/pe/lib64/libhdf5_intel.so.101 (0x00007f149ce8f000)

libadios2_c.so.2.6.0:

    linux-vdso.so.1 (0x00007ffda91eb000)
    libadios2_core.so.2 => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_core.so.2 (0x00007f4e1961b000)
    libintlc.so.5 => /opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin/libintlc.so.5 (0x00007f4e193a9000)
    libcuda.so.1 => /opt/cray/nvidia/default/lib64/libcuda.so.1 (0x00007f4e1842e000)
    librca.so.0 => /opt/cray/rca/2.2.18-6.0.7.1_5.48__g2aa4f39.ari/lib64/librca.so.0 (0x00007f4e1822a000)
    libhugetlbfs.so => /usr/lib64/libhugetlbfs.so (0x00007f4e17ff4000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f4e17c6a000)
    libimf.so => /opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin/libimf.so (0x00007f4e176ca000)
    libsvml.so => /opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin/libsvml.so (0x00007f4e15d27000)
    libirng.so => /opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin/libirng.so (0x00007f4e159b5000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f4e156b8000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f4e154a0000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f4e150fb000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f4e14ef7000)
    libbz2.so.1 => /usr/lib64/libbz2.so.1 (0x00007f4e14ce8000)
    libpng16.so.16 => /usr/lib64/libpng16.so.16 (0x00007f4e14aab000)
    libadios2_taustubs.so => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_taustubs.so (0x00007f4df4aa6000)
    libadios2_evpath.so => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_evpath.so (0x00007f4dd49fe000)
    libadios2_ffs.so.1 => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_ffs.so.1 (0x00007f4db4982000)
    libadios2_atl.so.2 => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_atl.so.2 (0x00007f4d94972000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f4d9475c000)
    libnetcdf_c++4_intel.so.1 => /opt/cray/pe/netcdf/4.6.3.0/intel/18.0/lib/libnetcdf_c++4_intel.so.1 (0x00007f4d944fc000)
    libhdf5_intel.so.103 => /opt/cray/pe/hdf5/1.10.5.0/intel/18.0/lib/libhdf5_intel.so.103 (0x00007f4d93e59000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4d93c3c000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f4e3a0fb000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f4d93a34000)
    libnvidia-fatbinaryloader.so.396.44 => /opt/cray/nvidia/default/lib64/libnvidia-fatbinaryloader.so.396.44 (0x00007f4d937e8000)
    libadios2_dill.so.2 => /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_dill.so.2 (0x00007f4d737b1000)
    libnetcdf_intel.so.15 => /opt/cray/pe/lib64/libnetcdf_intel.so.15 (0x00007f4d7346a000)
    libhdf5_hl_intel.so.100 => /opt/cray/pe/lib64/libhdf5_hl_intel.so.100 (0x00007f4d73241000)
    libhdf5_intel.so.101 => /opt/cray/pe/lib64/libhdf5_intel.so.101 (0x00007f4d72ba0000)

eisenhauer commented 4 years ago

So, this is almost certainly a compile-time problem, rather than a link-time problem, because adios2_init won't live in any of those new libraries. Two things to check. First, please make sure that you deleted all the old .o files from your project and rebuilt from scratch. Anything left over that was compiled with 2.4.0 include files would cause problems. If that doesn't fix things, then please run "adios2-config -c", look for a -I (or -isystem) flag in the output, and do an 'ls' on the directory associated with that flag.
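The second check can be scripted. The sketch below is only an illustration and is not part of ADIOS2: it pulls include directories out of a flag string and lists any adios2 headers found in each one. The helper name include_dirs and the fallback sample path are made up for this example. It accepts both -I and -isystem, since the adios2-config output quoted elsewhere in this thread uses -isystem.

```shell
#!/bin/sh
# Print the directories named by "-I<dir>", "-I <dir>", or "-isystem <dir>".
# include_dirs is a hypothetical helper written for this sketch.
include_dirs() {
    prev=""
    for tok in "$@"; do
        case "$prev" in
            -I|-isystem) echo "$tok"; prev=""; continue ;;
        esac
        case "$tok" in
            -I|-isystem) prev="$tok" ;;
            -I?*) echo "${tok#-I}" ;;
        esac
    done
}

if command -v adios2-config >/dev/null 2>&1; then
    flags=$(adios2-config --c-flags)
else
    # Fallback sample for illustration only; the path is hypothetical.
    flags="-DADIOS2_USE_MPI -isystem /opt/adios2/include"
fi

# Word splitting of $flags is intentional: each flag becomes one argument.
for dir in $(include_dirs $flags); do
    echo "include dir: $dir"
    ls "$dir" 2>/dev/null | grep '^adios2' || echo "  (no adios2 headers found in $dir)"
done
```

If the listed directory holds headers from an older install, that would point to the compile step picking up the wrong include path.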

AaronV77 commented 4 years ago

Good call on the cleaning of the .o files because I just assumed Make would recompile if the Make file changed. I had to pass in the include path as well to the Make file. The last issue that I am getting is the following:

sensor_setup.c(537): error #55: too many arguments in invocation of macro "adios2_init"
      adios2_adios * adios = adios2_init(MPI_COMM_SELF, adios2_debug_mode_on);

Has the macro changed the number of acceptable arguments?

eisenhauer commented 4 years ago

Hmm. There may be an issue with the ADIOS2 build your guys have installed, but let's try something. Can you replace that line with this one: adios2_adios * adios = adios2_init_mpi(MPI_COMM_SELF); (i.e., drop the debug mode argument (it's been deprecated), and add _mpi to the subroutine name...)

AaronV77 commented 4 years ago

I'm getting the current warnings/errors when I update the function usage:

sensor_setup.c(537): warning #266: function "adios2_init_mpi" declared implicitly
      adios2_adios * adios = adios2_init_mpi(MPI_COMM_SELF);
                             ^

sensor_setup.c(537): warning #144: a value of type "int" cannot be used to initialize an entity of type "adios2_adios *"
      adios2_adios * adios = adios2_init_mpi(MPI_COMM_SELF);
                             ^

icc: warning #10145: no action performed for file '/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c_mpi.so.2.6.0'
icc: warning #10145: no action performed for file '/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c.so.2.6.0'

In my header files, I am including #include "adios2_c.h"

eisenhauer commented 4 years ago

No, I think you're doing the right thing, but your version of ADIOS has not been compiled with MPI support. If it had been, the adios2_init() macro would have two arguments and adios2_init_mpi() should be defined. @bradking or @chuckatkins, do you think this is a reasonable inference?

AaronV77 commented 4 years ago

So @eisenhauer do you think I should just install ADIOS2 into my local directory or try to get the HPC office to resolve the issues?

bradking commented 4 years ago

See the definition of adios2_init_mpi here. It is conditioned on ADIOS2_USE_MPI, which must be defined by the consuming application if it wants to use the MPI support. That macro is defined automatically by ADIOS2's officially supported methods for consumption:

eisenhauer commented 4 years ago

Ah, that's what I was missing. @AaronV77 , this is a change in ADIOS 2.6.0 in that the simple way you were including ADIOS in your application in 2.4.0 no longer works, with particular implications for the C bindings. If you were using the output of 'adios2-config --c-flags', you'd be getting the appropriate -I for the adios2.h include and the -DADIOS2_USE_MPI. I hadn't really focused on this because polymorphism lets it be a bit nicer in C++.
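For a hand-written compile line, a quick sanity check is to confirm that the define is actually present. The following is a minimal sketch under stated assumptions: has_adios2_mpi_define is a hypothetical helper, the sample flags are invented, and it only matches the exact token -DADIOS2_USE_MPI (not a variant such as -DADIOS2_USE_MPI=1).

```shell
#!/bin/sh
# Return success if the given compile flags contain the token -DADIOS2_USE_MPI.
# has_adios2_mpi_define is a hypothetical helper written for this sketch.
has_adios2_mpi_define() {
    case " $* " in
        *" -DADIOS2_USE_MPI "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample hand-written flags for illustration; the include path is hypothetical.
hand_written_flags="-std=c99 -I/opt/adios2/include"

# Word splitting of $hand_written_flags is intentional.
if has_adios2_mpi_define $hand_written_flags; then
    echo "MPI define present"
else
    echo "MPI define missing: add -DADIOS2_USE_MPI or use the output of adios2-config --c-flags"
fi
```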

AaronV77 commented 4 years ago

Alright, so I have to pass that flag even though the library was compiled with MPI, kind of quirky. By adding that flag everything compiles but I am still receiving the following warning:

icc: warning #10145: no action performed for file '/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c_mpi.so.2.6.0'
icc: warning #10145: no action performed for file '/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c.so.2.6.0'

I've never seen this warning before, have either of you? Lastly, this might be a personal misunderstanding, but I have the following:

cc -std=c99 -fpic -c sstr.c -DADIOS2_USE_MPI -isystem /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/include -Wl,-rpath,/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64 /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c_mpi.so.2.6.0 /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c.so.2.6.0 -Wl,-rpath-link,/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64   

This compile works and produces the warning at the very top of this comment, but it only seems to work when compiling against my library. If I want to use ADIOS2 in a simple program that has a main, just to test things out, I get the following:

/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: attempted static link of dynamic object `/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c_mpi.so.2.6.0'

I realized that this error is caused by removing the -c option at the beginning of my compile command for the simple program. Again, I believe that this is a personal misunderstanding and not an ADIOS2-related issue now.

bradking commented 4 years ago

"I have to pass that flag even though the library was compiled with MPI, kind of quirky."

The reason is that starting with version 2.6, ADIOS2 now supports building serial applications against an ADIOS2 that was built with MPI support enabled. The choice is now made at the time the application builds rather than the time ADIOS2 builds.

As for the warning, compiling with cc ... -c ... says to compile to an object file and not to link. Passing library files to that command does indeed do nothing with them, hence the warning. You can take off -c.

AaronV77 commented 4 years ago

@bradking I'm going to have to look more into serial applications because this is the first time I've ever heard of such a thing. As for the warning if I remove the -c flag I will get the following error:

/usr/lib64/gcc/x86_64-suse-linux/4.8/../../../../x86_64-suse-linux/bin/ld: attempted static link of dynamic object `/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c_mpi.so.2.6.0'

bradking commented 4 years ago

You can read "serial applications" as simply "applications not using MPI". In ADIOS2 2.5 and below, if ADIOS2 was built with MPI support then the application had to be built with MPI as well, and vice versa. Now applications have the choice.

Please post your complete command line invocation without -c so we can try to identify what causes that error.

AaronV77 commented 4 years ago

My apologies for not including the whole line:

cc -pedantic -Wall -std=c99 -fpic sensor_setup.c -DADIOS2_USE_MPI -isystem /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/include -Wl,-rpath,/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64 /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c_mpi.so.2.6.0 /p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64/libadios2_c.so.2.6.0 -Wl,-rpath-link,/p/app/unsupported/adios/2.6.0-intel-19.0.1.144/lib64

bradking commented 4 years ago

That command line looks okay. The error looks like a Cray thing. Try export CRAYPE_LINK_TYPE=dynamic to tell the Cray tools it is okay to link a dynamic executable.
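As a minimal sketch, the variable has to be exported in the same shell (or job script) that later runs the compiler wrapper or make, so that the wrapper inherits it. The commented-out build line is only an illustration using the file name from the command above.

```shell
#!/bin/sh
# Ask the Cray compiler wrappers to link dynamically.
export CRAYPE_LINK_TYPE=dynamic

# Illustrative follow-up (commented out): rerun the build from this same shell.
# cc $(adios2-config -m -c) -o sensor_setup sensor_setup.c

echo "CRAYPE_LINK_TYPE=${CRAYPE_LINK_TYPE}"
```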

chuckatkins commented 4 years ago

@AaronV77 we've never actually supported directly using -I/path/to/adios/include -ladios2, it just often happened to work. As of 2.6.0 an MPI build of ADIOS produces both an MPI and serial version of the library and using the supported adios2-config script provides the necessary flags to use either. The default behavior of adios2-config will enable MPI if built with MPI and disable it if not but you can also explicitly select which option you choose to use:

$ adios2-config 
adios2-config [OPTION]
  -h, --help       Display help information
  -v, --version    Display version information
  -c               Both compile and link flags for the C bindings
  --c-flags        Preprocessor and compile flags for the C bindings
  --c-libs         Linker flags for the C bindings
  -x, --cxx        Both compile and link flags for the C++ bindings
  --cxx-flags      Preprocessor and compile flags for the C++ bindings
  --cxx-libs       Linker flags for the C++ bindings
  -f, --fortran    Both compile and link flags for the F90 bindings
  --fortran-flags  Preprocessor and compile flags for the F90 bindings
  --fortran-libs   Linker flags for the F90 bindings
  -s, --serial     Select flags for serial applications
  -m, --mpi        Select flags for mpi applications

So to get just the compile flags for the adios MPI C bindings:

$ adios2-config -m --c-flags
-DADIOS2_USE_MPI -isystem /home/khq.kitware.com/chuck.atkins/Code/adios2/install/master/include 

And to get the link flags:

$ adios2-config -m --c-libs
-Wl,-rpath,/home/khq.kitware.com/chuck.atkins/Code/adios2/install/master/lib64 /home/khq.kitware.com/chuck.atkins/Code/adios2/install/master/lib64/libadios2_c_mpi.so.2.6.0 /home/khq.kitware.com/chuck.atkins/Code/adios2/install/master/lib64/libadios2_c.so.2.6.0 -Wl,-rpath-link,/home/khq.kitware.com/chuck.atkins/Code/adios2/install/master/lib64

The preferred way to use these in your compilation would be:

cc $(adios2-config -m --c-flags) -o foo.o -c foo.c
cc $(adios2-config -m --c-libs) -o foo foo.o

or in a single step:

cc $(adios2-config -m -c) -o foo foo.c

Note that the addition of -m/-s is optional as it will default to whether or not adios was built with or without MPI, but can always be added to be explicit.

This is the same pattern used by pkg-config to provide necessary usage flags from .pc files; rather than writing the flags out by hand, you use the output of the command that produces them.

AaronV77 commented 4 years ago

@bradking I got everything linking and working correctly now! I owe you a round of root beers for the Cray variable; I would never have found that.

@eisenhauer I have to rebuild and fix some issues with the application to see if I can repeat the issues found in version 2.4.0 (hopefully they don't pop up). So give me like a day or two to get that working and respond.

@chuckatkins of course I would find a feature that just happened to be supported! lol, thanks for the comment.

chuckatkins commented 4 years ago

"@chuckatkins if adios2-config cannot be found by default on the HPC, would it be better to just link the command as a Bash alias, or would this work: $(/path/to/adios2-config -m -c)?"

Using the full path is fine, or an alias, really whatever works for you in your environment. It shouldn't affect the output.
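One way to make a build script work in either case is sketched below. ADIOS2_CONFIG is a hypothetical variable name chosen for this example: set it to the full path of the script, or leave it unset to search PATH. The -m, --c-flags, and --c-libs options are the ones shown in the help output above.

```shell
#!/bin/sh
# Print the adios2-config to use: an explicit path from ADIOS2_CONFIG
# (hypothetical variable for this sketch) or whatever is found on PATH.
resolve_adios2_config() {
    if [ -n "${ADIOS2_CONFIG:-}" ]; then
        echo "$ADIOS2_CONFIG"
    else
        command -v adios2-config || true
    fi
}

cfg=$(resolve_adios2_config)

if [ -n "$cfg" ] && [ -x "$cfg" ]; then
    adios2_cflags=$("$cfg" -m --c-flags)
    adios2_libs=$("$cfg" -m --c-libs)
    echo "compile flags: $adios2_cflags"
    echo "link flags:    $adios2_libs"
else
    echo "adios2-config not found; set ADIOS2_CONFIG to its full path" >&2
fi
```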

AaronV77 commented 4 years ago

@eisenhauer after getting everything re-implemented and the workflow adjusted to use ADIOS2 2.6.0, I am still having issues where SST hangs. I went back through the comments to look for the "SstVerbose" variable you mentioned and couldn't find anything in the documentation about it. What do I set this variable as and how can I receive the debug information? Is it displayed to the screen or do I have to set up the adios2 debug system?

eisenhauer commented 4 years ago

SstVerbose is an environment variable you can set to anything not empty. Output goes to stdout (or stderr).
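A minimal sketch of how that might look in a job script follows. The value 1 is an arbitrary non-empty choice, and the launch lines are hypothetical placeholders (launcher, rank counts, binary names, and log file names are all assumptions), so they are left commented out.

```shell
#!/bin/sh
# Any non-empty value should enable the extra SST output; "1" is arbitrary.
export SstVerbose=1

# Hypothetical launch lines: replace the launcher and binaries with your own.
# Redirect both stdout and stderr, since the output may arrive on either.
# aprun -n 1 ./writer > writer_sst.log 2>&1 &
# aprun -n 1 ./reader > reader_sst.log 2>&1

echo "SstVerbose is set to '${SstVerbose}'"
```

Keeping one log file per application makes it easier to attach the reader and writer output separately.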

AaronV77 commented 4 years ago

To save some screen space, I've included only the last couple of lines of output from the SST engine. One thing I noticed is that the reader uses "processor_comms3.sst" and the writer uses "processor_comms3". In my application and the one I am communicating with, I use "processor_comms#" to signify the communication channel name. I don't add ".sst" to either of them. I could email you the full output if needed.

Reader:

DP Reader 0 (0x1000555c970): Considering DataPlane "evpath" for possible use, priority is 1
DP Reader 0 (0x1000555c970): Selecting DataPlane "evpath", priority 1 for use
Reader 0 (0x1000555c970): Looking for writer contact in file processor_comms3.sst, with timeout 60 secs
Reader 0 (0x1000555c970): Waiting for writer response message in SstReadOpen("processor_comms3")
aprun: Apid 18291905: Caught signal Terminated, sending to application

Writer:

Writer 0 (0x2aabe76031a0): Stream "processor_comms3" waiting for 1 readers
aprun: Apid 18291904: Caught signal Terminated, sending to application
_pmiu_daemon(SIGCHLD): [NID 04951] [c11-1c2s5n3] [Thu Jun 25 14:02:45 2020] PE RANK 11 exit signal Terminated

eisenhauer commented 4 years ago

Hmm. First let me say that not adding .sst to either name is fine. As long as you use the same string for open in the reader and writer you should be fine.

I'm probably going to need to see more of the output from the readers and writers because I'm not clear what's going on. It looks like the reader is finding the processor_comms3.sst file, initiating a connection to the writer and sending a join message, but the writer doesn't seem to be receiving it. Offhand, I don't know any reason why that might be happening, but I'm intrigued by your mentioning "processor_comms#". Are you opening several streams here? Maybe best to attach all the output from the reader and writer as files... That might help me make sense of things... (Also, I'm not very available tomorrow, so will respond when I can.)
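While collecting that output, it may also help to look at the contact file itself. The sketch below is an assumption-laden illustration, not an ADIOS2 tool: wait_for_file is a hypothetical helper, the 2-second timeout is arbitrary, and it assumes the .sst file is written to the current working directory. The listing's timestamp would show whether the file belongs to the current run.

```shell
#!/bin/sh
# Poll once per second until the file exists; return 1 if it has not
# appeared after timeout_secs seconds. Hypothetical helper for this sketch.
wait_for_file() {
    file=$1
    timeout_secs=$2
    waited=0
    while [ ! -e "$file" ]; do
        if [ "$waited" -ge "$timeout_secs" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# File name taken from the reader log above; the timeout is arbitrary.
if wait_for_file "processor_comms3.sst" 2; then
    ls -l processor_comms3.sst
else
    echo "processor_comms3.sst did not appear within 2 seconds"
fi
```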

AaronV77 commented 4 years ago

@eisenhauer I've bothered you enough the last couple of days, so respond whenever you can. I appreciate all the help and information that I have received from you guys. Yes, I am opening several streams at once with different communication channel names. Alright, so here is all the output:

reader:

Reader 0 (0x2aabdc001aa0): Sst set to use sockets as a Control Transport
DP Reader 0 (0x2aabdc001aa0): Considering DataPlane "evpath" for possible use, priority is 1
DP Reader 0 (0x2aabdc001aa0): Selecting DataPlane "evpath", priority 1 for use
Reader 0 (0x2aabdc001aa0): Looking for writer contact in file master_comms.sst, with timeout 60 secs
Reader 0 (0x2aabdc001aa0): Waiting for writer response message in SstReadOpen("master_comms")
Reader 0 (0x2aabdc001aa0): finished wait writer response message in read_open
Reader 0 (0x2aabdc001aa0): Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x2aabdc001aa0): Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x2aabdc001aa0): Writer is doing BP-based marshalling
Reader 0 (0x2aabdc001aa0): Writer is using Minimum Connection Communication pattern (min)
DP Reader 0 (0x2aabdc001aa0): Received contact info "Writer Rank 0, test contact", WS_stream 0x10000091f80 for WSR Rank 0
Reader 0 (0x2aabdc001aa0): Sending Reader Activate messages to writer
Reader 0 (0x2aabdc001aa0): Finish opening Stream "master_comms", starting with Step number 0
Reader 0 (0x2aabdc001aa0): Wait for next metadata after last timestep -1
Reader 0 (0x2aabdc001aa0): Waiting for metadata for a Timestep later than TS -1
Reader 0 (0x2aabdc001aa0): (PID 966a, TID 2aabdb053700) Stream status is Established
DP Reader 0 (0x2aabdc001aa0): Got a preload message from writer rank 0 for timestep 0
Reader 0 (0x2aabdc001aa0): Received a Timestep metadata message for timestep 0, signaling condition
DP Reader 0 (0x2aabdc001aa0): Got a preload message from writer rank 0 for timestep 1
Reader 0 (0x2aabdc001aa0): Received a Timestep metadata message for timestep 1, signaling condition
Reader 0 (0x2aabdc001aa0): Received a writer close message. Timestep 1 was the final timestep.
Reader 0 (0x2aabdc001aa0): Examining metadata for Timestep 0
Reader 0 (0x2aabdc001aa0): Returning metadata for Timestep 0
Reader 0 (0x2aabdc001aa0): Setting TSmsg to Rootentry value
DP Reader 0 (0x2aabdc001aa0): EVPATH registering reader arrival of TS 0 metadata, preload mode 1
DP Reader 0 (0x2aabdc001aa0): EVPATH registering reader arrival of TS 1 metadata, preload mode 1
Reader 0 (0x2aabdc001aa0): SstAdvanceStep returning Success on timestep 0
DP Reader 0 (0x2aabdc001aa0): Satisfying remote memory read with preload from writer rank 0 for timestep 0
Reader 0 (0x2aabdc001aa0): Sending ReleaseTimestep message for timestep 0, one to each writer
Reader 0 (0x2aabdc001aa0): Wait for next metadata after last timestep 0
Reader 0 (0x2aabdc001aa0): Examining metadata for Timestep 1
Reader 0 (0x2aabdc001aa0): Returning metadata for Timestep 1
Reader 0 (0x2aabdc001aa0): Setting TSmsg to Rootentry value
Reader 0 (0x2aabdc001aa0): SstAdvanceStep returning Success on timestep 1
DP Reader 0 (0x2aabdc001aa0): Satisfying remote memory read with preload from writer rank 0 for timestep 1
Reader 0 (0x2aabdc001aa0): Sending ReleaseTimestep message for timestep 1, one to each writer
Reader 0 (0x2aabdc001aa0): Reader-side close handler invoked
Reader 0 (0x2aabdc001aa0): Reader-side Rank received a connection-close event after close, not unexpected
Reader 0 (0x2aabdc001aa0): Destroying stream 0x2aabdc001aa0, name master_comms
Reader 0 (0x2aabdc001aa0): Reference count now zero, Destroying process SST info cache
Reader 0 (0x2aabdc001aa0): Freeing LastCallList
Reader 0 (0x2aabdb0526e0): SstStreamDestroy successful, returning
Reader 0 (0x2aabe75698e0): Sst set to use sockets as a Control Transport
DP Reader 0 (0x2aabe75698e0): Considering DataPlane "evpath" for possible use, priority is 1
DP Reader 0 (0x2aabe75698e0): Selecting DataPlane "evpath", priority 1 for use
Reader 0 (0x2aabe75698e0): Looking for writer contact in file processor_comms1.sst, with timeout 60 secs
Reader 0 (0x2aabe75698e0): Waiting for writer response message in SstReadOpen("processor_comms1")
Reader 0 (0x2aabe75698e0): finished wait writer response message in read_open
Reader 0 (0x2aabe75698e0): Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x2aabe75698e0): Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x2aabe75698e0): Writer is doing BP-based marshalling
Reader 0 (0x2aabe75698e0): Writer is using Minimum Connection Communication pattern (min)
DP Reader 0 (0x2aabe75698e0): Received contact info "Writer Rank 0, test contact", WS_stream 0x10005549d70 for WSR Rank 0
Reader 0 (0x2aabe75698e0): Sending Reader Activate messages to writer
Reader 0 (0x2aabe75698e0): Finish opening Stream "processor_comms1", starting with Step number 0
Writer 0 (0x2aabe757e720): Sst set to use sockets as a Control Transport
DP Writer 0 (0x2aabe757e720): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x2aabe757e720): Selecting DataPlane "evpath", priority 1 for use
Writer 0 (0x2aabe757e720): Opening Stream "processor_comms1"
Writer 0 (0x2aabe757e720): Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer 0 (0x2aabe757e720): Stream "processor_comms1" waiting for 1 readers
Writer 0 (0x2aabe757e720): Beginning writer-side reader open protocol
DP Writer 0 (0x2aabe757e720): Received contact info "AAIAAJTJ8o2LjwAAATkCmH8TgAo=", RD_Stream 0x1000555d9b0 for Reader Rank 0
Writer 0 (0x2aabe757e720): Setting SpeculativePreload ON for new reader
Writer 0 (0x2aabe757e720): My oldest timestep was 0, global oldest timestep was 0
Writer 0 (0x2aabe757e720): Finish writer-side reader open protocol for reader 0x2aabe757f8f0, reader ready response pending
Writer 0 (0x2aabe757e720): (PID 966a, TID 2aabdb254700) Waiting for Reader ready on WSR 0x2aabe757f8f0.
Writer 0 (0x2aabe757e720): Reader Activate message received for Stream 0x2aabe757f8f0.  Setting state to Established.
Writer 0 (0x2aabe757e720): Parent stream reader count is now 1.
Writer 0 (0x2aabe757e720): Reader ready on WSR 0x2aabe757f8f0, Stream established, Starting 0 LastProvided 0.
Writer 0 (0x2aabe757e720): Finish opening Stream "processor_comms1"
Reader 0 (0x2aabe7583990): Sst set to use sockets as a Control Transport
DP Reader 0 (0x2aabe7583990): Considering DataPlane "evpath" for possible use, priority is 1
DP Reader 0 (0x2aabe7583990): Selecting DataPlane "evpath", priority 1 for use
Reader 0 (0x2aabe7583990): Looking for writer contact in file processor_comms2.sst, with timeout 60 secs
Reader 0 (0x2aabe7583990): Waiting for writer response message in SstReadOpen("processor_comms2")
DP Reader 0 (0x2aabe75698e0): Got a preload message from writer rank 0 for timestep 0
Reader 0 (0x2aabe75698e0): Received a Timestep metadata message for timestep 0, signaling condition
Reader 0 (0x2aabe7583990): finished wait writer response message in read_open
Reader 0 (0x2aabe7583990): Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x2aabe7583990): Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x2aabe7583990): Writer is doing BP-based marshalling
Reader 0 (0x2aabe7583990): Writer is using Minimum Connection Communication pattern (min)
DP Reader 0 (0x2aabe7583990): Received contact info "Writer Rank 0, test contact", WS_stream 0x100055490b0 for WSR Rank 0
Reader 0 (0x2aabe7583990): Sending Reader Activate messages to writer
Reader 0 (0x2aabe7583990): Finish opening Stream "processor_comms2", starting with Step number 0
Writer 0 (0x2aabe7584c40): Sst set to use sockets as a Control Transport
DP Writer 0 (0x2aabe7584c40): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x2aabe7584c40): Selecting DataPlane "evpath", priority 1 for use
Writer 0 (0x2aabe7584c40): Opening Stream "processor_comms2"
Writer 0 (0x2aabe7584c40): Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer 0 (0x2aabe7584c40): Stream "processor_comms2" waiting for 1 readers
Writer 0 (0x2aabe7584c40): Beginning writer-side reader open protocol
DP Writer 0 (0x2aabe7584c40): Received contact info "AAIAAJTJ8o1wiAAAATkCmH8TgAo=", RD_Stream 0x100055588b0 for Reader Rank 0
Writer 0 (0x2aabe7584c40): Setting SpeculativePreload ON for new reader
Writer 0 (0x2aabe7584c40): My oldest timestep was 0, global oldest timestep was 0
Writer 0 (0x2aabe7584c40): Finish writer-side reader open protocol for reader 0x2aabe75842d0, reader ready response pending
Writer 0 (0x2aabe7584c40): (PID 966a, TID 2aabdb254700) Waiting for Reader ready on WSR 0x2aabe75842d0.
Writer 0 (0x2aabe7584c40): Reader Activate message received for Stream 0x2aabe75842d0.  Setting state to Established.
Writer 0 (0x2aabe7584c40): Parent stream reader count is now 1.
Writer 0 (0x2aabe7584c40): Reader ready on WSR 0x2aabe75842d0, Stream established, Starting 0 LastProvided 0.
Writer 0 (0x2aabe7584c40): Finish opening Stream "processor_comms2"
Reader 0 (0x2aabe7587f40): Sst set to use sockets as a Control Transport
DP Reader 0 (0x2aabe7587f40): Considering DataPlane "evpath" for possible use, priority is 1
DP Reader 0 (0x2aabe7587f40): Selecting DataPlane "evpath", priority 1 for use
Reader 0 (0x2aabe7587f40): Looking for writer contact in file processor_comms3.sst, with timeout 60 secs
Reader 0 (0x2aabe7587f40): Waiting for writer response message in SstReadOpen("processor_comms3")
DP Reader 0 (0x2aabe7583990): Got a preload message from writer rank 0 for timestep 0
Reader 0 (0x2aabe7583990): Received a Timestep metadata message for timestep 0, signaling condition
Reader 0 (0x2aabe7587f40): finished wait writer response message in read_open
Reader 0 (0x2aabe7587f40): Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x2aabe7587f40): Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x2aabe7587f40): Writer is doing BP-based marshalling
Reader 0 (0x2aabe7587f40): Writer is using Minimum Connection Communication pattern (min)
DP Reader 0 (0x2aabe7587f40): Received contact info "Writer Rank 0, test contact", WS_stream 0x100055490c0 for WSR Rank 0
Reader 0 (0x2aabe7587f40): Sending Reader Activate messages to writer
Reader 0 (0x2aabe7587f40): Finish opening Stream "processor_comms3", starting with Step number 0
Writer 0 (0x2aabe76031a0): Sst set to use sockets as a Control Transport
DP Writer 0 (0x2aabe76031a0): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x2aabe76031a0): Selecting DataPlane "evpath", priority 1 for use
Writer 0 (0x2aabe76031a0): Opening Stream "processor_comms3"
Writer 0 (0x2aabe76031a0): Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer 0 (0x2aabe76031a0): Stream "processor_comms3" waiting for 1 readers
aprun: Apid 18291904: Caught signal Terminated, sending to application
_pmiu_daemon(SIGCHLD): [NID 04951] [c11-1c2s5n3] [Thu Jun 25 14:02:45 2020] PE RANK 11 exit signal Terminated

writer:

Writer 0 (0x100000886c0): Sst set to use sockets as a Control Transport
DP Writer 0 (0x100000886c0): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x100000886c0): Selecting DataPlane "evpath", priority 1 for use
Writer 0 (0x100000886c0): Opening Stream "master_comms"
Writer 0 (0x100000886c0): Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer 0 (0x100000886c0): Stream "master_comms" waiting for 1 readers
Writer 0 (0x100000886c0): Beginning writer-side reader open protocol
DP Writer 0 (0x100000886c0): Received contact info "AAIAAJTJ8o0ThQAAATkCmH4TgAo=", RD_Stream 0x2aabdc00ab90 for Reader Rank 0
Writer 0 (0x100000886c0): Setting SpeculativePreload ON for new reader
Writer 0 (0x100000886c0): My oldest timestep was 0, global oldest timestep was 0
Writer 0 (0x100000886c0): Finish writer-side reader open protocol for reader 0x10000091a30, reader ready response pending
Writer 0 (0x100000886c0): (PID ea25, TID 2aaaaab5bfc0) Waiting for Reader ready on WSR 0x10000091a30.
Writer 0 (0x100000886c0): Reader Activate message received for Stream 0x10000091a30.  Setting state to Established.
Writer 0 (0x100000886c0): Parent stream reader count is now 1.
Writer 0 (0x100000886c0): Reader ready on WSR 0x10000091a30, Stream established, Starting 0 LastProvided 0.
Writer 0 (0x100000886c0): Finish opening Stream "master_comms"
Writer 0 (0x100000886c0): Reader 0 status Established has last released 4294967295, last sent 0
Writer 0 (0x100000886c0): QueueMaintenance, smallest last released = -1, count = 1
Writer 0 (0x100000886c0): Removing dead entries
Writer 0 (0x100000886c0): QueueMaintenance complete
Writer 0 (0x100000886c0): Sending TimestepMetadata for timestep 0 (ref count 1), one to each reader
Writer 0 (0x100000886c0): Sent timestep 0 to reader cohort 0
Writer 0 (0x100000886c0): ADDING timestep 0 to sent list for reader cohort 0, READER 0x10000091a30, reference count is now 2
Writer 0 (0x100000886c0): PRELOADMODE for timestep 0 non-default for reader , active at timestep 0, mode 1
DP Writer 0 (0x100000886c0): Per reader registration for timestep 0, preload mode 1
DP Writer 0 (0x100000886c0): Sending Speculative Preload messages, reader 0x10000091f80, timestep 0
Writer 0 (0x100000886c0): Sending a message to reader 0 (0x2aabdc001aa0)
Writer 0 (0x100000886c0): SubRef : Writer-side Timestep 0 now has reference count 1, expired 0, precious 0
Writer 0 (0x100000886c0): Reader 0 status Established has last released 4294967295, last sent 0
Writer 0 (0x100000886c0): QueueMaintenance, smallest last released = -1, count = 1
Writer 0 (0x100000886c0): Removing dead entries
Writer 0 (0x100000886c0): QueueMaintenance complete
Writer 0 (0x100000886c0): Reader 0 status Established has last released 4294967295, last sent 0
Writer 0 (0x100000886c0): QueueMaintenance, smallest last released = -1, count = 2
Writer 0 (0x100000886c0): Removing dead entries
Writer 0 (0x100000886c0): QueueMaintenance complete
Writer 0 (0x100000886c0): Sending TimestepMetadata for timestep 1 (ref count 1), one to each reader
Writer 0 (0x100000886c0): Sent timestep 1 to reader cohort 0
Writer 0 (0x100000886c0): ADDING timestep 1 to sent list for reader cohort 0, READER 0x10000091a30, reference count is now 2
Writer 0 (0x100000886c0): PRELOADMODE for timestep 1 non-default for reader , active at timestep 0, mode 1
DP Writer 0 (0x100000886c0): Per reader registration for timestep 1, preload mode 1
DP Writer 0 (0x100000886c0): Sending Speculative Preload messages, reader 0x10000091f80, timestep 1
Writer 0 (0x100000886c0): Sending a message to reader 0 (0x2aabdc001aa0)
Writer 0 (0x100000886c0): SubRef : Writer-side Timestep 1 now has reference count 1, expired 0, precious 0
Writer 0 (0x100000886c0): Reader 0 status Established has last released 4294967295, last sent 1
Writer 0 (0x100000886c0): QueueMaintenance, smallest last released = -1, count = 2
Writer 0 (0x100000886c0): Removing dead entries
Writer 0 (0x100000886c0): QueueMaintenance complete
Writer 0 (0x100000886c0): SstWriterClose, Sending Close at Timestep 1, one to each reader
Writer 0 (0x100000886c0): Working on reader cohort 0
Writer 0 (0x100000886c0): Sending a message to reader 0 (0x2aabdc001aa0)
Writer 0 (0x100000886c0): Reader 0 status Established has last released 4294967295, last sent 1
Writer 0 (0x100000886c0): QueueMaintenance, smallest last released = -1, count = 2
Writer 0 (0x100000886c0): Removing dead entries
Writer 0 (0x100000886c0): QueueMaintenance complete
Writer 0 (0x10000091620): Sst set to use sockets as a Control Transport
Writer 0 (0x10000091620): Sst set to use sockets as a Control Transport
Writer 0 (0x10000091620): Sst set to use sockets as a Control Transport
Writer 0 (0x10000092100): Sst set to use sockets as a Control Transport
DP Writer 0 (0x10000091620): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x10000091620): Selecting DataPlane "evpath", priority 1 for use
DP Writer 0 (0x10000091620): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x10000091620): Selecting DataPlane "evpath", priority 1 for use
DP Writer 0 (0x10000092100): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x10000092100): Selecting DataPlane "evpath", priority 1 for use
DP Writer 0 (0x10000091620): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x10000091620): Selecting DataPlane "evpath", priority 1 for use
Writer 0 (0x10000092100): Opening Stream "processor_comms1"
Writer 0 (0x10000092100): Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer 0 (0x10000092100): Stream "processor_comms1" waiting for 1 readers
Writer 0 (0x10000091620): Opening Stream "processor_comms2"
Writer 0 (0x10000091620): Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer 0 (0x10000091620): Stream "processor_comms2" waiting for 1 readers
Writer 0 (0x10000091620): Opening Stream "processor_comms3"
Writer 0 (0x10000091620): Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer 0 (0x10000091620): Stream "processor_comms3" waiting for 1 readers
Writer 0 (0x10000091620): Opening Stream "processor_comms4"
Writer 0 (0x10000091620): Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer 0 (0x10000091620): Stream "processor_comms4" waiting for 1 readers
Writer 0 (0x100000886c0): Waiting for timesteps to be released in WriterClose
Writer 0 (0x100000886c0): IN TS WAIT, ENTRIES are Timestep 1 (exp 0, Prec 0, Ref 1), Count now 2
Writer 0 (0x100000886c0): IN TS WAIT, ENTRIES are Timestep 0 (exp 0, Prec 0, Ref 1), Count now 2
Writer 0 (0x100000886c0): The timesteps still queued are: 1 0 
Writer 0 (0x100000886c0): Reader Count is 1
Writer 0 (0x100000886c0): Reader [0] status is Established
Writer 0 (0x100000886c0): Received a release timestep message for timestep 0 from reader cohort 0
Writer 0 (0x100000886c0): Got the lock in release timestep
Writer 0 (0x100000886c0): Doing dereference sent
Writer 0 (0x100000886c0): Reader sent timestep list 0x100000ac250, trying to release 0
Writer 0 (0x100000886c0): Reader considering sent timestep 0,trying to release 0
Writer 0 (0x100000886c0): SubRef : Writer-side Timestep 0 now has reference count 0, expired 0, precious 0
Writer 0 (0x100000886c0): Doing QueueMaint
Writer 0 (0x100000886c0): Reader 0 status Established has last released 0, last sent 1
Writer 0 (0x100000886c0): QueueMaintenance, smallest last released = 0, count = 2
Writer 0 (0x100000886c0): Writer tagging timestep 0 as expired
DP Writer 0 (0x100000886c0): Releasing timestep 0
Writer 0 (0x100000886c0): Removing dead entries
Writer 0 (0x100000886c0): Remove queue Entries removing Timestep 0 (exp 1, Prec 0, Ref 0), Count now 1
Writer 0 (0x100000886c0): QueueMaintenance complete
Writer 0 (0x100000886c0): Releasing the lock in release timestep
Writer 0 (0x100000886c0): Received a release timestep message for timestep 1 from reader cohort 0
Writer 0 (0x100000886c0): Got the lock in release timestep
Writer 0 (0x100000886c0): Doing dereference sent
Writer 0 (0x100000886c0): Reader sent timestep list 0x100000ab150, trying to release 1
Writer 0 (0x100000886c0): Reader considering sent timestep 1,trying to release 1
Writer 0 (0x100000886c0): SubRef : Writer-side Timestep 1 now has reference count 0, expired 0, precious 0
Writer 0 (0x100000886c0): Doing QueueMaint
Writer 0 (0x100000886c0): Reader 0 status Established has last released 1, last sent 1
Writer 0 (0x100000886c0): QueueMaintenance, smallest last released = 1, count = 1
Writer 0 (0x100000886c0): Writer tagging timestep 1 as expired
DP Writer 0 (0x100000886c0): Releasing timestep 1
Writer 0 (0x100000886c0): Removing dead entries
Writer 0 (0x100000886c0): Remove queue Entries removing Timestep 1 (exp 1, Prec 0, Ref 0), Count now 0
Writer 0 (0x100000886c0): QueueMaintenance complete
Writer 0 (0x100000886c0): Releasing the lock in release timestep
Writer 0 (0x100000886c0): Reader Close message received for stream 0x10000091a30.  Setting state to PeerClosed and releasing timesteps.
Writer 0 (0x100000886c0): In PeerFailCloseWSReader, releasing sent timesteps
Writer 0 (0x100000886c0): Dereferencing all timesteps sent to reader 0x10000091a30
Writer 0 (0x100000886c0): DONE DEREFERENCING
Writer 0 (0x100000886c0): Moving Reader stream 0x10000091a30 to status PeerClosed
Writer 0 (0x100000886c0): Reader 0 status PeerClosed has last released 1, last sent 1
Writer 0 (0x100000886c0): QueueMaintenance, smallest last released = LONG_MAX, count = 0
Writer 0 (0x100000886c0): Removing dead entries
Writer 0 (0x100000886c0): QueueMaintenance complete
Writer 0 (0x100000886c0): All timesteps are released in WriterClose
Writer 0 (0x100000886c0): Destroying stream 0x100000886c0, name master_comms
Writer 0 (0x100000886c0): Reference count now zero, Destroying process SST info cache
Writer 0 (0x100000886c0): Freeing LastCallList
Writer 0 (0x7fffffff3840): SstStreamDestroy successful, returning
Writer 0 (0x10000092100): Beginning writer-side reader open protocol
DP Writer 0 (0x10000092100): Received contact info "AAIAAJTJ8o1XuAAAATkCmH4TgAo=", RD_Stream 0x2aabe756be50 for Reader Rank 0
Writer 0 (0x10000092100): Setting SpeculativePreload ON for new reader
Writer 0 (0x10000092100): My oldest timestep was 0, global oldest timestep was 0
Writer 0 (0x10000092100): Finish writer-side reader open protocol for reader 0x10005549cf0, reader ready response pending
Writer 0 (0x10000092100): (PID ea26, TID 2aaaaab5bfc0) Waiting for Reader ready on WSR 0x10005549cf0.
Writer 0 (0x10000092100): Reader Activate message received for Stream 0x10005549cf0.  Setting state to Established.
Writer 0 (0x10000092100): Parent stream reader count is now 1.
Writer 0 (0x10000092100): Reader ready on WSR 0x10005549cf0, Stream established, Starting 0 LastProvided 0.
Writer 0 (0x10000092100): Finish opening Stream "processor_comms1"
Reader 0 (0x1000555d6d0): Sst set to use sockets as a Control Transport
DP Reader 0 (0x1000555d6d0): Considering DataPlane "evpath" for possible use, priority is 1
DP Reader 0 (0x1000555d6d0): Selecting DataPlane "evpath", priority 1 for use
Reader 0 (0x1000555d6d0): Looking for writer contact in file processor_comms1.sst, with timeout 60 secs
Reader 0 (0x1000555d6d0): Waiting for writer response message in SstReadOpen("processor_comms1")
Reader 0 (0x1000555d6d0): finished wait writer response message in read_open
Reader 0 (0x1000555d6d0): Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x1000555d6d0): Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x1000555d6d0): Writer is doing BP-based marshalling
Reader 0 (0x1000555d6d0): Writer is using Minimum Connection Communication pattern (min)
DP Reader 0 (0x1000555d6d0): Received contact info "Writer Rank 0, test contact", WS_stream 0x2aabe757e260 for WSR Rank 0
Reader 0 (0x1000555d6d0): Sending Reader Activate messages to writer
Reader 0 (0x1000555d6d0): Finish opening Stream "processor_comms1", starting with Step number 0
Writer 0 (0x10000092100): Reader 0 status Established has last released 4294967295, last sent 0
Writer 0 (0x10000092100): QueueMaintenance, smallest last released = -1, count = 1
Writer 0 (0x10000092100): Removing dead entries
Writer 0 (0x10000092100): QueueMaintenance complete
Writer 0 (0x10000092100): Sending TimestepMetadata for timestep 0 (ref count 1), one to each reader
Writer 0 (0x10000092100): Sent timestep 0 to reader cohort 0
Writer 0 (0x10000092100): ADDING timestep 0 to sent list for reader cohort 0, READER 0x10005549cf0, reference count is now 2
Writer 0 (0x10000092100): PRELOADMODE for timestep 0 non-default for reader , active at timestep 0, mode 1
DP Writer 0 (0x10000092100): Per reader registration for timestep 0, preload mode 1
DP Writer 0 (0x10000092100): Sending Speculative Preload messages, reader 0x10005549d70, timestep 0
Writer 0 (0x10000091620): Beginning writer-side reader open protocol
DP Writer 0 (0x10000091620): Received contact info "AAIAAJTJ8o1XuAAAATkCmH4TgAo=", RD_Stream 0x2aabe757a830 for Reader Rank 0
Writer 0 (0x10000091620): Setting SpeculativePreload ON for new reader
Writer 0 (0x10000091620): My oldest timestep was 0, global oldest timestep was 0
Writer 0 (0x10000091620): Finish writer-side reader open protocol for reader 0x10005548f70, reader ready response pending
Writer 0 (0x10000091620): (PID ea27, TID 2aaaaab5bfc0) Waiting for Reader ready on WSR 0x10005548f70.
Writer 0 (0x10000092100): Sending a message to reader 0 (0x2aabe75698e0)
Writer 0 (0x10000092100): SubRef : Writer-side Timestep 0 now has reference count 1, expired 0, precious 0
Writer 0 (0x10000092100): Reader 0 status Established has last released 4294967295, last sent 0
Writer 0 (0x10000092100): QueueMaintenance, smallest last released = -1, count = 1
Writer 0 (0x10000092100): Removing dead entries
Writer 0 (0x10000092100): QueueMaintenance complete
Reader 0 (0x1000555d6d0): Wait for next metadata after last timestep -1
Reader 0 (0x1000555d6d0): Waiting for metadata for a Timestep later than TS -1
Reader 0 (0x1000555d6d0): (PID ea26, TID 2aaaaab5bfc0) Stream status is Established
Writer 0 (0x10000091620): Reader Activate message received for Stream 0x10005548f70.  Setting state to Established.
Writer 0 (0x10000091620): Parent stream reader count is now 1.
Writer 0 (0x10000091620): Reader ready on WSR 0x10005548f70, Stream established, Starting 0 LastProvided 0.
Writer 0 (0x10000091620): Finish opening Stream "processor_comms2"
Reader 0 (0x1000555c9c0): Sst set to use sockets as a Control Transport
DP Reader 0 (0x1000555c9c0): Considering DataPlane "evpath" for possible use, priority is 1
DP Reader 0 (0x1000555c9c0): Selecting DataPlane "evpath", priority 1 for use
Reader 0 (0x1000555c9c0): Looking for writer contact in file processor_comms2.sst, with timeout 60 secs
Reader 0 (0x1000555c9c0): Waiting for writer response message in SstReadOpen("processor_comms2")
Reader 0 (0x1000555c9c0): finished wait writer response message in read_open
Reader 0 (0x1000555c9c0): Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x1000555c9c0): Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader 0 (0x1000555c9c0): Writer is doing BP-based marshalling
Reader 0 (0x1000555c9c0): Writer is using Minimum Connection Communication pattern (min)
DP Reader 0 (0x1000555c9c0): Received contact info "Writer Rank 0, test contact", WS_stream 0x2aabe7582310 for WSR Rank 0
Reader 0 (0x1000555c9c0): Sending Reader Activate messages to writer
Reader 0 (0x1000555c9c0): Finish opening Stream "processor_comms2", starting with Step number 0
Writer 0 (0x10000091620): Reader 0 status Established has last released 4294967295, last sent 0
Writer 0 (0x10000091620): QueueMaintenance, smallest last released = -1, count = 1
Writer 0 (0x10000091620): Removing dead entries
Writer 0 (0x10000091620): QueueMaintenance complete
Writer 0 (0x10000091620): Sending TimestepMetadata for timestep 0 (ref count 1), one to each reader
Writer 0 (0x10000091620): Sent timestep 0 to reader cohort 0
Writer 0 (0x10000091620): ADDING timestep 0 to sent list for reader cohort 0, READER 0x10005548f70, reference count is now 2
Writer 0 (0x10000091620): PRELOADMODE for timestep 0 non-default for reader , active at timestep 0, mode 1
DP Writer 0 (0x10000091620): Per reader registration for timestep 0, preload mode 1
DP Writer 0 (0x10000091620): Sending Speculative Preload messages, reader 0x100055490b0, timestep 0
Writer 0 (0x10000091620): Sending a message to reader 0 (0x2aabe7583990)
Writer 0 (0x10000091620): SubRef : Writer-side Timestep 0 now has reference count 1, expired 0, precious 0
Writer 0 (0x10000091620): Reader 0 status Established has last released 4294967295, last sent 0
Writer 0 (0x10000091620): QueueMaintenance, smallest last released = -1, count = 1
Writer 0 (0x10000091620): Removing dead entries
Writer 0 (0x10000091620): QueueMaintenance complete
Reader 0 (0x1000555c9c0): Wait for next metadata after last timestep -1
Reader 0 (0x1000555c9c0): Waiting for metadata for a Timestep later than TS -1
Reader 0 (0x1000555c9c0): (PID ea27, TID 2aaaaab5bfc0) Stream status is Established
Writer 0 (0x10000091620): Beginning writer-side reader open protocol
DP Writer 0 (0x10000091620): Received contact info "AAIAAJTJ8o1XuAAAATkCmH4TgAo=", RD_Stream 0x2aabe75860e0 for Reader Rank 0
Writer 0 (0x10000091620): Setting SpeculativePreload ON for new reader
Writer 0 (0x10000091620): My oldest timestep was 0, global oldest timestep was 0
Writer 0 (0x10000091620): Finish writer-side reader open protocol for reader 0x10005548f80, reader ready response pending
Writer 0 (0x10000091620): (PID ea28, TID 2aaaaab5bfc0) Waiting for Reader ready on WSR 0x10005548f80.
Writer 0 (0x10000091620): Reader Activate message received for Stream 0x10005548f80.  Setting state to Established.
Writer 0 (0x10000091620): Parent stream reader count is now 1.
Writer 0 (0x10000091620): Reader ready on WSR 0x10005548f80, Stream established, Starting 0 LastProvided 0.
Writer 0 (0x10000091620): Finish opening Stream "processor_comms3"
Reader 0 (0x1000555c970): Sst set to use sockets as a Control Transport
DP Reader 0 (0x1000555c970): Considering DataPlane "evpath" for possible use, priority is 1
DP Reader 0 (0x1000555c970): Selecting DataPlane "evpath", priority 1 for use
Reader 0 (0x1000555c970): Looking for writer contact in file processor_comms3.sst, with timeout 60 secs
Reader 0 (0x1000555c970): Waiting for writer response message in SstReadOpen("processor_comms3")
aprun: Apid 18291905: Caught signal Terminated, sending to application
_pmiu_daemon(SIGCHLD): [NID 04952] [c11-1c2s6n0] [Thu Jun 25 14:02:45 2020] PE RANK 0 exit signal Terminated
eisenhauer commented 4 years ago

Hmm. Nothing obvious to me from the log, but this is likely to be a previously unknown race condition and I need to know a bit more. Is this a relatively simple situation, i.e. N opens in sequence in the writer and N opens in sequence in the reader? It looks like N is at least 4, possibly larger? If what you're doing is more complex, I'm happy to look at your source too...

AaronV77 commented 4 years ago

That would explain why it is failing at different locations in the testing rather than in one spot. There are two applications in our current workflow that use the SST engine. One application uses MPI to process data across N processes, and each process has to communicate with the other application. At the beginning of the first application there is SST communication that tells the second application how many processes are working on the current job and all of their MPI ranks (unrelated to the current issue). The other application has to loop through each process (grab, process, and transfer the data) but does not close the stream until all data has been processed. The ADIOS2 stream is opened at the beginning for a single process, and the application that is grabbing the data opens all the streams (one for each process) at once. The error always happens on the initiation of the streams, and I can't tell which application is having the problem first.

eisenhauer commented 4 years ago

Well, I suspect that the best way to make progress on sorting this out would be for me to be able to run the code and try to diagnose first-hand. Is it something you can easily share?

AaronV77 commented 4 years ago

The error is difficult to reproduce because it occurs on our HPC, and access to that requires clearances. One thing I failed to mention is that single runs of the workflow complete correctly; things break only when I run my Python tester script. Could constantly opening and closing streams of the same name cause problems over time? Lastly, let me get with my project manager and see what we could do about having you take a look at the source code.

eisenhauer commented 4 years ago

Ah. I think I didn't look at the logs you sent closely enough... So, this is a kind of work distribution model? And you're using processor_comms1 as a stream name for the work to worker #1, and using the same name for the stream that will carry the completed work back? If so, you might be able to clear this up by simply using separate names for outbound and inbound, e.g. "Processor_outN" and "Processor_inN".
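
As a minimal sketch of that renaming, assuming a hypothetical helper called `stream_names` (it is not part of ADIOS2), each worker could derive one name per direction and never reuse a name:

```python
def stream_names(worker: int) -> tuple:
    """Return (outbound, inbound) stream names for one worker.

    Hypothetical helper; the "Processor_outN"/"Processor_inN" pattern
    is the one suggested above.
    """
    return f"Processor_out{worker}", f"Processor_in{worker}"


# Across all workers and both directions, no stream name is repeated.
all_names = [name for worker in range(1, 5) for name in stream_names(worker)]
assert len(all_names) == len(set(all_names))
```

The outbound name would be the one the work distributor opens in write mode, and the inbound name the one it opens in read mode.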

What I think might be happening, at least based on my understanding at the moment, is that the writer is opening, say, "processor_comms3" in write mode, and then when a reader arrives, immediately trying to open a stream with the same name in read mode. Meanwhile, on the reader side, you're opening in read mode, and then immediately opening in write mode.

When you open an SST stream in write mode, SST creates a .sst file of that name which contains contact information (IP address, port, etc.) for that stream. In read mode, SST looks for the file of that name, waiting some time for it to be created if it doesn't exist yet. The file is removed only upon writer shutdown (to potentially allow new readers throughout the lifetime of the stream).

The way you are reusing the name, there's a race condition. The reader and the writer both operate on the first instance of the file to open the outbound stream, but then when you go to open the same-named stream in the other direction, you've got to hope that the reader side gets to the create-the-contact-file point before the writer gets around to reading it. When things work, the old file gets overwritten and the writer reads the new version. But if he happens to get the old version (i.e. the one he wrote when he created the outbound stream), then he's opening a connection to himself, and that's bad. Not only is it not what you'd have wanted, but we didn't consider a process being a reader on its own stream.
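
The contact-file lifecycle described above can be imitated with a toy model. This is only a sketch, not ADIOS2 code: the functions `open_write`, `open_read`, and `close_write`, and the contact string `host-a:5000`, are made up for illustration.

```python
import tempfile
import time
from pathlib import Path


def open_write(directory: Path, stream: str, contact: str) -> Path:
    """Toy write-mode open: publish contact info in <stream>.sst."""
    path = directory / f"{stream}.sst"
    path.write_text(contact)  # replaces any existing file of that name
    return path


def open_read(directory: Path, stream: str, timeout: float = 1.0) -> str:
    """Toy read-mode open: poll for <stream>.sst until it exists or time runs out."""
    path = directory / f"{stream}.sst"
    deadline = time.monotonic() + timeout
    while not path.exists():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no contact file for {stream!r}")
        time.sleep(0.01)
    return path.read_text()


def close_write(path: Path) -> None:
    """Toy writer shutdown: only now is the contact file removed."""
    path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    handle = open_write(workdir, "processor_comms3", "host-a:5000")
    # A read-mode open of the same name by the SAME process succeeds at once,
    # because the file it just wrote is still there: it finds itself.
    self_contact = open_read(workdir, "processor_comms3")
    close_write(handle)
```

In this model nothing stops the process that created `processor_comms3.sst` from reading its own contact information back, which is the self-connection case.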

Please let me know if all this makes sense. Or if you think I'm still misunderstanding what you're doing somehow...

AaronV77 commented 4 years ago

Sorry for not getting back to you sooner, but here is a little more information about how I am setting up ADIOS2 and how I thought the ADIOS2 SST engine worked.

To Setup ADIOS2 (Writer):

  1. You create an ADIOS2 object that just takes an MPI Communicator (depends if you are using MPI but in our case I am).
  2. Declare IO under a given ADIOS2 object (but since I have a process reading and writing, declare two IO objects w/ two different names. One is a reader and the other is a writer).
  3. Set both IO objects to the sst engine.
  4. Open a writer object using the first IO object, stream name, and write flag.
  5. Open a reader object using the second IO object, stream name, and read flag.
  6. Declare ADIOS2 variable for sending the same type of data back and forth.

If I understand your initial statement correctly, I should have two communication names (stream names) with one IO object apiece? The other application that this application communicates with does the same thing, but opens the reader first and then the writer (because the writing application opens the writer first and then the reader). One thing I just noticed, but can't really explain, is that the two IO objects on the writing application are named 1processor-N (writer) and 2processor-N (reader), and on the reading application I use 3processor-N (reader) and 4processor-N (writer). Does this make sense? I can't remember why I named the IO objects like that; maybe because they had to have distinct names on a given stream. I feel like I am doing this wrong.

eisenhauer commented 4 years ago

Let's put aside the IO objects for a moment. There may be complexities there, but they should be orthogonal to this race condition. I think the misunderstanding is that SST streams are unidirectional, existing to convey data from a single writer to zero or more readers (pub/sub in that sense). Opening a stream in write mode creates a stream with that name, like the BP engine creates a file when you open in write mode. In fact, SST does create a file when you open a stream in write mode, but that file just contains contact information for the writer and is not used by the reader after initialization.

When you do:

Writer

  1. Open a writer object using any IO object, stream name, and write flag.
  2. Open a reader object using any IO object, stream name, and read flag.

and then Reader

  1. Open a reader object using any IO object, stream name, and read flag.
  2. Open a writer object using any IO object, stream name, and write flag.

You're really creating two streams with the same name. The reader action 2 overwrites the stream contact information written in writer action 1. If that overwriting happens before writer action 2 (reading the stream contact info in name.sst), then you get what you want, a stream in each direction. But if the reader is delayed between actions 1 and 2, or if the filesystem is a bit slow, then action 2 by the writer will read the contact information for the unidirectional stream it created in action 1, which is not what you want (and which likely deadlocks). So, use a different name for the streams in each direction and this bit at least should be OK...
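
The two orderings can be walked through with a toy model. This is a sketch under stated assumptions rather than a description of SST internals: a plain dict stands in for the `.sst` contact files, the function name is invented, and `processor_in3` is a hypothetical name for the return stream.

```python
def contact_seen_by_writer_app(shared_name: bool, writer_reads_first: bool) -> str:
    """Return whose contact info the writer app gets on its read-mode open."""
    out_name = "processor_comms3"
    back_name = out_name if shared_name else "processor_in3"  # hypothetical name
    contact_files = {}  # stand-in for the *.sst files on disk

    contact_files[out_name] = "writer app"  # writer action 1: open in write mode
    assert out_name in contact_files        # reader action 1: finds that file

    def reader_action_2() -> None:
        # The reader app opens the return stream in write mode.
        contact_files[back_name] = "reader app"

    if writer_reads_first and back_name in contact_files:
        # Writer action 2 finds a file already present, so it does not wait.
        seen = contact_files[back_name]
        reader_action_2()
        return seen

    # Either the reader app wrote first, or the writer app had to wait
    # for the file to appear.
    reader_action_2()
    return contact_files[back_name]


for shared in (True, False):
    for early in (True, False):
        print(shared, early, contact_seen_by_writer_app(shared, early))
```

In this model the only combination that returns the writer app's own contact information is a shared name with the writer reading first; with distinct names, both orderings end with the writer app seeing the reader app.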

AaronV77 commented 4 years ago

So what is the point of the IO names then? Let me switch things around and see what happens.

pnorbert commented 4 years ago

IO names are nothing but identifiers of the IO group objects. The name can be used in an xml/yaml config file to parameterize the engine/transport before Open() happens in the code. Other than that, it has no use in the streams/files themselves.

It is the name passed at Open() that really names the individual stream, between 1 writer (a parallel app) and multiple readers (each a parallel app). So as Greg explained, you cannot use the same (file) name for a "bidirectional" stream between two applications. You create two one-directional streams with different (file) names.

On Tue, Jun 30, 2020 at 12:54 PM Aaron V. notifications@github.com wrote:

So what is the point of the IO names then? Let me switch things around and see what happens.

AaronV77 commented 4 years ago

@eisenhauer and @bradking thank you for your help with this issue and with updating my linking to the newer version of ADIOS2. Everything seems to be working correctly now, with no hang-ups.

eisenhauer commented 4 years ago

Glad to hear it!