sstsimulator / sst-core

SST Structural Simulation Toolkit Parallel Discrete Event Core and Services
http://www.sst-simulator.org
Other
125 stars 88 forks source link

--enable-perf-tracking --enable-profile crashes #808

Open pdbj opened 2 years ago

pdbj commented 2 years ago

[Edit to fix phold make command]

New Issue for sst-core

1 - Detailed description of problem or enhancement With --enable-perf-tracking --enable-profile during configure several of my models crash with seg faults. I have not been able to reproduce with the car wash example in simpleSimulation. I can reproduce with a simple PHOLD model (see below).

2 - Describe how to reproduce the issue Use my PHOLD model

  1. First, configure, build and install sst-core: 1.a Patch upstream 83ddb8ca with these commits

(Or 1.a' Use my wip branch)

1.b Configure, build and install sst-elements (so we can get the library directory by sst-config SST_ELEMENT_LIBRARY SST_ELEMENT_LIBRARY_LIBDIR)

  1. Get and build my PHOLD:
    
    $ git clone https://github.com/pdbj/sst-phold
    $ cd sst-phold
    $ make debug install

Check that it installed correctly:

$ sst-info phold PROCESSED 1 .so (SST ELEMENT) FILES FOUND IN DIRECTORY(s) /Users/barnes26/Code/SST/build/profile/lib/sstcore:/usr/local/lib/sst-elements-library Filtering output on Element = "phold"

ELEMENT 0 = phold () Num Components = 2 Component 0: Phold ...


3. Run with high verbosity:

$ sst tests/phold.py -- -vvv

phold.py : Creating PHOLD Benchmark phold.py : Importing SST module phold.py : Creating 2 LPs .. phold.py : Creating complete graph with latency 1 s (1 total) . phold.py : Enabling statistics at level 1 phold.py : Done

0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:109)] -> [2] Full c'tor() @0x12479c0, id: 0, name: phold_0 0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:271)] -> [2] 0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:278)] -> [3] TIMEFACTOR: 0.000001, timeConverter factor: 1000000, period: 1 us (0.000001 s?) 0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:290)] -> [3] min: 1 s, duty: 100 m, df: 0.100000 0:[0:0]:Phold-phold_0 [ShowConfiguration() (Phold.cc:296)] -> [3] m_ev: 1, ev_win: 0.100000, min_ev_win: 10.000000, min_ev: 100 0:[0:0]:Phold-phold_0 [ShowConfiguration()] -> PHOLD Configuration: Remote LP fraction: 0.9 Minimum inter-event delay: 1 s Additional exponential average delay: 9 s Stop time: 10 s Number of LPs: 2 Number of initial events per LP: 1 Average events per window: 0.1 (Too low! Suggest setting '--events=100') Expected total number of events: 2 Output delay histogram: no Sampling: rng Optimization level: debug Verbosity level: 3

0:[0:0]:Phold-phold_0 [ShowConfiguration()] -> SST Configuration: Rank, thread: 0, 0 Total ranks, threads: 1, 1 Run mode: BOTH

0:[0:0]:Phold-phold_0 [ShowSizes() (Phold.cc:360)] -> [2] 0:[0:0]:Phold-phold_0 [ShowSizes()] -> Sizes of objects: Phold: 432 (class instance)

Plus heap allocated:
SST::RNG::MersenneRNG:                             24 (m_rng)
SST::RNG::MarsagliaRNG:                            16 (m_remRng)
SST::RNG::SSTUniformDistribution:                  32 (m_nodeRng)
SST::RNG::SSTExponentialDistribution:              32 (m_delayRNg)
SST::Statistics::AccumulatorStatistic<uint64_t>:  384 (m_sendCount)
SST::Statistics::AccumulatorStatistic<uint64_t>:  384 (m_recvCount)
SST::Statistics::HistogramStatistic<uint64_t>:    464 (m_delays)
  (Bins are stored in a map, so additional 3 * 8 bytes per bin.)
Subtotal heap allocated: :                        1336
SST::Link:                                         64 (N * (N - 1) links total)

Other components:
SST::UnitAlgebra:                                  96 (statics TIMEBASE, m_average)
SST::TimeConverter:                                 8 (static m_timeConverter)
SST::Output:                                      144 (m_output, included in Phold)
SST::Core::ThreadSafe::Barrier:                    64 (many instances in Simulator_impl)
std::atomic<bool>:                                  1 (used by Barrier)
std::atomic<std::size_t>:                           8 (used by Barrier)
std::string:                                       32 (VERBOSE_PREFIX, included in Phold)

0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:135)] -> [3] Initializing RNGs 0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:152)] -> [3] Configuring links: 0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:184)] -> [3] Initializing statistics 0:[0:0]:Phold-phold_0 [Phold() (Phold.cc:188)] -> [3] Setting stopat to 10 s 0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:109)] -> [2] Full c'tor() @0x12e6c50, id: 1, name: phold_1 0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:135)] -> [3] Initializing RNGs 0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:152)] -> [3] Configuring links: 0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:184)] -> [3] Initializing statistics 0:[0:0]:Phold-phold_1 [Phold() (Phold.cc:188)] -> [3] Setting stopat to 10 s 0:[0:0]:Phold-phold_1 [init() (Phold.cc:596)] -> [2] depth: 1, phase: 0, begin: 0, end: 1 0:[0:0]:Phold-phold_1 [init() (Phold.cc:601)] -> [3] checking for early events 0:[0:0]:Phold-phold_1 [checkForEvents() (Phold.cc:554)] -> [3] checking link 0 0:[0:0]:Phold-phold_1 [getEvent() (Phold.cc:541)] -> [3] getting event from link 0 0:[0:0]:Phold-phold_1 [getEvent() (Phold.cc:543)] -> [3] got (nil) 0:[0:0]:Phold-phold_1[quartz770:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)


The segfault occurs in `SST::Link::recvUntimedData()`:

Backtrace: sst(SST::Link::recvUntimedData()+0x48)[0x8d5cd4] /g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(Phold::InitEvent* Phold::Phold::getEvent(unsigned long)+0x62)[0x2aaab7e4154e] /g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(void Phold::Phold::checkForEvents(std::__cxx11::basic_string<char, std::char_traits, std::allocator >)+0xc4)[0x2aaab7e41680] /g/g17/barnes/Code/SST/profile/default/lib/sst-elements-library/libphold.so(Phold::Phold::init(unsigned int)+0x22b)[0x2aaab7e3913f] sst(SST::Simulation_impl::initialize()+0x110)[0x8e7dce] sst[0x825cd8] sst(main+0x1ab7)[0x827964] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaad18c555] sst[0x8245b9]


But I don't think that's the source of the problem:  if I comment out all `delete event` in my code, it progresses further, but still eventually seg faults.

**3 - Diagnosis**

I think the problem arises when the normal event lifetime pattern encounters event profiling.

Here is the "normal" pattern:

// From model code: create and send an event: auto event = new Event(...); link->send(delay, event);

// In simulator.cc:Simulation_impl::run():685 while ( LIKELY(!endSim) ) { current_activity = timeVortex->pop(); currentSimCycle = current_activity->getDeliveryTime(); currentPriority = current_activity->getPriority(); current_activity->execute();

// In event.cc:Event::execute():59: (*functor) (this); // where functor is the event handler

// Finally, back in model code, the event handler void ...::handleEvent(SST::Event ev) { auto event = dynamic_cast<...>(ev); // Extract data from the event

// Done with the event, delete it, to match the new in the sending code delete event; }


Before profiling was introduced (or without it enabled) deleting the Event in the handler was harmless, if a little odd.  It amounted to `delete this;`.  See [this isocpp FAQ](https://isocpp.org/wiki/faq/freestore-mgmt#delete-this)

But with profiling enabled, `Event::execute()` continues after calling the handler:

void Event::execute(void) {

if SST_EVENT_PROFILING

SST_EVENT_PROFILE_START

endif

    (*functor)
(this);

if SST_EVENT_PROFILING

Simulation_impl* sim = Simulation_impl::getSimulation();

SST_EVENT_PROFILE_STOP

// Track sending and receiving counters
auto eventCount = sim->eventRecvCounters.find(getLastComponentName());

...



The killer is `getLastComponentName()`, a member function of `Event`, which is bad enough, but it goes on to return a member variable, `last_comp`.  Boom!

**4 - Suggested fix**

Suggest pulling all required member variables in to local variables _before_ calling the functor.  It appears this could be limited to the return values of `getFirstComponentName()` and `getLastComponentName()`, but really all the lookups and registrations could be done in the preamble `SST_EVENT_PROFILE_START`

I haven't looked at the other places affected by profiling to see if there are similar issues.

**5 - What Operating system(s) and versions** 
Two RHL variants, Mac

**6 - What version of external libraries (Boost, MPI)**
Not relevant, running sequentially.

**7 - Provide sha1 of all relevant sst repositories (sst-core, sst-elements, etc)**
sst-core: 83ddb8ca (with my patches) based on upstream df2df5f9
pdbj commented 2 years ago

I've found a fix, see my wip branch at 9ba01bf.

This is sufficient, but it might not be minimal; at least it bounds the solution space.

pdbj commented 2 years ago

Revised Fix crash due to reference deleted event state

jleidel commented 2 years ago

@pdbj, I believe this patch is somewhat stale as well. It fails on src/sst/core/event.cc

pdbj commented 2 years ago

Rebased Fix crash due to reference deleted event state

jleidel commented 2 years ago

@pdbj, the patch code in event.cc is still too far behind devel to merge. The upstream devel branch has diverged significantly. You might want to check the changes from https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43

pdbj commented 2 years ago

The commit https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43 appears to be to master, not devel. In any case that commit makes trivial changes (Simulation::getSimulation --> Simulation_impl::getSimulation). My edits to event.cc also use the latter, and don't touch the two places https://github.com/sstsimulator/sst-core/commit/9b90e81834a123f13c097940aafece479e1c1c43 does, so there shouldn't be a merge conflict.

jleidel commented 2 years ago

@pdbj, its definitely a merge conflict. If you look at the current event.cc code on the devel branch, the references to SST_EVENT_PROFILE_REG do not exist. As a result, the standard git patch application functions fail to successfully patch/merge the code.

pdbj commented 2 years ago

Apologies, you actually need two commits:

Pre-register event profiling

then

Fix crash due to reference deleted event state

If I squash these now I'll have to update the rebase links in all the other issues, so I'd rather not...

jleidel commented 2 years ago

@pdbj, that makes more sense. I'll try to pull both patches and incrementally merge them into the current devel tree.

jleidel commented 2 years ago

@pdbj, one other thing to note when submitting patches/PRs. Make sure and run the scripts/clang-format-test.sh script with clang-format-12 prior to committing changes. The PRs will fail the frontend tests if the code is not formatted using the embedded clang format template.

jleidel commented 2 years ago

merged into devel