sstsimulator / sst-macro

SST Macro Element Library
http://sst-simulator.org/

Segmentation fault when running test with star topology #617

Closed afranques closed 3 years ago

afranques commented 3 years ago

Hello, I am trying to use the star topology, but I haven't managed to do so successfully. I started by grepping for .ini files that contain topology.name = star, and found two examples that use this topology:

  1. The first example was test_ping_all_star.ini, but it didn't run out of the box: it complained with error: sim_parameters: could not find parameter name in namespace switch. After I added switch.name = pisces it still complained, so I had to add switch.logp and node.proc as well. Once all parameters were accepted, the simulation failed with the error shown below.
  2. The second example was opa24_amm3.ini. Like the first, it required adding switch.logp and node.proc, and once the configuration was accepted it failed with the same error.
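
The search step described above can be sketched in Python for anyone who wants to repeat it in a script. This is only an illustration: `find_star_inputs` and `STAR_DOTTED` are hypothetical names, not part of sst-macro, and the pattern matches only the dotted form `topology.name = star`, so files that set the topology inside a nested `topology { ... }` block would be missed.

```python
import re
from pathlib import Path

# Matches lines such as "topology.name = star" (dotted form only).
STAR_DOTTED = re.compile(r"^\s*topology\.name\s*=\s*star\b", re.MULTILINE)


def find_star_inputs(root):
    """Return the sorted .ini files under root that set topology.name = star."""
    hits = []
    for path in Path(root).rglob("*.ini"):
        # Ignore undecodable bytes so one odd file does not abort the scan.
        if STAR_DOTTED.search(path.read_text(errors="ignore")):
            hits.append(path)
    return sorted(hits)
```

Calling `find_star_inputs(...)` with the root of an sst-macro checkout returns the matching paths as `pathlib.Path` objects.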

I'm running with the integrated sst-core with: pysstmac -f xxx.ini (where xxx.ini is either test_ping_all_star.ini or opa24_amm3.ini)

The error that both .ini files produce is:

 *** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: (nil)
[ 0] /lib64/libpthread.so.0(+0xf630)[0x7fb8506e5630]
[ 1] .../local/sstmacro-11.0.0/lib/libmacro.so.11(_ZN6sstmac2hw4Node5setupEv+0xd [0x7fb838a0792d]
[ 2] .../local/sstcore-10.0.0/bin/sst(_ZN3SST10Simulation5setupEv+0x36)[0x4f9f86]
[ 3] .../local/sstcore-10.0.0/bin/sst[0x4a9c02]
[ 4] .../local/sstcore-10.0.0/bin/sst(main+0xeca)[0x49a1ea]
[ 5] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb84eb8b555]
[ 6] .../local/sstcore-10.0.0/bin/sst[0x4a94ec]
*** End of error message ***
.../local/sstmacro-11.0.0/bin/pysstmac: line 7: 1018275 Segmentation fault
.../local/sstcore-10.0.0/bin/sst    .../local/sstmacro-11.0.0/include/python/default.py    --model-options="$options"

Can anyone verify whether either of these two configuration files also segfaults on your machine, please?

For convenience, I have written a configuration (4 nodes with 2 cores each, all connected to a central switch in a star topology, running the mpi_ping_all test with 8 ranks) that contains only the essential parameters and still triggers the same segmentation fault as the two examples above:

node {
    name = simple

    app1 {
        name = mpi_ping_all
        launch_cmd = aprun -n 8 -N 2
    }
    proc {
        frequency = 2Ghz
        ncores = 2
    }
    nic {
        injection {
            bandwidth = 12GB/s
            latency = 0.6us
        }
    }
}

topology {
    name = star
    concentration = 4
}

switch {
    name = pisces
    arbitrator = simple
    mtu = 4096

    xbar {
        bandwidth = 1.2TB/s
        latency = 100ns
    }
    link {
        bandwidth = 12.5GB/s
        latency = 100ns
    }
    router {
        name = star_minimal
    }
    logp {
        bandwidth = 12.5GB/s
        out_in_latency = 15ns
        hop_latency = 15ns
    }
}
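
As a quick sanity check on the sizes in this configuration, the launch line and the topology can be compared with a few lines of Python. This is a sketch under one assumption: that `concentration` is the number of nodes attached to the single star switch, which is how the setup is described above. For aprun, `-n` is the total rank count and `-N` is the number of ranks per node.

```python
import math

ranks = 8            # from launch_cmd: aprun -n 8
ranks_per_node = 2   # from launch_cmd: aprun -N 2
concentration = 4    # from topology.concentration (assumed: nodes on the switch)

# Number of nodes the job occupies when each node hosts ranks_per_node ranks.
nodes_needed = math.ceil(ranks / ranks_per_node)
fits = nodes_needed <= concentration
print(nodes_needed, fits)
```

Under that assumption the job needs exactly the 4 nodes that the topology provides, so the job size itself does not look like the source of the crash.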

Any help will be much appreciated!

Thanks, Antonio

jpkenny commented 3 years ago

Hi Antonio,

Unfortunately, macro is much better at parameter checking, and at avoiding segfaults caused by model configuration issues, when running without sst-core. I had better luck using stand-alone sst-macro: I was able to get your input to run by adding a number of missing parameters, and once it ran in the stand-alone build it also worked through pysstmac (sst-core):

node {
    name = simple
    app1 {
        name = mpi_ping_all
        launch_cmd = aprun -n 8 -N 2
    }
    proc {
        frequency = 2Ghz
        ncores = 2
    }
    nic {
        name = pisces
        injection {
            arbitrator = cut_through
            bandwidth = 12GB/s
            latency = 0.6us
            credits = 8KB
            mtu = 4096
        }
    }
    memory {
        name = pisces
        total_bandwidth = 100GB/s
        latency = 10ns
    }
}

topology {
    name = star
    concentration = 4
}

switch {
    name = pisces
    arbitrator = simple
    mtu = 4096

    xbar {
        bandwidth = 1.2TB/s
        latency = 100ns
    }
    link {
        bandwidth = 12.5GB/s
        latency = 100ns
    }
    router {
        name = star_minimal
    }
    logp {
        bandwidth = 12.5GB/s
        out_in_latency = 15ns
        hop_latency = 15ns
    }
}
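
To make the added parameters easier to spot, here is a small Python sketch that flattens the brace-nested input into dotted keys and lists what the working configuration has that the original lacks. `flatten` is a hypothetical helper written for this comparison, not an sst-macro API, and treating `#` as a comment marker is an assumption. Only the `nic` and `memory` parts of the `node` block are included, since every other line is identical in the two inputs.

```python
def flatten(text):
    """Flatten a brace-nested parameter file into {dotted.key: value}."""
    params = {}
    scope = []  # stack of enclosing namespace names
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        if line.endswith("{"):
            scope.append(line[:-1].strip())   # entering a namespace
        elif line == "}":
            scope.pop()                       # leaving a namespace
        elif "=" in line:
            key, value = line.split("=", 1)
            params[".".join(scope + [key.strip()])] = value.strip()
    return params


ORIGINAL = """
node {
    nic {
        injection {
            bandwidth = 12GB/s
            latency = 0.6us
        }
    }
}
"""

FIXED = """
node {
    nic {
        name = pisces
        injection {
            arbitrator = cut_through
            bandwidth = 12GB/s
            latency = 0.6us
            credits = 8KB
            mtu = 4096
        }
    }
    memory {
        name = pisces
        total_bandwidth = 100GB/s
        latency = 10ns
    }
}
"""

before = flatten(ORIGINAL)
after = flatten(FIXED)
added = sorted(set(after) - set(before))
for key in added:
    print(f"{key} = {after[key]}")
```

The keys it reports are `node.nic.name`, the `arbitrator`, `credits`, and `mtu` entries under `node.nic.injection`, and the three entries of the new `node.memory` block.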

I'd like to get away from (poorly) supporting different modes of operation, but that's going to be a long-term effort.

I also noticed that the star inputs were commented out of the test suite at some point, so they aren't currently being tested. Sorry about that, I don't know why that was done.

--Joe

afranques commented 3 years ago

Wow! I was definitely far from fixing it myself then, because the parameters were the last place I was looking, haha

Lifesaver, @jpkenny! Thank you very much for such a prompt and efficient response; your configuration now works on my end as well! Next time I run into a similar issue I will start by trying the standalone mode first ;-)

-Antonio