parsa-epfl / qflex

Quick & Flexible Rack-Scale Computer Architecture Simulator
http://qflex.epfl.ch/

Segmentation fault when running Qflex with timing mode #26

Closed YinghuiShao closed 3 years ago

YinghuiShao commented 3 years ago

When I ran QFlex in timing mode, a segmentation fault occurred. I tried debugging with GDB, but that did not help. How can I solve this problem? The run log is below.

//   QFlex simulator - Built as KnottyKraken v1.0

5 <startup.cpp:236> {0}- Initializing Flexus.
6 <startup.cpp:238> {0}- Compiled with Boost: 1.70.0
7 <startup.cpp:110> {0}- Instantiating Flexus components with SystemWidth = 1
8 <ComponentManager.cpp:85> {0}- Instantiating system with a width factor of: 1
9 <uFetch.hpp:99> {0}- ufetch port InstructionFetchSeen is not wired
10 <uFetch.hpp:99> {0}- ufetch port ClockTickSeen is not wired
11 <armDecoder.hpp:76> {0}- decoder port DispatchedInstructionOut is not wired
12 <uArchARM.hpp:140> {0}- uarcharm port StoreForwardingHitSeen is not wired
13 <Cache.hpp:102> {0}- L1d port FrontSideOut_I is not wired
14 <Cache.hpp:102> {0}- L1d port BackSideOut_Prefetch is not wired
15 <breakpoint_tracker.cpp:504> {0}- Successfully registered RegressionTrackerMagicBreakpoint with QEMU, cpu_id = 0, struct id = 0
16 <wiring.cpp:100> {0}-  initializing Parameters...
17 <flexus.cpp:421> {0}- Set stat interval to : 100000
18 <flexus.cpp:441> {0}- Set profile interval to : 10000000
19 <flexus.cpp:446> {0}- Set timestamp interval to : 50000
Formatting '/home/s00523304/qflex/qflex/images/ubuntu16/ubuntu.qcow2-1AEFE10D-i1-tmp', fmt=qcow2 size=21474836480 backing_file=/home/s00523304/qflex/qflex/images/ubuntu16/ubuntu.qcow2 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
WARNING: There is no parameter named "-bpwarm:cores"
WARNING: There is no parameter named "-feeder:stick"
WARNING: There is no parameter named "-feeder:housekeeping_period"
WARNING: There is no parameter named "-feeder:ifetch"
WARNING: There is no parameter named "-feeder:CMPwidth"
WARNING: There is no parameter named "-feeder:send_non_allocating_stores"
WARNING: There is no parameter named "-L1d:mt_width"
WARNING: There is no parameter named "-L1d:size"
WARNING: There is no parameter named "-L1d:assoc"
WARNING: There is no parameter named "-L1d:clean_evict"
20 <configuration.hpp:210> {0}- Bad Lexical Cast attempting to set dynamic parameter.
WARNING: Unable to set parameter CacheLevel to eL1
WARNING: There is no parameter named "-L1d:notify_reads"
WARNING: There is no parameter named "-L1d:notify_writes"
WARNING: There is no parameter named "-L1d:trace_tracker_on"
WARNING: There is no parameter named "-L1d:rsize"
WARNING: There is no parameter named "-L1d:rt_assoc"
WARNING: There is no parameter named "-L1d:rt_size"
WARNING: There is no parameter named "-L1d:rt_repl"
WARNING: There is no parameter named "-L1d:erb_size"
WARNING: There is no parameter named "-L1d:std_array"
WARNING: There is no parameter named "-L1d:block_scout"
WARNING: There is no parameter named "-L1d:skew_block_set"
WARNING: There is no parameter named "-L1d:protocol"
WARNING: There is no parameter named "-L1d:using_traces"
WARNING: There is no parameter named "-L1d:downgrade_lru"
WARNING: There is no parameter named "-L1d:snoop_lru"
WARNING: There is no parameter named "-L1i:mt_width"
WARNING: There is no parameter named "-L1i:size"
WARNING: There is no parameter named "-L1i:assoc"
WARNING: There is no parameter named "-L1i:bsize"
WARNING: There is no parameter named "-L1i:clean_evict"
WARNING: There is no parameter named "-L1i:level"
WARNING: There is no parameter named "-L1i:notify_reads"
WARNING: There is no parameter named "-L1i:notify_writes"
WARNING: There is no parameter named "-L1i:trace_tracker_on"
WARNING: There is no parameter named "-L1i:rsize"
WARNING: There is no parameter named "-L1i:rt_assoc"
WARNING: There is no parameter named "-L1i:rt_size"
WARNING: There is no parameter named "-L1i:rt_repl"
WARNING: There is no parameter named "-L1i:erb_size"
WARNING: There is no parameter named "-L1i:std_array"
WARNING: There is no parameter named "-L1i:block_scout"
WARNING: There is no parameter named "-L1i:skew_block_set"
WARNING: There is no parameter named "-L1i:protocol"
WARNING: There is no parameter named "-L1i:using_traces"
WARNING: There is no parameter named "-L1i:text_flexpoints"
WARNING: There is no parameter named "-L1i:gzip_flexpoints"
WARNING: There is no parameter named "-L1i:downgrade_lru"
WARNING: There is no parameter named "-L1i:snoop_lru"
WARNING: There is no parameter named "-L2:CMPWidth"
WARNING: There is no parameter named "-L2:size"
WARNING: There is no parameter named "-L2:assoc"
WARNING: There is no parameter named "-L2:clean_evict"
21 <configuration.hpp:210> {0}- Bad Lexical Cast attempting to set dynamic parameter.
WARNING: Unable to set parameter CacheLevel to eL2
WARNING: There is no parameter named "-L2:trace_tracker_on"
WARNING: There is no parameter named "-L2:repl"
WARNING: There is no parameter named "-L2:rsize"
WARNING: There is no parameter named "-L2:rt_assoc"
WARNING: There is no parameter named "-L2:rt_size"
WARNING: There is no parameter named "-L2:erb_size"
WARNING: There is no parameter named "-L2:std_array"
WARNING: There is no parameter named "-L2:directory_type"
WARNING: There is no parameter named "-L2:protocol"
WARNING: There is no parameter named "-L2:always_multicast"
WARNING: There is no parameter named "-L2:seperate_id"
WARNING: There is no parameter named "-L2:coherence_unit"
22 <configuration.hpp:210> {0}- Bad Lexical Cast attempting to set dynamic parameter.
WARNING: Unable to set parameter CacheLevel to eL1
23 <configuration.hpp:210> {0}- Bad Lexical Cast attempting to set dynamic parameter.
WARNING: Unable to set parameter CacheLevel to eL2
WARNING: There is no parameter named "-memory:device-file"
WARNING: There is no parameter named "-memory:memory-system-file"
WARNING: There is no parameter named "-memory:interleaving"
WARNING: There is no parameter named "-memory:frequency"
WARNING: There is no parameter named "-memory:dyn_size"
WARNING: There is no parameter named "-memory:size"
WARNING: There is no parameter named "-memory:max_replies"
WARNING: There is no parameter named "-memory:InterconnectDelay"
WARNING: There is no parameter named "-L1d:size"
WARNING: There is no parameter named "-L2:size"
WARNING: There is no parameter named "-L2:assoc"
WARNING: There is no parameter named "-L2:CMPWidth"
WARNING: There is no parameter named "-feeder:CMPwidth"
24 <ComponentManager.cpp:100> {0}- Initializing 16 components...
25 <ComponentManager.cpp:105> {0}- Component 1: Initializing sys-fag
26 <ComponentManager.cpp:105> {0}- Component 2: Initializing sys-ufetch
27 <ComponentManager.cpp:105> {0}- Component 3: Initializing sys-combiner
28 <ComponentManager.cpp:105> {0}- Component 4: Initializing sys-decoder
29 <ComponentManager.cpp:105> {0}- Component 5: Initializing sys-uarcharm
30 <microArch.cpp:151> {0}- sys-uarcharm connected to cpu0
31 <ComponentManager.cpp:105> {0}- Component 6: Initializing sys-L1d
32 <ComponentManager.cpp:105> {0}- Component 7: Initializing sys-mmu
33 <ComponentManager.cpp:105> {0}- Component 8: Initializing sys-L2
34 <CMPCacheImpl.cpp:96> {0}- GroupInterleaving = 4096
35 <NonInclusiveMESIPolicy.cpp:109> {0}- GI = 4096
36 <NonInclusiveMESIPolicy.cpp:88> {0}- GI = 4096
37 <StdArray.hpp:586> {0}- theGroupInterleaving = 4096
38 <StdArray.hpp:695> {0}- blockOffsetBits = 6, indexBits = 11, bankBits = 0, bankInterleavingBits = 6, groupBits = 0, groupInterleavingBits = 12, lowBits = 0, midBits = 6, highBits = 5, setLowMask = 0, setMidMask = 3f, setHighMask = 7c0, setLowShift = 6, setMidShift = 6, setHighShift = 6, theBankMask = 0, theBankShift = 6, theGroupMask = 0, theGroupShift = 12
39 <AbstractCacheController.hpp:77> {0}- sys-L2: created AbstractCacheController 'sys-L2'
40 <ComponentManager.cpp:105> {0}- Component 9: Initializing sys-memory
41 <ComponentManager.cpp:105> {0}- Component 10: Initializing 00-nic
42 <ComponentManager.cpp:105> {0}- Component 11: Initializing 01-nic
43 <ComponentManager.cpp:105> {0}- Component 12: Initializing 02-nic
44 <ComponentManager.cpp:105> {0}- Component 13: Initializing sys-network
Attaching node 0 to switch 0:0
Attaching node 1 to switch 0:1
Attaching node 2 to switch 0:2
WARNING: switch 0 port 3 left unused (may be safe)
WARNING: switch 0 port 4 left unused (may be safe)
WARNING: switch 0 port 5 left unused (may be safe)
WARNING: switch 0 port 6 left unused (may be safe)
45 <ComponentManager.cpp:105> {0}- Component 14: Initializing sys-memory-map
46 <ComponentManager.cpp:105> {0}- Component 15: Initializing sys-magic-break
47 <ComponentManager.cpp:105> {0}- Component 16: Initializing sys-net-mapper
48 <SplitDestinationMapperImpl.cpp:145> {0}- Creating SplitDestinationMapper with 1 cores, 1 directories, and 1 memory controllers.
49 <ValueTracker.hpp:240> {233}- ALEX -- WARNING: DMA tracker has not been set up (Needs to be fixed)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff3dbc241 in __pth_scheduler () from /root/lib/libpth.so.20
(gdb) bt
#0  0x00007ffff3dbc241 in __pth_scheduler () at /root/lib/libpth.so.20
#1  0x00007ffff3dbda80 in pth_spawn_trampoline () at /root/lib/libpth.so.20
#2  0x00007ffff279b7b0 in __start_context () at /lib/x86_64-linux-gnu/libc.so.6
#3  0x0000000000000000 in  ()

I then compiled pth in debug mode; the resulting output is below.

================== THREAD CONTEXT SWITCH ===========================================
23283:pth_sched.c:0320: Finished switch back to pth_sched stack 0x555556755350, size 65536, FROM stack 0x0, size 0
23283:pth_sched.c:0325: pth_scheduler: cameback from thread 0x5555568fdda0 ("unknown")
23283:pth_sched.c:0334: pth_scheduler: thread "unknown" ran 0.160900

Program received signal SIGSEGV, Segmentation fault.
__pth_scheduler (dummy=<optimized out>) at pth_sched.c:370
370                 if (*pth_current->stackguard != 0xDEAD) {
(gdb) bt
#0  0x00007ffff3db7937 in __pth_scheduler (dummy=<optimized out>) at pth_sched.c:370
#1  0x00007ffff3db99d0 in pth_spawn_trampoline () at pth_lib.c:271
#2  0x00007ffff27967b0 in __start_context () at /lib/x86_64-linux-gnu/libc.so.6
#3  0x0000000000000000 in  ()
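
For context, the faulting line (pth_sched.c:370) compares a guard word belonging to the current thread's stack against the magic value 0xDEAD. The following is a minimal C sketch of that idea, not pth's actual code: the `toy_thread` struct and the `toy_*` functions are hypothetical names, and the guard is assumed to sit at the lowest address of a downward-growing stack. The NULL test in `toy_guard_check` marks the spot where an unguarded dereference would itself fault, which is one possible reading of the SIGSEGV above, not a confirmed diagnosis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_MAGIC 0xDEADL

/* Hypothetical thread descriptor; only loosely modelled on the
 * `stackguard` field visible in the GDB output. */
typedef struct {
    char  *stack;       /* lowest address of the allocated stack */
    size_t stack_size;  /* size of the stack in bytes */
    long  *stackguard;  /* guard word, or NULL if no stack was set up */
} toy_thread;

/* Allocate a stack and plant the guard word at its low end. */
int toy_thread_init(toy_thread *t, size_t size) {
    if (t == NULL || size < sizeof(long)) return -1;
    t->stack = malloc(size);
    if (t->stack == NULL) return -1;
    t->stack_size = size;
    t->stackguard = (long *)t->stack;
    *t->stackguard = GUARD_MAGIC;
    return 0;
}

/* Returns 1 if the guard is intact, 0 if it was overwritten,
 * and -1 if there is no guard to check. Dereferencing without
 * the NULL test would crash inside the check itself. */
int toy_guard_check(const toy_thread *t) {
    if (t == NULL || t->stackguard == NULL) return -1;
    return *t->stackguard == GUARD_MAGIC ? 1 : 0;
}

/* Simulate a thread that has used `used` bytes, counted from the
 * high end of the stack downwards. */
void toy_simulate_usage(toy_thread *t, size_t used) {
    assert(t != NULL && t->stack != NULL);
    if (used > t->stack_size) used = t->stack_size;
    memset(t->stack + t->stack_size - used, 0xAB, used);
}
```

A guard word only detects an overflow after the fact, and only if the overflow actually touched the guard; a thread that jumps past it (for example with a large local array) can corrupt neighbouring memory without tripping the check.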

Additionally, build_qemu.sh calls build_pth.sh to generate libpth.so. I ran ./build_qemu.sh -timing, but an error occurred.

  CC      chardev/char-udp.o
  LINK    tests/qemu-iotests/socket_scm_helper
  CC      qga/commands.o
/usr/bin/ld: cannot find -lpth
collect2: error: ld returned 1 exit status
/home/s00523304/qflex/qflex/qemu/rules.mak:121: recipe for target 'tests/qemu-iotests/socket_scm_helper' failed
make: *** [tests/qemu-iotests/socket_scm_helper] Error 1
make: *** Waiting for unfinished jobs....
  AS      optionrom/multiboot.o
  CC      optionrom/linuxboot_dma.o

So I first ran ./build_pth.sh and then ran ./build_qemu.sh -timing, and there was no error. Could the incorrect build order have caused my first error, the segmentation fault?

Hnefi commented 3 years ago

Hi, thanks for raising this issue and bringing it to our attention.

For us to help you, we need some more details about your system setup, and we need to have a minimal reproducible example. I was not able to reproduce this issue in my environment yet. Can you let us know what is your host system, compiler/OS version, and how you built QEMU and Qflex, so we can make progress on narrowing down the source of the problem? Also, was this error generated when trying to run the image we provide with our tutorial?

Regarding your second question, I am not totally clear about the order of tools you used. The correct way of building the system is to first install the dependencies, then build the PTH library, and then build QEMU & QFlex. Otherwise, QEMU would not actually build. Can you let us know the process you went through to actually build the tools successfully and then run them?

YinghuiShao commented 3 years ago

I want to run Matrix Multiplication with QFlex to test its deterministic feature.

My host system is "x86_64 GNU/Linux" and my Linux version is "Ubuntu 18.04.2 LTS".

This error was generated when running the image you provide with the tutorial. I strictly followed the instructions at https://qflex.epfl.ch/download/ under "Example: Running Matrix Multiplication with QFlex – Timing (KnottyKraken)".

Downloading QFlex and setting up the environment

• Clone the main QFlex repository along with all the submodules in the $QFLEX directory.

$ git clone https://github.com/parsa-epfl/qflex --recurse-submodules

After this step, I had cloned most of the QFlex code, except for "qemu/dtc" and "qemu/roms", which failed with a "connection timed out" error.

• Go to qemu directory, install required dependencies and build QEMU with the -timing option (README).
• Go to flexus directory, install required dependencies and build the KnottyKraken simulator (README).
  o Build KnottyKraken using cmake -DSIMULATOR=KnottyKraken . && make -j.
• Go to images directory, checkout the matmul branch and setup the image (README).

I ran "$home/qflex/qemu/scripts/snap-manager.py --qemu-img-cmd-path $home/qflex/qemu update $home/qflex/images/ubuntu16/ubuntu.qcow2" (hereafter img_cmd) and an error occurred. So I first built QEMU in emulation mode, then ran img_cmd, and finally built QEMU in timing mode.

• Go to scripts/captain directory ($CAPTAIN), and setup the config parameters (README).
  o Use simulation_type=timing along with the correct flexus_path and flexus_timing_path.
  o Use icount=on in config/system.ini.
  o Provide the correct user_postload path. An example file is given at config/user_postload.
  o Update the paths in config/user_postload corresponding to “To be updated” according to your setup.
• Run the captain script to start simulation.
  o Create a $QFLEX/run directory to contain all the files produced during the run and cd $QFLEX/run.
  o Create an output directory (e.g. $QFLEX/run/output) to store the logs from the simulation.
  o Use echo 1 > $QFLEX/run/preload_system_width to provide the system width to KnottyKraken. (1 is the number of cores in matmul).
  o Run captain using $CAPTAIN/captain $CAPTAIN/config/system.ini -o $QFLEX/run/output.

Hnefi commented 3 years ago

Thanks for the reply - I see that you mainly followed the identical commands from the online tutorial. There are a few things to mention here:

I tried to replicate your segmentation fault error on a clean Ubuntu 18.04 environment, but I'm unable to reproduce the problem. Also, do I understand correctly that you first had a segmentation fault, but then after re-building PTH, there is now no error and you are able to run the simulator?

I suggest we narrow this problem down as much as possible, to find a minimal reproducible example. Perhaps we can do the following:

We can then investigate more afterwards. Thanks!

Hnefi commented 3 years ago

Hi @YinghuiShao, I see that you have edited the original issue with the following information. Please add new info as further comments below because then I can follow the discussion more clearly.

================== THREAD CONTEXT SWITCH ===========================================
23283:pth_sched.c:0320: Finished switch back to pth_sched stack 0x555556755350, size 65536, FROM stack 0x0, size 0
23283:pth_sched.c:0325: pth_scheduler: cameback from thread 0x5555568fdda0 ("unknown")
23283:pth_sched.c:0334: pth_scheduler: thread "unknown" ran 0.160900

This debug information indicates that the PTH scheduler is switching into its scheduler thread, from an unknown/unallocated stack (you can see the stack ptr is 0x0). We still don't have enough information to reproduce this problem. It is possible that this happens when the threading system is initialized for the first time, and thus it switches to the scheduler stack from the currently running hardware thread, which will not have allocated its stack from PTH. Can you tell me exactly your sequence of commands to gather this output behaviour, so I can attempt to reproduce the problem?

If you have solved this in your own way, as you indicated in #27, then we would really appreciate you opening a pull request in the PTH repository with an explanation of the problem, how to reproduce it, and the solution. We can then consider accepting it to our repo.

Cheers.

YinghuiShao commented 3 years ago

The pth stack overflow problem has been resolved by increasing the stack size (pth_attr.c, line 92, in the pth_attr_init function).
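
When raising a thread stack size like this, one way to choose the new value rather than guess is to measure how much stack a workload actually touches. The sketch below does that with POSIX ucontext (the `__start_context` frame in the backtrace suggests pth uses makecontext-based switching on this platform). It is an illustration only: `measure_stack_use` and `workload` are hypothetical names that exist in neither pth nor QFlex, and the 64 KiB buffer is an arbitrary stand-in for real work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

#define FILL 0xA5  /* pattern used to pre-fill the private stack */

static ucontext_t main_ctx, work_ctx;

/* Hypothetical workload: touches roughly 64 KiB of its own stack. */
static void workload(void) {
    volatile char buf[64 * 1024];
    for (size_t i = 0; i < sizeof buf; i++) buf[i] = (char)i;
}

/* Run fn on a private stack of `size` bytes and return how many bytes,
 * measured from the high end, were touched. Returns (size_t)-1 on error.
 * The stack must be generously larger than what fn needs: if fn
 * overflows it, the behaviour is undefined and nothing is measured. */
size_t measure_stack_use(void (*fn)(void), size_t size) {
    assert(fn != NULL && size > 0);
    unsigned char *stack = malloc(size);
    if (stack == NULL) return (size_t)-1;
    memset(stack, FILL, size);

    if (getcontext(&work_ctx) != 0) { free(stack); return (size_t)-1; }
    work_ctx.uc_stack.ss_sp = stack;
    work_ctx.uc_stack.ss_size = size;
    work_ctx.uc_link = &main_ctx;      /* resume here when fn returns */
    makecontext(&work_ctx, fn, 0);

    if (swapcontext(&main_ctx, &work_ctx) != 0) {
        free(stack);
        return (size_t)-1;
    }

    /* The stack grows downwards, so the first byte from the low end
     * that no longer holds FILL marks the deepest point reached. */
    size_t untouched = 0;
    while (untouched < size && stack[untouched] == FILL) untouched++;
    free(stack);
    return size - untouched;
}
```

Calling `measure_stack_use(workload, 256 * 1024)` from `main` reports a value a little above 64 KiB on a typical x86_64 Linux build; a stack size would then be chosen with comfortable headroom over the largest value observed.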

Hnefi commented 3 years ago

Hi @YinghuiShao, I am re-opening this issue as I recently experienced a PTH stack overflow when running a simulation on a different platform than the one I previously used. I have a fix in the pipeline which I am testing now.

Hnefi commented 3 years ago

I'm closing this since it was fixed here: https://github.com/parsa-epfl/qemu/pull/58