harvard-acc / smaug

SMAUG: Simulating Machine Learning Applications Using Gem5-Aladdin
https://harvard-acc.github.io/smaug_docs
BSD 3-Clause "New" or "Revised" License

Simulation with ResNet fails #103

Open daecheolyou opened 3 years ago

daecheolyou commented 3 years ago

During simulation with ResNet, a segmentation fault occurs in gem5. I created the ResNet pb and pbtxt files by running smaug/experiments/models/imagenet-resnet/resnet_network.py. All configuration files are the same as in the minerva example; only model_files was modified so that it points to the generated pb and pbtxt files. The input trace was generated by running trace.sh.

Below is the stdout log at the end.

Scheduling data (Data).
Scheduling data_1 (Data).
Scheduling data_10 (Data).
Scheduling data_100 (Data).
Scheduling data_101 (Data).
Scheduling data_102 (Data).
Scheduling data_103 (Data).
Scheduling data_104 (Data).
Scheduling data_105 (Data).
Scheduling data_106 (Data).
Scheduling data_107 (Data).
Scheduling data_108 (Data).
Scheduling data_109 (Data).

The stderr log shows the following message before the backtrace.

gem5 has encountered a segmentation fault!

Please let me know if I configured something wrong. Thanks.

xyzsam commented 3 years ago

Yuan, can you take a look at this?

yaoyuannnn commented 3 years ago

Yes, will take a look this week.

yaoyuannnn commented 3 years ago

Just a guess: did you update trace_file_name in gem5.cfg to use the correct trace file?
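One quick way to check this is to print every trace_file_name that gem5.cfg defines and compare it with the trace that was actually generated. The snippet below is only a rough sketch: it assumes gem5.cfg can be read as an INI-style file, and the section name `acc0` in the sample is a hypothetical placeholder, not taken from the real minerva configuration.

```python
import configparser

# Hypothetical sample; the real gem5.cfg has more keys and its own section names.
SAMPLE_CFG = """
[acc0]
trace_file_name = ./dynamic_trace_acc0.gz
"""


def trace_files(cfg_text):
    """Return {section: trace_file_name} for sections that define the key."""
    # Interpolation is disabled so that any %(...)s references are returned verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        section: parser[section]["trace_file_name"]
        for section in parser.sections()
        if "trace_file_name" in parser[section]
    }


if __name__ == "__main__":
    for section, path in trace_files(SAMPLE_CFG).items():
        print(f"{section}: {path}")
```

To run it against a real file, read the file contents and pass them to `trace_files` instead of `SAMPLE_CFG`.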

daecheolyou commented 3 years ago

trace_file_name doesn't need to be modified. Only model_files was modified, so that it points to the pbtxt and pb files under imagenet-resnet. The trace file was generated with trace.sh, which takes model_files as input and always writes its output to dynamic_trace_acc0.gz.

yaoyuannnn commented 3 years ago

I just tried running resnet50. It's still running, but it has already started running the accelerator for the first convolution layer (conv0), which is clearly past the point where your simulation crashed. To reduce the trace size for this relatively large network, the only change I made was using --sample-level=very_high in trace.sh (and the same in run.sh). Other than updating the protobuf inputs, the rest of the configuration files are the same as the ones in sims/smv/tests/minerva.

xyzsam commented 3 years ago

Did the simulator leave any stacktraces indicating where the segfault occurred?

daecheolyou commented 3 years ago

Below is the stack trace for the simulation failure. I ran the simulation several times with ResNet, and sometimes it got further than the log I originally posted. For example, it has reached Scheduling relu2b (ReLU). However, it eventually hit a segmentation fault with the same kind of stack trace as below.

/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_Z15print_backtracev+0x2c)[0x55a3fb5e722c]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(+0x6e92ff)[0x55a3fb5f92ff]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f8073fc9890]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0xcf)[0x7f80725f6d9f]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN6X86ISA7Decoder10decodeInstENS_11ExtMachInstE+0x2e6c1)[0x55a3fc00f141]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN6X86ISA7Decoder6decodeENS_11ExtMachInstEm+0x244)[0x55a3fbfa88f4]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN6X86ISA7Decoder6decodeERNS_7PCStateE+0x22b)[0x55a3fbfa8beb]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN12DefaultFetchI9O3CPUImplE5fetchERb+0x979)[0x55a3fbb0eb69]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN12DefaultFetchI9O3CPUImplE4tickEv+0xd3)[0x55a3fbb0fe23]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN9FullO3CPUI9O3CPUImplE4tickEv+0x12b)[0x55a3fbaedb3b]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_ZN10EventQueue10serviceOneEv+0xd9)[0x55a3fb5ef709]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_Z9doSimLoopP10EventQueue+0x148)[0x55a3fb610e28]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_Z8simulatem+0xcba)[0x55a3fb611dda]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(+0x7bf6d1)[0x55a3fb6cf6d1]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(+0x5e8754)[0x55a3fb4f8754]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x64d7)[0x7f8074276c47]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f8074276366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f8074276366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f8074276366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f80742705d9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x6ac0)[0x7f8074277230]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5bf6)[0x7f8074276366]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7d8)[0x7f80743b5908]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x7f80742705d9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyRun_StringFlags+0x76)[0x7f80743206f6]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(_Z6m5MainiPPc+0x83)[0x55a3fb5f8013]
/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt(main+0x38)[0x55a3fb448e08]
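
A trace like this is easier to scan once each frame is split into its binary, mangled symbol, and offset. The following is a minimal sketch of such a parser, assuming the frames follow the `binary(symbol+0xoffset)[0xaddress]` layout seen above. The names `FRAME_RE`, `SAMPLE`, and `parse_frames` are illustrative only and are not part of gem5 or SMAUG.

```python
import re

# One frame looks like: /path/to/binary(symbol+0xOFFSET)[0xADDRESS]
# The symbol may be empty when the binary has no symbol for that address.
FRAME_RE = re.compile(
    r"(?P<binary>\S+?)"
    r"\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"
)

# Three frames copied from the trace above, whitespace-separated.
SAMPLE = (
    "/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt"
    "(_Z15print_backtracev+0x2c)[0x55a3fb5e722c] "
    "/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0xcf)[0x7f80725f6d9f] "
    "/workspace/gem5-aladdin/src/aladdin/../../build/X86/gem5.opt"
    "(+0x6e92ff)[0x55a3fb5f92ff]"
)


def parse_frames(text):
    """Split a backtrace into dicts with binary, symbol, offset, and address."""
    frames = []
    for match in FRAME_RE.finditer(text):
        frames.append({
            # Keep only the file name of the binary for readability.
            "binary": match.group("binary").rsplit("/", 1)[-1],
            # An empty symbol field becomes None.
            "symbol": match.group("symbol") or None,
            "offset": int(match.group("offset"), 16),
            "address": int(match.group("address"), 16),
        })
    return frames


if __name__ == "__main__":
    for index, frame in enumerate(parse_frames(SAMPLE)):
        symbol = frame["symbol"] or "<no symbol>"
        print(f"#{index} {frame['binary']}: {symbol} +{frame['offset']:#x}")
```

The mangled C++ names (for example `_ZN6X86ISA7Decoder10decodeInstENS_11ExtMachInstE`) can then be passed through a demangler such as `c++filt` to get readable function signatures.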