sarchlab / mgpusim

A highly-flexible GPU simulator for AMD GPUs.
MIT License

Sometimes -trace-vis doesn't work with -unified-gpus and -use-unified-memory #99

Open: JunjoFor opened this issue 2 weeks ago

JunjoFor commented 2 weeks ago

To Reproduce: MGPUSim at commit ID 1596eeab6f73a98dbc435d2cca51afba1e9fc998

Command that recreates the problem: ./fft -unified-gpus=1,2,3,4 -timing -use-unified-memory -trace-vis

Current behavior: The simulation crashes with "Panic: UNIQUE constraint failed: trace.task_id".

Full error:

Trace is Collected in Database: akita_trace_cs7orts4va1ev4r88to0.sqlite3
Monitoring simulation with http://localhost:37431
{msg_7823990_e2e 7823990_req_out msg_e2e msg_e2e PCIe.EndPoint[1] 0.0005163 0.00051645 [] 0xc03b8e9a40 <nil>}
2024/10/16 11:59:06 /home/mgpusim/driver/driver.go:117: Panic: UNIQUE constraint failed: trace.task_id
goroutine 176 [running]:
runtime/debug.Stack()
        /usr/lib/go-1.18/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
        /usr/lib/go-1.18/src/runtime/debug/stack.go:16 +0x19
github.com/sarchlab/mgpusim/v3/driver.(*Driver).runEngine.func1()
        /home/mgpusim/driver/driver.go:118 +0x58
panic({0xa748c0, 0xc042005800})
        /usr/lib/go-1.18/src/runtime/panic.go:838 +0x207
github.com/sarchlab/akita/v3/tracing.(*SQLiteTraceWriter).Flush(0xc00007fac0)
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/tracing/sqlite.go:73 +0x399
github.com/sarchlab/akita/v3/tracing.(*SQLiteTraceWriter).Write(0xc00007fac0, {{0xc03733d9d0, 0xf}, {0xc03739c2d0, 0x24}, {0xab5275, 0x7}, {0x9deac4, 0x12}, {0xc00011a5e0, ...}, ...})
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/tracing/sqlite.go:47 +0xe5
github.com/sarchlab/akita/v3/tracing.(*DBTracer).EndTask(0xc00007fd00, {{0xc03767e6a0, 0xf}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...})
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/tracing/dbtracer.go:83 +0x182
github.com/sarchlab/akita/v3/tracing.(*traceHook).Func(0xc000100000?, {{0x7f9b196f54b0, 0xc00027b860}, 0xfdb0a0, {0xa8dbe0, 0xc03768dcb0}, {0x0, 0x0}})
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/tracing/tracehook.go:39 +0x23e
github.com/sarchlab/akita/v3/sim.(*HookableBase).InvokeHook(0xa8dbe0?, {{0x7f9b196f54b0, 0xc00027b860}, 0xfdb0a0, {0xa8dbe0, 0xc03768dcb0}, {0x0, 0x0}})
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/sim/hook.go:81 +0xc4
github.com/sarchlab/akita/v3/tracing.EndTask({0xc03767e6a0, 0xf}, {0xcc9b80, 0xc00027b860})
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/tracing/api.go:151 +0x155
github.com/sarchlab/akita/v3/tracing.TraceReqFinalize({0xcc4a80?, 0xc03738e2a0?}, {0xcc9b80, 0xc00027b860})
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/tracing/api.go:208 +0x5a
github.com/sarchlab/akita/v3/mem/vm/addresstranslator.(*AddressTranslator).parseTranslation(0xc00027b860, 0x3f4123c0f1c4a050)
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/mem/vm/addresstranslator/addresstranslator.go:221 +0x45a
github.com/sarchlab/akita/v3/mem/vm/addresstranslator.(*AddressTranslator).runPipeline(0xc00027b860, 0xcc4c80?)
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/mem/vm/addresstranslator/addresstranslator.go:83 +0x69
github.com/sarchlab/akita/v3/mem/vm/addresstranslator.(*AddressTranslator).Tick(0xc00027b860, 0xc0003088e0?)
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/mem/vm/addresstranslator/addresstranslator.go:67 +0x32
github.com/sarchlab/akita/v3/sim.(*TickingComponent).Handle(0xc0003088e0, {0xcc8c90?, 0xc041e8b080?})
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/sim/ticker.go:140 +0x45
github.com/sarchlab/akita/v3/sim.(*SerialEngine).Run(0xc0000ea480)
        /home/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/sim/serialengine.go:96 +0x367
github.com/sarchlab/mgpusim/v3/driver.(*Driver).runEngine(0xc0000b31e0)
        /home/mgpusim/driver/driver.go:125 +0xaa
created by github.com/sarchlab/mgpusim/v3/driver.(*Driver).runAsync
        /home/mgpusim/driver/driver.go:108 +0x18a
{7687625 7687522@GPU[3].SA[15].L1ICache cache_transaction read GPU[3].SA[15].L1ICache.Local 0.00051437 0.000514374 [] <nil> <nil>}
error: atexit handler error: UNIQUE constraint failed: trace.task_id
{7687625 7687522@GPU[3].SA[15].L1ICache cache_transaction read GPU[3].SA[15].L1ICache.Local 0.00051437 0.000514374 [] <nil> <nil>}
error: atexit handler error: UNIQUE constraint failed: trace.task_id

Expected behavior: The simulation does not crash.

Additional context: The same crash also happens when I run fir with the same options and an extended length.

syifan commented 2 weeks ago

@JunjoFor The current implementation of unified memory is buggy. We are currently working on a project that will entirely revamp the implementation of unified memory. Do you need this problem urgently solved? I may give you some suggestions on how to avoid this problem.

JunjoFor commented 2 weeks ago

The visualisation of the tasks is cool, but not a must for me. I reported the issue to let you know. It is good to know that you are working on revising the unified memory implementation. In the near future, I would like to implement an L3 TLB shared by all GPUs, as in https://arxiv.org/pdf/2404.18361. I don't know how the revamp might interfere with that implementation, which I would like to merge into the main repository one day.