sarchlab / mgpusim

A highly-flexible GPU simulator for AMD GPUs.
MIT License

Failed Execution for Program Compiled with ROCm 1.7.1 #64

Open MyNewAcc1234 opened 5 months ago

MyNewAcc1234 commented 5 months ago

To Reproduce

MGPUSim version of commit ID: c72c904
ROCm 1.7.1
clang-ocl 4.0

Command that recreates the problem

Modify only the matrixmultiplication kernel, changing it to:

#define TILE_SIZE 8

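/* Tiles of both A and B are cached in local memory: blockA holds the A tile
 * in its first TILE_SIZE * TILE_SIZE floats and the B tile in the next
 * TILE_SIZE * TILE_SIZE floats. */
/* Each work-item computes one element of C, so the work-group size must be
 * (TILE_SIZE, TILE_SIZE). widthA is also used as the row stride of B and C,
 * so B and C must have widthA columns. */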
__kernel void mmmKernel_local(__global float *matrixA,
                              __global float *matrixB,
                              __global float* matrixC,
                              int widthA,
                              __local float *blockA)
{
    int thd_x = get_local_id(0);
    int thd_y = get_local_id(1);
    int tb_size_x = get_local_size(0);
    int global_x = get_global_id(0);
    int global_y = get_global_id(1);
    int global_size_x = get_global_size(0);
    int numLoop = widthA / TILE_SIZE;
    int rowA = global_y;
    int colA;
    int rowB;
    int colB = global_x;
    int idxA;
    int idxB;

    float result = 0;

    for(int i=0; i<numLoop; ++i)
    {
        colA = i * TILE_SIZE + thd_x;
        rowB = i * TILE_SIZE + thd_y;
        idxA = rowA * widthA + colA;
        idxB = rowB * widthA + colB;

        blockA[thd_y * TILE_SIZE + thd_x] = matrixA[idxA];
        blockA[TILE_SIZE * TILE_SIZE + thd_y * TILE_SIZE + thd_x] = matrixB[idxB];

        barrier(CLK_LOCAL_MEM_FENCE);

        for(int j=0; j<TILE_SIZE; ++j)
        {
            result = result + blockA[thd_y * TILE_SIZE + j] * blockA[TILE_SIZE * TILE_SIZE + j * TILE_SIZE + thd_x];
        }

        barrier(CLK_LOCAL_MEM_FENCE);
    }
    matrixC[global_y * widthA + global_x] = result;

}

Then compile it with ROCm 1.7.1, the version described in the original paper. The clang-ocl version is 4.0. The compile command is

/opt/rocm/bin/clang-ocl -mcpu=gfx803 GEMM.cl -o mykernels.hsaco

Then run samples/matrixmultiplication in emulation mode with

./matrixmultiplication

Current behavior

2024/05/26 09:40:23 /sim/mgpusim/emu/aluds.go:25: Opcode 15 for DS format is not implemented
2024/05/26 09:40:23 /sim/mgpusim/driver/driver.go:117: Panic: Opcode 15 for DS format is not implemented
goroutine 167 [running]:
runtime/debug.Stack()
        /go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
        /go/src/runtime/debug/stack.go:16 +0x13
github.com/sarchlab/mgpusim/v3/driver.(*Driver).runEngine.func1()
        /sim/mgpusim/driver/driver.go:118 +0xaa
panic({0x956cc0?, 0xc000484000?})
        /go/src/runtime/panic.go:770 +0x132
log.Panicf({0xa12007?, 0x4?}, {0xc00018bba8?, 0x1000?, 0x3f?})
        /go/src/log/log.go:439 +0x65
github.com/sarchlab/mgpusim/v3/emu.(*ALUImpl).runDS(0xc00017e140, {0xbfa550, 0xc0003d4e60})
        /sim/mgpusim/emu/aluds.go:25 +0x134
github.com/sarchlab/mgpusim/v3/emu.(*ALUImpl).Run(0xc00017e140, {0xbfa550, 0xc0003d4e60})
        /sim/mgpusim/emu/alu.go:77 +0x16f
github.com/sarchlab/mgpusim/v3/emu.(*ComputeUnit).executeInst(0xc00029c000, 0xc0003d4e60)
        /sim/mgpusim/emu/computeunit.go:352 +0x55
github.com/sarchlab/mgpusim/v3/emu.(*ComputeUnit).runWfUntilBarrier(0xc00029c000, 0xc0003d4e60)
        /sim/mgpusim/emu/computeunit.go:334 +0x28
github.com/sarchlab/mgpusim/v3/emu.(*ComputeUnit).runWG(0xc00029c000, 0xc000267080, 0x3ff0000000000000)
        /sim/mgpusim/emu/computeunit.go:145 +0x1c5
github.com/sarchlab/mgpusim/v3/emu.(*ComputeUnit).runEmulation(0xc00029c000, 0xc00017a3e8)
        /sim/mgpusim/emu/computeunit.go:130 +0x4f
github.com/sarchlab/mgpusim/v3/emu.(*ComputeUnit).Handle(0xc00029c000, {0xbfa5b0?, 0xc00017a3e8?})
        /sim/mgpusim/emu/computeunit.go:87 +0xdf
github.com/sarchlab/akita/v3/sim.(*SerialEngine).Run(0xc0001ea360)
        /root/go/pkg/mod/github.com/sarchlab/akita/v3@v3.0.0/sim/serialengine.go:96 +0x357
github.com/sarchlab/mgpusim/v3/driver.(*Driver).runEngine(0xc0001fc820)
        /sim/mgpusim/driver/driver.go:125 +0xaa
created by github.com/sarchlab/mgpusim/v3/driver.(*Driver).runAsync in goroutine 163
        /sim/mgpusim/driver/driver.go:108 +0x18a

Additional context

It seems that the compiler version I used is not suitable, but it is the version described in the ISCA'19 paper, so I am not sure what causes the problem. If the version is the cause, could someone please tell me which ROCm and clang-ocl versions are the right ones? I was also wondering whether a Docker image could be provided for MGPUSim to avoid such version-related problems.

syifan commented 5 months ago

One question: why not use the pre-compiled kernel?

There is no particular compiler version that MGPUSim can guarantee to support. MGPUSim only implements the instructions it has encountered so far. So a new kernel is very likely to contain some instructions that are not implemented.

For the Docker image, we use the official ROCm Docker image, version 3.8.
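
To illustrate why a new kernel can hit this panic, here is a minimal sketch of the "implement only what has been encountered" dispatch pattern. It is an assumption-laden illustration and not MGPUSim's actual code: the handler table, the opcode numbers in it, and the function name runDS as used here are placeholders. Only the wording of the error message is taken from the log above.

```go
package main

import "fmt"

// dsHandler is a hypothetical handler for one emulated DS-format opcode.
type dsHandler func() string

// implemented is a hypothetical table that holds only the opcodes needed by
// kernels seen so far. The entries are placeholders, not MGPUSim's real list.
var implemented = map[int]dsHandler{
	13: func() string { return "handled opcode 13" },
	54: func() string { return "handled opcode 54" },
}

// runDS dispatches on the opcode. Anything missing from the table is reported
// as unimplemented, which is the situation shown in the stack trace.
func runDS(opcode int) (string, error) {
	h, ok := implemented[opcode]
	if !ok {
		return "", fmt.Errorf("Opcode %d for DS format is not implemented", opcode)
	}
	return h(), nil
}

func main() {
	for _, op := range []int{13, 15} {
		out, err := runDS(op)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(out)
	}
}
```

Under this pattern, supporting a kernel built with a different compiler means adding a handler for each opcode the emulator has not seen before.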

MyNewAcc1234 commented 5 months ago

Thanks for your reply.

I chose to use a new kernel because I would like to learn how to add a new benchmark in general, and I am planning to implement some more complex workloads.

Thanks for the information. I will try again with the ROCm 3.8 Docker image.