Open shifeiwen opened 1 month ago
Thanks for the valuable feedback! I wonder whether you find a smaller unroll number that can work on your side?
Hi @MasterJH5574, I followed the instructions in gemv and set the loop unroll value to 8. Running the OpenCL kernel on the 8 Gen 2 no longer causes errors, and there is no significant difference in speed.
Feel free to send a PR!
gentle ping @shifeiwen
@tqchen I will submit a PR later
🐛 Bug
To Reproduce
I found that when compiling an Android OpenCL kernel, the loop unroll factor may be too large, which makes the generated kernel code very big. When that kernel goes through clBuildProgram, the build runs out of memory. In Florence2, with the original unroll setting of 64, the build fails on 8 Gen 2 phones but not on 8 Gen 3 phones. After I changed it to 8, it runs normally on 8 Gen 2 phones. As I understand it, the OpenCL kernels of the earlier MiniCPM model hit the same problem on 8 Gen 2 phones.
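To illustrate why the unroll factor matters, here is a minimal, hypothetical sketch (not the actual TVM codegen) that emits a fully unrolled OpenCL-style loop body as source text. The emitted source grows linearly with the unroll factor, so an unroll of 64 produces 8x the statements of an unroll of 8, which is the kind of blow-up that can push a mobile OpenCL compiler past its memory limits:

```python
def emit_unrolled_body(unroll: int, vec_len: int = 4) -> str:
    """Emit an OpenCL-style fully unrolled inner loop as source text.

    Hypothetical stand-in for a codegen pass: every unrolled iteration
    becomes its own multiply-accumulate statement, so the source size
    (and the compiled kernel) grows linearly with the unroll factor.
    """
    lines = []
    for k in range(unroll):
        for v in range(vec_len):
            lines.append(f"  acc.s{v} += a[{k}] * b[{k} * {vec_len} + {v}];")
    return "\n".join(lines)

small = emit_unrolled_body(8)
large = emit_unrolled_body(64)
# The unroll=64 body has 8x the statements of the unroll=8 body.
print(len(large.splitlines()) // len(small.splitlines()))  # -> 8
```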
Error pipeline:
https://github.com/mlc-ai/relax/blob/f5f048bbd71513f087799f987019e3931f68a6d9/python/tvm/dlight/gpu/matmul.py#L789
Error message: `3rdparty/tvm/src/runtime/opencl/opencl_module.cc:264: OpenCL build error for device=0x7b1d4dcb58 Error: CL_OUT_OF_HOST_MEMORY`
Failing C++ call: `err = clBuildProgram(programs_[func_name][device_id], 1, &dev, nullptr, nullptr, nullptr);`
error_opencl_kernel.txt
This error can be alleviated by appropriately reducing the unroll factor and the number of unrolled loop levels, which keeps the kernel code size under control.
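One way to make that mitigation systematic (a sketch only, not an existing TVM API) is to cap the unroll factor from a per-iteration code-size estimate and a target-specific budget, so weaker GPUs like the 8 Gen 2 automatically fall back from 64 to a smaller value while stronger GPUs keep the default. All names and numbers below are illustrative:

```python
def pick_unroll(max_unroll: int, stmts_per_iter: int, stmt_budget: int) -> int:
    """Pick the largest power-of-two unroll factor whose fully unrolled
    body stays within a statement budget.

    Hypothetical helper: `stmt_budget` would have to be tuned against the
    target GPU's OpenCL compiler limits; it is not a value TVM exposes.
    """
    unroll = max_unroll
    while unroll > 1 and unroll * stmts_per_iter > stmt_budget:
        unroll //= 2
    return unroll

# With a tight budget, the default of 64 degrades to 8.
print(pick_unroll(64, stmts_per_iter=16, stmt_budget=128))   # -> 8
# With a generous budget, the default of 64 survives.
print(pick_unroll(64, stmts_per_iter=16, stmt_budget=2048))  # -> 64
```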
Expected behavior
Environment

- How you installed MLC-LLM (`conda`, source): pip
- How you installed TVM-Unity (`pip`, source):
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models):

Additional context