taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Android AOT Vulkan GPU Performance #5685

Open helloworldstone opened 2 years ago

helloworldstone commented 2 years ago

Hi, everyone!

I wrote an AOT Vulkan demo on Android following https://github.com/taichi-dev/taichi-aot-demo/blob/master/implicit_fem/include/fem_app.h, and compiled a binary executable with CMake's add_executable.

When I run the executable, I find the Vulkan "kernel" performance is much worse than OpenCL. The "kernel" time is measured as follows (running 100 times and taking the average): vulkan_time = launch + synchronize; opencl_time = clEnqueueNDRangeKernel + clFinish.
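For reference, the OpenCL side is timed roughly like this (a sketch only, with hypothetical handles; queue and gauss_kernel stand for objects created elsewhere):

#include <CL/cl.h>
#include <sys/time.h>

// Sketch only: `queue` and `gauss_kernel` are assumed to be set up elsewhere.
// The timed pair clEnqueueNDRangeKernel + clFinish mirrors launch + synchronize
// on the Vulkan side.
double time_opencl_kernel_ms(cl_command_queue queue, cl_kernel gauss_kernel,
                             int imgW, int imgH, int iters) {
  size_t global[2] = {static_cast<size_t>(imgH), static_cast<size_t>(imgW)};
  struct timeval t0, t1;
  gettimeofday(&t0, nullptr);
  for (int i = 0; i < iters; i++) {
    clEnqueueNDRangeKernel(queue, gauss_kernel, 2, nullptr, global, nullptr,
                           0, nullptr, nullptr);
    clFinish(queue);  // block until the kernel has finished
  }
  gettimeofday(&t1, nullptr);
  double us = 1e6 * (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec);
  return us / 1000.0 / iters;  // average ms per launch
}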

I found that one of the reasons is the GPU frequency: when the Vulkan demo runs, the GPU frequency stays at half of its maximum. So,

  1. Is there a Taichi C++ API to boost the GPU frequency?
  2. Besides the GPU frequency, is there anything else I should do to improve Vulkan "kernel" performance?
  3. Do you have a plan to support OpenCL?

Thanks a lot! Looking forward to your reply!

ailzhang commented 2 years ago

Hey @helloworldstone , wow, it's pretty impressive to see you got it up and running! We haven't done any benchmarks against OpenCL yet; would you mind sharing your benchmark repo so that we can give it a try as well? Thanks a lot! cc: @turbo0628 who might be interested :D

helloworldstone commented 2 years ago

Hi @ailzhang, sorry about that, I don't have a benchmark repo. I've pasted the code below.

On a Snapdragon 888, the OpenCL kernel costs 2.21 ms and the Vulkan kernel costs 8.02 ms.

C++:

int main() {
const std::vector<std::string> extensions = {
    VK_KHR_SURFACE_EXTENSION_NAME,
    VK_KHR_ANDROID_SURFACE_EXTENSION_NAME,
    VK_KHR_GET_PHYSICAL_DEVICE_PROPERTIES_2_EXTENSION_NAME,
};
// Create a Vulkan Device
taichi::lang::vulkan::VulkanDeviceCreator::Params evd_params;
evd_params.api_version = VK_API_VERSION_1_2;
evd_params.additional_instance_extensions = extensions;
evd_params.additional_device_extensions = {VK_KHR_SWAPCHAIN_EXTENSION_NAME};
evd_params.is_for_ui = false;
evd_params.surface_creator = nullptr;
std::unique_ptr<taichi::lang::vulkan::VulkanDeviceCreator> embedded_device_ =
    std::make_unique<taichi::lang::vulkan::VulkanDeviceCreator>(evd_params);

taichi::lang::vulkan::VulkanDevice* device_ = static_cast<taichi::lang::vulkan::VulkanDevice*>(
    embedded_device_->device());

// Initialize our Vulkan Program pipeline
std::vector<uint64_t> host_result_buffer_;
host_result_buffer_.resize(taichi_result_buffer_entries);
taichi::lang::gfx::GfxRuntime::Params params;
params.host_result_buffer = host_result_buffer_.data();
params.device = embedded_device_->device();
std::unique_ptr<taichi::lang::gfx::GfxRuntime> vulkan_runtime_ =
    std::make_unique<taichi::lang::gfx::GfxRuntime>(std::move(params));

// Get kernels
std::string shader_source = "/data/local/tmp/test_demo/aot";
taichi::lang::gfx::AotModuleParams aot_params{shader_source, vulkan_runtime_.get()};
std::unique_ptr<taichi::lang::aot::Module> module_ = taichi::lang::aot::Module::load(taichi::Arch::vulkan, aot_params);
auto root_size = module_->get_root_size();
vulkan_runtime_->add_root_buffer(root_size);
taichi::lang::aot::Kernel* fast_gauss_kernel = module_->get_kernel("fast_gauss");

const int imgW = 1080;
const int imgH = 1920;
// Prepare Ndarray for model
taichi::lang::Device::AllocParams alloc_params;
alloc_params.usage = taichi::lang::AllocUsage::Storage;
alloc_params.host_write = true;
alloc_params.host_read  = false;
alloc_params.size = imgW * imgH * sizeof(float);
taichi::lang::DeviceAllocation devalloc_inImg_  = device_->allocate_memory(alloc_params);

alloc_params.size = 25 * sizeof(float);
taichi::lang::DeviceAllocation devalloc_gaussCoefs  = device_->allocate_memory(alloc_params);

alloc_params.host_write = false;
alloc_params.host_read  = true;
alloc_params.size = imgW * imgH * sizeof(float);
taichi::lang::DeviceAllocation devalloc_outImg_ = device_->allocate_memory(alloc_params);

float coefs[25] = 
{0.00390625, 0.015625, 0.0234375,0.015625,0.00390625,
0.015625,0.0625,0.09375,0.0625,0.015625,
0.0234375,0.09375,0.140625,0.09375,0.0234375,
0.015625,0.0625,0.09375,0.0625,0.015625,
0.00390625,0.015625,0.0234375,0.015625,0.00390625};

float* tmpBuf    = (float*)malloc(imgW * imgH * sizeof(float));
float* tmpBufOut = (float*)malloc(imgW * imgH * sizeof(float));
for(int i =0; i < imgW * imgH; i++) {
    tmpBuf[i] = i % 256;
}

load_data(vulkan_runtime_.get(), devalloc_inImg_, tmpBuf, imgW * imgH * sizeof(float));
load_data(vulkan_runtime_.get(), devalloc_gaussCoefs, coefs, 5 * 5 * sizeof(float));
vulkan_runtime_->synchronize();

taichi::lang::RuntimeContext host_ctx_;
memset(&host_ctx_, 0, sizeof(taichi::lang::RuntimeContext));
host_ctx_.result_buffer = host_result_buffer_.data();

float time_use_run; 
struct timeval tpstart_run, tpend_run; 
gettimeofday(&tpstart_run, nullptr);
for(int i =0; i < 100; i++) {
    host_ctx_.set_arg_devalloc(0, devalloc_inImg_, {imgH, imgW});
    host_ctx_.set_arg_devalloc(1, devalloc_outImg_, {imgH, imgW});
    host_ctx_.set_arg_devalloc(2, devalloc_gaussCoefs, {5, 5});
    fast_gauss_kernel->launch(&host_ctx_);
    vulkan_runtime_->synchronize();
}
gettimeofday(&tpend_run, nullptr); 
time_use_run = 1000000 * (tpend_run.tv_sec - tpstart_run.tv_sec) + tpend_run.tv_usec - tpstart_run.tv_usec; 
printf("[vulkan run time] = %40f ms. \n", time_use_run / 1000 / 100);

//debug
read_data(vulkan_runtime_.get(), devalloc_outImg_, tmpBufOut, imgW * imgH * sizeof(float));
vulkan_runtime_->synchronize();
int startIdx = imgH * imgW / 2 + imgW / 2;
for (int i = startIdx; i < (startIdx + 10); i++) {
    printf("idx = %d val = %f \n", i, tmpBufOut[i]);
}
device_->dealloc_memory(devalloc_inImg_);
device_->dealloc_memory(devalloc_outImg_);
device_->dealloc_memory(devalloc_gaussCoefs);

free(tmpBuf);    tmpBuf = nullptr;
free(tmpBufOut); tmpBufOut = nullptr;
return 0;

}
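
load_data and read_data are not included in the paste above; a minimal sketch of what they do, assuming they map the host-visible allocation through the runtime's device, copy, and unmap, like the taichi-aot-demo helpers (GfxRuntime::get_ti_device() here is an assumption):

#include <cstring>

// Hypothetical helpers (not part of the paste above): upload/download a
// host-visible DeviceAllocation by mapping it, copying, and unmapping.
void load_data(taichi::lang::gfx::GfxRuntime *runtime,
               taichi::lang::DeviceAllocation &alloc,
               const void *data, size_t size) {
  void *mapped = runtime->get_ti_device()->map(alloc);  // assumed API, as in taichi-aot-demo
  std::memcpy(mapped, data, size);
  runtime->get_ti_device()->unmap(alloc);
}

void read_data(taichi::lang::gfx::GfxRuntime *runtime,
               taichi::lang::DeviceAllocation &alloc,
               void *data, size_t size) {
  void *mapped = runtime->get_ti_device()->map(alloc);
  std::memcpy(data, mapped, size);
  runtime->get_ti_device()->unmap(alloc);
}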

Python Taichi kernel:

height = 1920
width = 1080
inputImg = ti.ndarray(dtype=ti.f32, shape=(height, width))
outputImg = ti.ndarray(dtype=ti.f32, shape=(height, width))
kernelCoffes = ti.ndarray(dtype=ti.f32, shape=(5, 5))

@ti.kernel
def fast_gauss(inputImg: ti.types.ndarray(), outputImg: ti.types.ndarray(), kernelCoffes: ti.types.ndarray()):
    for pi, pj in ti.ndrange(height - 2, width - 2):
        if (pi >= 2 and pj >= 2):
            outputImg[pi, pj] = (inputImg[pi + 0 - 2, pj + 0 - 2] * kernelCoffes[0, 0] +
                                 inputImg[pi + 0 - 2, pj + 1 - 2] * kernelCoffes[0, 1] +
                                 inputImg[pi + 0 - 2, pj + 2 - 2] * kernelCoffes[0, 2] +
                                 inputImg[pi + 0 - 2, pj + 3 - 2] * kernelCoffes[0, 3] +
                                 inputImg[pi + 0 - 2, pj + 4 - 2] * kernelCoffes[0, 4] +
                                 inputImg[pi + 1 - 2, pj + 0 - 2] * kernelCoffes[1, 0] +
                                 inputImg[pi + 1 - 2, pj + 1 - 2] * kernelCoffes[1, 1] +
                                 inputImg[pi + 1 - 2, pj + 2 - 2] * kernelCoffes[1, 2] +
                                 inputImg[pi + 1 - 2, pj + 3 - 2] * kernelCoffes[1, 3] +
                                 inputImg[pi + 1 - 2, pj + 4 - 2] * kernelCoffes[1, 4] +
                                 inputImg[pi + 2 - 2, pj + 0 - 2] * kernelCoffes[2, 0] +
                                 inputImg[pi + 2 - 2, pj + 1 - 2] * kernelCoffes[2, 1] +
                                 inputImg[pi + 2 - 2, pj + 2 - 2] * kernelCoffes[2, 2] +
                                 inputImg[pi + 2 - 2, pj + 3 - 2] * kernelCoffes[2, 3] +
                                 inputImg[pi + 2 - 2, pj + 4 - 2] * kernelCoffes[2, 4] +
                                 inputImg[pi + 3 - 2, pj + 0 - 2] * kernelCoffes[3, 0] +
                                 inputImg[pi + 3 - 2, pj + 1 - 2] * kernelCoffes[3, 1] +
                                 inputImg[pi + 3 - 2, pj + 2 - 2] * kernelCoffes[3, 2] +
                                 inputImg[pi + 3 - 2, pj + 3 - 2] * kernelCoffes[3, 3] +
                                 inputImg[pi + 3 - 2, pj + 4 - 2] * kernelCoffes[3, 4] +
                                 inputImg[pi + 4 - 2, pj + 0 - 2] * kernelCoffes[4, 0] +
                                 inputImg[pi + 4 - 2, pj + 1 - 2] * kernelCoffes[4, 1] +
                                 inputImg[pi + 4 - 2, pj + 2 - 2] * kernelCoffes[4, 2] +
                                 inputImg[pi + 4 - 2, pj + 3 - 2] * kernelCoffes[4, 3] +
                                 inputImg[pi + 4 - 2, pj + 4 - 2] * kernelCoffes[4, 4])

bobcao3 commented 2 years ago

For problem 1, it's something we encountered ourselves during testing. On mobile, the OS sometimes just doesn't want to treat our application as interactive or as a game and doesn't boost performance; I don't think we can provide an API to pin the frequency (it's not possible).

For 2, in general try to avoid the "synchronize" call; it is a major factor in slowing everything down. Other performance tips vary a lot from program to program.
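Applied to the benchmark above, that would mean launching all iterations and synchronizing once after the loop. A sketch of just that change (the arguments don't change between iterations, so they can also be set once up front):

host_ctx_.set_arg_devalloc(0, devalloc_inImg_, {imgH, imgW});
host_ctx_.set_arg_devalloc(1, devalloc_outImg_, {imgH, imgW});
host_ctx_.set_arg_devalloc(2, devalloc_gaussCoefs, {5, 5});

gettimeofday(&tpstart_run, nullptr);
for (int i = 0; i < 100; i++) {
    fast_gauss_kernel->launch(&host_ctx_);  // queue work, no wait per iteration
}
vulkan_runtime_->synchronize();  // single wait after all launches
gettimeofday(&tpend_run, nullptr);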

For 3, I'm not sure we have a plan yet. OpenCL is quite "dead" as an API right now, and support from hardware vendors is sparse.

helloworldstone commented 2 years ago

@bobcao3 OK, thanks a lot!