microsoft / nnfusion

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
MIT License

[BUG] Performance is not up to standard. #248

Open ghost opened 3 years ago

ghost commented 3 years ago

🐛 Bug

I tried to reproduce the LSTM model following the tutorial: https://github.com/microsoft/nnfusion/blob/osdi20_artifact/artifacts/get_started_tutorial/README_GET_STARTED.md

```
Result_2110_0:
2.910247e-03 -2.200084e-03 1.088102e-03 1.566721e-03 5.218949e-03 -1.594784e-03 1.021980e-03 -3.463768e-05 1.831564e-03 5.588389e-03  .. (size = 256, ends with -8.857156e-05);
Result_2110_0:
2.910247e-03 -2.200084e-03 1.088102e-03 1.566721e-03 5.218949e-03 -1.594784e-03 1.021980e-03 -3.463768e-05 1.831564e-03 5.588389e-03  .. (size = 256, ends with -8.857156e-05);
Result_2110_0:
2.910247e-03 -2.200084e-03 1.088102e-03 1.566721e-03 5.218949e-03 -1.594784e-03 1.021980e-03 -3.463768e-05 1.831564e-03 5.588389e-03  .. (size = 256, ends with -8.857156e-05);
Result_2110_0:
2.910247e-03 -2.200084e-03 1.088102e-03 1.566721e-03 5.218949e-03 -1.594784e-03 1.021980e-03 -3.463768e-05 1.831564e-03 5.588389e-03  .. (size = 256, ends with -8.857156e-05);
Result_2110_0:
2.910247e-03 -2.200084e-03 1.088102e-03 1.566721e-03 5.218949e-03 -1.594784e-03 1.021980e-03 -3.463768e-05 1.831564e-03 5.588389e-03  .. (size = 256, ends with -8.857156e-05);
Iteration time 2.293056 ms
Iteration time 2.280224 ms
Iteration time 2.271168 ms
Iteration time 2.279776 ms
Iteration time 2.275872 ms
Summary: [min, max, mean] = [2.271168, 2.293056, 2.296928] ms
```

It seems that the final generated code does not use the MatMul kernel tuned by TVM; the Dot operator falls back to cuBLAS instead:

```cpp
void Dot_float_float_float_cuda_lib_Dot_121(cublasHandle_t cublas_handle, float* input0, float* input1, float* output0)
{
    const float alpha = 1.0;
    const float beta = 0;
    CUBLAS_SAFE_CALL(cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                 256, 1, 256, &alpha,
                                 static_cast<const float*>(input1), 256,
                                 static_cast<const float*>(input0), 256, &beta,
                                 static_cast<float*>(output0), 256));
}
```

[Image: environment configuration screenshot]

nnfbot commented 3 years ago

Thanks for the report @WarmHouse! I will look into it ASAP! (I'm a bot).

xysmlx commented 3 years ago

Hi, NNFusion leverages CUDA library kernels (e.g., cuBLAS, cuDNN) by default. In the artifact, some kernels (e.g., Dot, Convolution) are tuned in TVM or implemented manually, converted to rOperator kernels, and injected into the kernel DB for better performance. See the example in the tutorial, and the documentation on building the kernel DB for the artifact models from pre-tuned kernels.