microsoft / nnfusion

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
MIT License

[BUG] Performance is not up to standard. #248

Open ghost opened 3 years ago

ghost commented 3 years ago

🐛 Bug

I tried to reproduce the LSTM model following the tutorial: https://github.com/microsoft/nnfusion/blob/osdi20_artifact/artifacts/get_started_tutorial/README_GET_STARTED.md

```
Result_2110_0:
2.910247e-03 -2.200084e-03 1.088102e-03 1.566721e-03 5.218949e-03 -1.594784e-03 1.021980e-03 -3.463768e-05 1.831564e-03 5.588389e-03  .. (size = 256, ends with -8.857156e-05);
Result_2110_0:
2.910247e-03 -2.200084e-03 1.088102e-03 1.566721e-03 5.218949e-03 -1.594784e-03 1.021980e-03 -3.463768e-05 1.831564e-03 5.588389e-03  .. (size = 256, ends with -8.857156e-05);
Result_2110_0:
2.910247e-03 -2.200084e-03 1.088102e-03 1.566721e-03 5.218949e-03 -1.594784e-03 1.021980e-03 -3.463768e-05 1.831564e-03 5.588389e-03  .. (size = 256, ends with -8.857156e-05);
Result_2110_0:
2.910247e-03 -2.200084e-03 1.088102e-03 1.566721e-03 5.218949e-03 -1.594784e-03 1.021980e-03 -3.463768e-05 1.831564e-03 5.588389e-03  .. (size = 256, ends with -8.857156e-05);
Result_2110_0:
2.910247e-03 -2.200084e-03 1.088102e-03 1.566721e-03 5.218949e-03 -1.594784e-03 1.021980e-03 -3.463768e-05 1.831564e-03 5.588389e-03  .. (size = 256, ends with -8.857156e-05);
Iteration time 2.293056 ms
Iteration time 2.280224 ms
Iteration time 2.271168 ms
Iteration time 2.279776 ms
Iteration time 2.275872 ms
Summary: [min, max, mean] = [2.271168, 2.293056, 2.296928] ms
```

It seems that the final generated code does not use the MatMul kernel tuned by TVM; the Dot operator falls back to cuBLAS instead:

```cpp
void Dot_float_float_float_cuda_lib_Dot_121(cublasHandle_t cublas_handle, float* input0, float* input1, float* output0)
{
    const float alpha = 1.0;
    const float beta = 0;
    CUBLAS_SAFE_CALL(cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                 256, 1, 256, &alpha,
                                 static_cast<const float*>(input1), 256,
                                 static_cast<const float*>(input0), 256, &beta,
                                 static_cast<float*>(output0), 256));
}
```

[Image: environment configuration screenshot]

nnfbot commented 3 years ago

Thanks for the report @WarmHouse! I will look into it ASAP! (I'm a bot).

xysmlx commented 3 years ago

Hi, NNFusion leverages CUDA library kernels (e.g., cuBLAS, cuDNN) by default. In the artifact, some kernels (e.g., Dot, Convolution) are tuned in TVM or implemented manually, converted to rOperator kernels, and injected into the kernel DB for better performance. See the example in the tutorial, and the documentation on building the kernel DB for the artifact models from pre-tuned kernels.