microsoft / nnfusion

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.
MIT License

No performance improvement with NNFusion on ResNet50 #327

Open jsfs2019 opened 3 years ago

jsfs2019 commented 3 years ago

ENV:

Steps to reproduce the behavior:

  1. Generate the op_configs file of resnet50: resnet50_const_conv_kernels.txt

  2. Run AutoTVM according to the op_configs file of resnet50 to generate the corresponding log and JSON files for ResNet-50

  3. Run insert_db.sh to generate a new kernel_cache.db (see the verification sketch after this list)

  4. Run ResNet-50 with the new kernel_cache.db:

    1. Command: nnfusion resnet50_v1.const_folded.pb -f tensorflow -b nnfusion -m graph -fkernel_fusion_level=3 -fblockfusion_level=1 -fconst_folding_backend=CUDA -fwarmup_step=5 -frun_step=1000 -fkernels_as_files=true -fkernels_files_number=60 -fproduct_name="A100-SXM4-40GB" -fbiasadd_fix=true -fpattern_substitution=true
    2. We get infer time: Summary: [min, max, mean] = [3.184192, 8.624928, 3.312613] ms
  5. In contrast, we use the native kernel.db to compile ResNet50:

    1. Command: nnfusion resnet50_v1.const_folded.pb -f tensorflow -b nnfusion -m graph -fkernel_fusion_level=3 -fblockfusion_level=1 -fconst_folding_backend=CUDA -fwarmup_step=5 -frun_step=1000 -fkernels_as_files=true -fkernels_files_number=60 -fproduct_name="Tesla V100-PCIE-16GB" -fbiasadd_fix=true -fpattern_substitution=true
    2. Update CMakeLists.txt: add -gencode arch=compute_80,code=sm_80 to support running on the Ampere architecture.
    3. We get Infer time: Summary: [min, max, mean] = [2.643968, 6.354400, 2.747237] ms
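
A quick way to sanity-check step 3 is to inspect what actually landed in the regenerated kernel_cache.db before compiling. A minimal sketch using the sqlite3 CLI; the cache path and the table name below are assumptions and may differ in your install:

```bash
# Sketch only: cache path and table name are assumptions.
DB="$HOME/.cache/nnfusion/kernel_cache.db"   # assumed default cache location

# List the tables the database actually contains.
sqlite3 "$DB" ".tables"

# Count the cached kernel entries ('KernelCache' is an assumed table name;
# substitute whatever ".tables" reports).
sqlite3 "$DB" "SELECT COUNT(*) FROM KernelCache;"

# Spot-check a few rows to confirm the tuned conv2d kernels were inserted.
sqlite3 "$DB" "SELECT * FROM KernelCache LIMIT 5;"
```

If the tuned Conv2D kernels are not present here, NNFusion falls back to its default kernels during compilation (as noted in the reply below).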

Expected behavior

When running with the native kernel.db, the log shows that several conv2d kernels were skipped in the BlockFusion pass, while with the newly generated kernel_cache.db these messages no longer appear. We believe that NNFusion does not fully perform BlockFusion scheduling in the native environment, and that the new kernel_cache.db gives BlockFusion a more complete set of kernels to work with. Performance should therefore improve rather than degrade. Could anyone help explain this result? Thanks!
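
In case it helps with diagnosis, profiling the two generated runtimes kernel by kernel should show where the extra ~0.6 ms goes. A rough sketch, assuming the default cuda_codegen output layout and the main_test binary NNFusion generates (adjust paths if your setup differs):

```bash
# Build the generated project and profile it kernel by kernel.
# Paths assume the default cuda_codegen output layout.
cd nnfusion_rt/cuda_codegen
cmake . && make -j

# Either profiler gives a per-kernel time table to diff between the two builds.
nsys profile --stats=true ./main_test
# or: nvprof ./main_test
```

Comparing the per-kernel tables from the two builds should show whether the regression comes from the Conv2D kernels themselves or from a different fusion/scheduling decision.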

xysmlx commented 3 years ago

Hi, ResNet-50 is a sequential model with no inter-operator parallelism, so NNFusion with or without the BlockFusion pass will have the same performance as long as the other configurations (e.g., kernels, enabled passes) are the same.
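
You can verify this on your side by recompiling with BlockFusion disabled and comparing the numbers. A sketch reusing the flags from your step-4 command (level 0 is assumed to disable the pass):

```bash
# Same flags as the step-4 command, but with BlockFusion turned off
# (level 0 is assumed to disable the pass).
nnfusion resnet50_v1.const_folded.pb -f tensorflow -b nnfusion -m graph \
    -fkernel_fusion_level=3 -fblockfusion_level=0 \
    -fconst_folding_backend=CUDA -fwarmup_step=5 -frun_step=1000 \
    -fbiasadd_fix=true -fpattern_substitution=true
```

If the mean latency stays around 3.3 ms, the gap is not coming from BlockFusion.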

BlockFusion pass support is the same in the main branch and the osdi20_artifact branch. The native kernel.db in the artifact branch does not contain CUDA kernels for the ResNet-50 model, so NNFusion falls back to the default kernels (e.g., cuBLAS for MatMul and cuDNN for Conv2D) in that compilation. The performance gap is most likely due to the different kernels used in your two compilations.
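
To confirm which kernels each compilation actually picked, you can grep the generated sources for library calls. A minimal sketch, assuming the default cuda_codegen output directory (with -fkernels_as_files=true the kernels are split across the generated kernel files):

```bash
# Count library-call sites in the generated CUDA code for each build.
cd nnfusion_rt/cuda_codegen
grep -r "cudnnConvolutionForward" . | wc -l   # Conv2D via cuDNN
grep -r "cublasSgemm" . | wc -l               # MatMul via cuBLAS
```

If both builds end up calling cuDNN for Conv2D, the gap comes from elsewhere; if only one does, the regression most likely means the tuned AutoTVM kernels are slower than cuDNN for these shapes on your GPU.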