Open jsfs2019 opened 3 years ago
Hi, ResNet-50 is a sequential model with no inter-operator parallelism, so NNFusion should perform the same with or without the BlockFusion pass, provided the other configurations (e.g., kernels, enabled passes) are identical.
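To illustrate what "no inter-operator parallelism" means, here is a minimal sketch on toy graphs. It is not NNFusion's IR or the BlockFusion algorithm; the function name `level_widths` and the node names are made up for illustration. It counts how many operators sit at each topological depth: operators at the same depth do not depend on each other, so a width above 1 indicates operators that could in principle run side by side.

```python
from collections import defaultdict


def level_widths(nodes, edges):
    """Return the number of operators at each topological depth of a DAG."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    depth = {}

    def depth_of(node):
        # Depth = length of the longest path from any input to this node.
        if node not in depth:
            depth[node] = 1 + max((depth_of(p) for p in preds[node]), default=-1)
        return depth[node]

    widths = defaultdict(int)
    for node in nodes:
        widths[depth_of(node)] += 1
    return [widths[d] for d in sorted(widths)]


# A purely sequential chain: every level holds exactly one operator.
chain_nodes = ["conv1", "bn1", "relu1", "conv2"]
chain_edges = [("conv1", "bn1"), ("bn1", "relu1"), ("relu1", "conv2")]
print(level_widths(chain_nodes, chain_edges))    # [1, 1, 1, 1]

# A graph with two independent branches: one level holds two operators.
branch_nodes = ["input", "a", "b", "concat"]
branch_edges = [("input", "a"), ("input", "b"), ("a", "concat"), ("b", "concat")]
print(level_widths(branch_nodes, branch_edges))  # [1, 2, 1]
```

With a maximum width of 1, as in the chain, there is nothing for an inter-operator scheduling pass to co-schedule.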
BlockFusion support is the same in the main branch and the osdi20_artifact branch. The native kernel.db in the artifact branch does not contain CUDA kernels for the ResNet-50 model, so NNFusion falls back to default kernels (e.g., cuBLAS for MatMul and cuDNN for Conv2D) in that compilation. The performance gap may be due to the different kernels used in your two compilations.
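One way to check what each kernel database actually contains is to inspect it directly. The sketch below assumes the database is a regular SQLite file. It makes no assumption about NNFusion's schema and only lists whatever tables exist, with their row counts. The helper name `table_row_counts` is hypothetical.

```python
import sqlite3
import sys


def table_row_counts(conn):
    """Return {table_name: row_count} for every table in an open SQLite connection."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Table names come from the database itself, so quoting them is sufficient here.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_db.py path/to/kernel_cache.db
    connection = sqlite3.connect(sys.argv[1])
    for name, count in table_row_counts(connection).items():
        print(f"{name}: {count} rows")
    connection.close()
```

Running it against both the native and the newly generated database gives a first, coarse comparison of how many kernel entries each one holds.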
ENV:
Steps to reproduce the behavior:
Generate the op_configs file of resnet50: resnet50_const_conv_kernels.txt
Run AutoTVM according to the op_configs file of resnet50 to generate the corresponding log and json files of resnet50
Run insert_db.sh to generate a new kernel_cache.db
Run resnet50:
nnfusion resnet50_v1.const_folded.pb -f tensorflow -b nnfusion -m graph -fkernel_fusion_level=3 -fblockfusion_level=1 -fconst_folding_backend=CUDA -fwarmup_step=5 -frun_step=1000 -fkernels_as_files=true -fkernels_files_number=60 -fproduct_name="…-SXM4-40GB" -fbiasadd_fix=true -fpattern_substitution=true
In contrast, we use the native kernel.db to compile ResNet50:
nnfusion resnet50_v1.const_folded.pb -f tensorflow -b nnfusion -m graph -fkernel_fusion_level=3 -fblockfusion_level=1 -fconst_folding_backend=CUDA -fwarmup_step=5 -frun_step=1000 -fkernels_as_files=true -fkernels_files_number=60 -fproduct_name="Tesla V100-PCIE-16GB" -fbiasadd_fix=true -fpattern_substitution=true
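Because the two invocations above are long and nearly identical, it can help to diff their `-f` options mechanically before comparing timings. The following is a minimal sketch. The helper names `parse_flags` and `diff_flags` are hypothetical, and the example commands are shortened stand-ins with placeholder product names, not the exact commands from this issue.

```python
import shlex


def parse_flags(command):
    """Map each -fname=value token to its value; other tokens are ignored."""
    flags = {}
    for token in shlex.split(command):
        if token.startswith("-f") and "=" in token:
            name, value = token[2:].split("=", 1)
            flags[name] = value
    return flags


def diff_flags(cmd_a, cmd_b):
    """Return {flag: (value_in_a, value_in_b)} for flags that differ or are missing."""
    a, b = parse_flags(cmd_a), parse_flags(cmd_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Shortened, hypothetical commands; "GPU-A" and "GPU-B" are placeholders.
cmd_new_db = 'nnfusion model.pb -f tensorflow -fblockfusion_level=1 -fproduct_name="GPU-A"'
cmd_native = 'nnfusion model.pb -f tensorflow -fblockfusion_level=1 -fproduct_name="GPU-B"'
print(diff_flags(cmd_new_db, cmd_native))  # {'product_name': ('GPU-A', 'GPU-B')}
```

Any flag that shows up in the diff is a configuration difference worth ruling out before attributing the gap to the kernel database alone.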
Expected behavior
When compiling with the native kernel.db, the log shows that several Conv2D kernels were skipped in the BlockFusion pass; with the newly generated kernel_cache.db, these abnormal log messages do not appear. We therefore believe that NNFusion does not fully perform BlockFusion scheduling in the native environment, and that the new database provides a more complete environment for BlockFusion. Performance should therefore improve rather than degrade. Could anyone help explain this result? Thanks!