pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

[inductor][cpu]shufflenet_v2_x1_0 QAT performance regression in 2024-04-20 nightly release #124913

Open · zxd1997066 opened this issue 3 months ago

zxd1997066 commented 3 months ago

🐛 Describe the bug

shufflenet_v2_x1_0 QAT performance regression

| model_name | qat_new | qat_old | qat ratio (new/old) |
| --- | --- | --- | --- |
| shufflenet_v2_x1_0-eval_throughput | 5728.255241029094 | 6910.654523206949 | 0.83 |

SW info

| SW | Nightly commit | Main commit |
| --- | --- | --- |
| Pytorch | 02b1ebb | |
| Torchbench | / | ee35d764 |
| torchaudio | ea437b3 | |
| torchtext | b0ebddc | |
| torchvision | 2c4665f | |
| torchdata | 0790338 | |
| dynamo_benchmarks | nightly | / |

Reference SW info (nightly)

| item | commit |
| --- | --- |
| torchbench | ee35d764 |
| torch | 2.4.0a0+gitd8b31eb |
| torchvision | 0.19.0a0+2c4665f |
| torchtext | 0.16.0a0+b0ebddc |
| torchaudio | 2.2.0a0+ea437b3 |
| torchdata | 0.7.1a0+0790338 |
| dynamo_benchmarks | nightly |
Repro:

```
git clone -b chuanqiw/inductor_quant https://github.com/pytorch/benchmark.git
cd benchmark
pip install --no-deps -r requirements.txt
pip install --no-cache Jinja2==3.1.2 markupsafe==2.0.1 beartype==0.15.0 && pip install mpmath==1.3.0
python install.py --continue_on_fail
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
# QAT
TORCHINDUCTOR_FREEZING=1 python run_benchmark.py cpu -m shufflenet_v2_x1_0 --torchdynamo inductor --quantize --is_qat --launcher --launcher-args="--throughput-mode" -b 128 --metrics throughputs
mv .userbenchmark/cpu qat
cat qat/metric*  # to see the results
```

Suspected guilty commit: https://github.com/pytorch/pytorch/commit/51a56efbb91377237eb1b81d72c7d598ad61b14e

[torchbench-shufflenet_v2_x1_0-inference-qat-performance-drop_guilty_commit.log](https://github.com/pytorch/pytorch/files/15104655/torchbench-shufflenet_v2_x1_0-inference-qat-performance-drop_guilty_commit.log)

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @WeizhuoZhang-intel @chuanqi129
jiayisunx commented 3 months ago

With the guilty commit, the output memory format of the Concat op in the model changed from contiguous to channels_last. This changed the order of the for loops in the generated kernels: the innermost loop became the channel dimension, which causes two issues in this model:

1. In the vectorization pass, the loop tail now needs additional scalar processing, for example:

```cpp
for(long x1=static_cast<long>(0L); x1<static_cast<long>(48L); x1+=static_cast<long>(16L))
{
    auto tmp0 =
    [&]
    {
        __at_align__ std::array<unsigned char, 64> tmpbuf;
        #pragma GCC unroll 16
        for (long x1_inner = 0; x1_inner < 16; x1_inner++)
        {
            tmpbuf[x1_inner] = in_ptr0[static_cast<long>((58L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(2L))) + (116L*x0) + (c10::div_floor_integer((x1 + x1_inner), 2L)))];
        }
        return at::vec::Vectorized<unsigned char>::loadu(tmpbuf.data(), 16);
    }
    ()
    ;
    auto tmp1 = at::vec::convert<float>(tmp0);
    auto tmp2 = static_cast<float>(0.0);
    auto tmp3 = at::vec::Vectorized<float>(tmp2);
    auto tmp4 = tmp1 - tmp3;
    auto tmp5 = static_cast<float>(0.013786641880869865);
    auto tmp6 = at::vec::Vectorized<float>(tmp5);
    auto tmp7 = tmp4 * tmp6;
    tmp7.store(out_ptr0 + static_cast<long>(x1 + (116L*x0)));
}
#pragma omp simd simdlen(8)
for(long x1=static_cast<long>(48L); x1<static_cast<long>(58L); x1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>((58L*(static_cast<long>(x1) % static_cast<long>(2L))) + (116L*x0) + (c10::div_floor_integer(x1, 2L)))];
    auto tmp1 = c10::convert<float>(tmp0);
    auto tmp2 = static_cast<float>(0.0);
    auto tmp3 = decltype(tmp1)(tmp1 - tmp2);
    auto tmp4 = static_cast<float>(0.013786641880869865);
    auto tmp5 = decltype(tmp3)(tmp3 * tmp4);
    out_ptr0[static_cast<long>(x1 + (116L*x0))] = tmp5;
}
```
2. There are more non-contiguous (gather) loads:

```cpp
auto tmp0 =
[&]
{
    __at_align__ std::array<float, 16> tmpbuf;
    #pragma GCC unroll 16
    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
    {
        tmpbuf[x1_inner] = in_ptr2[static_cast<long>(29L + (58L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(2L))) + (116L*x0) + (c10::div_floor_integer((x1 + x1_inner), 2L)))];
    }
    return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
}
()
;
```
leslie-fang-intel commented 3 months ago

We could try masked vectorization for the tail case.