pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

[inductor][cpu]shufflenet_v2_x1_0 QAT performance regression in 2024-04-20 nightly release #124913

Open · zxd1997066 opened this issue 3 months ago

zxd1997066 commented 3 months ago

🐛 Describe the bug

shufflenet_v2_x1_0 QAT performance regression

| model_name | qat_new | qat_old | qat ratio (new/old) |
| --- | --- | --- | --- |
| shufflenet_v2_x1_0-eval_throughput | 5728.255241029094 | 6910.654523206949 | 0.83 |

SW info

| SW | Nightly commit | Main commit |
| --- | --- | --- |
| Pytorch | 02b1ebb | |
| Torchbench | / | ee35d764 |
| torchaudio | ea437b3 | |
| torchtext | b0ebddc | |
| torchvision | 2c4665f | |
| torchdata | 0790338 | |
| dynamo_benchmarks | nightly | / |

Reference SW info (nightly)

| item | commit |
| --- | --- |
| torchbench | ee35d764 |
| torch | 2.4.0a0+gitd8b31eb |
| torchvision | 0.19.0a0+2c4665f |
| torchtext | 0.16.0a0+b0ebddc |
| torchaudio | 2.2.0a0+ea437b3 |
| torchdata | 0.7.1a0+0790338 |
| dynamo_benchmarks | nightly |
Repro:

```
git clone -b chuanqiw/inductor_quant https://github.com/pytorch/benchmark.git
cd benchmark
pip install --no-deps -r requirements.txt
pip install --no-cache Jinja2==3.1.2 markupsafe==2.0.1 beartype==0.15.0 && pip install mpmath==1.3.0
python install.py --continue_on_fail
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
# QAT
TORCHINDUCTOR_FREEZING=1 python run_benchmark.py cpu -m shufflenet_v2_x1_0 --torchdynamo inductor --quantize --is_qat --launcher --launcher-args="--throughput-mode" -b 128 --metrics throughputs
mv .userbenchmark/cpu qat
cat qat/metric*  # to see the results
```

Suspected guilty commit: https://github.com/pytorch/pytorch/commit/51a56efbb91377237eb1b81d72c7d598ad61b14e

[torchbench-shufflenet_v2_x1_0-inference-qat-performance-drop_guilty_commit.log](https://github.com/pytorch/pytorch/files/15104655/torchbench-shufflenet_v2_x1_0-inference-qat-performance-drop_guilty_commit.log)

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @WeizhuoZhang-intel @chuanqi129
jiayisunx commented 3 months ago

With the guilty commit, the output memory format of the Concat op in the model changed from contiguous to channels_last. This changed the order of the for loops in the generated kernels: the innermost loop became the channel dimension, which causes two issues in this model:

1. In the vectorization pass, the loop tail now needs additional scalar processing, for example:

```cpp
for(long x1=static_cast<long>(0L); x1<static_cast<long>(48L); x1+=static_cast<long>(16L))
{
    auto tmp0 =
    [&]
    {
        __at_align__ std::array<unsigned char, 64> tmpbuf;
        #pragma GCC unroll 16
        for (long x1_inner = 0; x1_inner < 16; x1_inner++)
        {
            tmpbuf[x1_inner] = in_ptr0[static_cast<long>((58L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(2L))) + (116L*x0) + (c10::div_floor_integer((x1 + x1_inner), 2L)))];
        }
        return at::vec::Vectorized<unsigned char>::loadu(tmpbuf.data(), 16);
    }
    ()
    ;
    auto tmp1 = at::vec::convert<float>(tmp0);
    auto tmp2 = static_cast<float>(0.0);
    auto tmp3 = at::vec::Vectorized<float>(tmp2);
    auto tmp4 = tmp1 - tmp3;
    auto tmp5 = static_cast<float>(0.013786641880869865);
    auto tmp6 = at::vec::Vectorized<float>(tmp5);
    auto tmp7 = tmp4 * tmp6;
    tmp7.store(out_ptr0 + static_cast<long>(x1 + (116L*x0)));
}
#pragma omp simd simdlen(8)
for(long x1=static_cast<long>(48L); x1<static_cast<long>(58L); x1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>((58L*(static_cast<long>(x1) % static_cast<long>(2L))) + (116L*x0) + (c10::div_floor_integer(x1, 2L)))];
    auto tmp1 = c10::convert<float>(tmp0);
    auto tmp2 = static_cast<float>(0.0);
    auto tmp3 = decltype(tmp1)(tmp1 - tmp2);
    auto tmp4 = static_cast<float>(0.013786641880869865);
    auto tmp5 = decltype(tmp3)(tmp3 * tmp4);
    out_ptr0[static_cast<long>(x1 + (116L*x0))] = tmp5;
}
```
2. There are more non-contiguous (gather) loads:

```cpp
auto tmp0 =
[&]
{
    __at_align__ std::array<float, 16> tmpbuf;
    #pragma GCC unroll 16
    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
    {
        tmpbuf[x1_inner] = in_ptr2[static_cast<long>(29L + (58L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(2L))) + (116L*x0) + (c10::div_floor_integer((x1 + x1_inner), 2L)))];
    }
    return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
}
()
;
```
leslie-fang-intel commented 3 months ago

We could try masked vectorization for the tail case.