intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[Accuracy] Summary for `torchbench` models failed in Inductor accuracy check #438

Status: Closed (ESI-SYD closed this issue 5 months ago)

ESI-SYD commented 7 months ago

Accuracy check results of torchbench models based on triton 3.0.0 (6 test scenarios in total)

Test mode: inference and training
Test datatype: amp_bf16, amp_fp16, float32

This issue can be split into multiple work items

Failed model list

=======================Failed in every scenario=============================
name    accuracy
Background_Matting  infra_error
LearningToPaint infra_error
torchrec_dlrm   infra_error
detectron2_fasterrcnn_r_101_c4  infra_error
detectron2_fasterrcnn_r_101_dc5 infra_error
detectron2_fasterrcnn_r_101_fpn infra_error
detectron2_fasterrcnn_r_50_c4   infra_error
detectron2_fasterrcnn_r_50_dc5  infra_error
detectron2_fasterrcnn_r_50_fpn  infra_error
detectron2_fcos_r_50_fpn    infra_error
detectron2_maskrcnn infra_error
detectron2_maskrcnn_r_101_c4    infra_error
detectron2_maskrcnn_r_101_fpn   infra_error
detectron2_maskrcnn_r_50_c4 infra_error
detectron2_maskrcnn_r_50_fpn    infra_error
doctr_det_predictor infra_error
doctr_reco_predictor    infra_error
moco    infra_error
nvidia_deeprecommender  infra_error
pytorch_CycleGAN_and_pix2pix    infra_error
sam infra_error
tacotron2   infra_error
torch_multimodal_clip   infra_error
yolov3  infra_error
=======================Failed models in amp_bf16=============================
inference:
name    accuracy
DALLE2_pytorch  fail_to_run
cm3leon_generate    fail_to_run
demucs  fail_accuracy
drq fail_accuracy
functorch_maml_omniglot fail_accuracy
hf_BigBird  eager_1st_run_fail
hf_Longformer   eager_1st_run_fail
hf_distil_whisper   infra_error
maml_omniglot   fail_accuracy
nanogpt fail_accuracy
timm_vision_transformer fail_accuracy
vision_maskrcnn fail_to_run

training:
name    accuracy
dlrm    fail_accuracy
drq fail_accuracy
fastNLP_Bert    fail_accuracy
hf_Longformer   fail_to_run
llama   fail_accuracy
phlippe_resnet  fail_accuracy
squeezenet1_1   fail_accuracy
tts_angular fail_accuracy
vision_maskrcnn fail_to_run
=======================Failed models in amp_fp16=============================
inference:
name    accuracy
DALLE2_pytorch  fail_to_run
cm3leon_generate    fail_to_run
demucs  fail_accuracy
functorch_maml_omniglot fail_accuracy
hf_BigBird  eager_1st_run_fail
hf_Longformer   eager_1st_run_fail
hf_distil_whisper   infra_error
maml_omniglot   fail_accuracy
nanogpt fail_accuracy
timm_vision_transformer fail_accuracy
vision_maskrcnn fail_to_run

training:
name    accuracy
Super_SloMo fail_accuracy
dlrm    fail_accuracy
hf_Longformer   fail_to_run
llama   fail_accuracy
phlippe_densenet    fail_accuracy
timm_nfnet  fail_accuracy
vision_maskrcnn fail_to_run
=======================Failed models in float32=============================
inference:
name    accuracy
DALLE2_pytorch  fail_to_run
cm3leon_generate    fail_to_run
functorch_maml_omniglot fail_accuracy
hf_BigBird  eager_1st_run_fail
hf_Longformer   eager_1st_run_fail
hf_T5_base  fail_to_run
hf_distil_whisper   infra_error
maml_omniglot   fail_accuracy
nanogpt fail_accuracy
timm_vision_transformer fail_accuracy

training:
name    accuracy
demucs  fail_accuracy
dlrm    fail_accuracy
functorch_dp_cifar10    fail_accuracy
hf_Longformer   fail_to_run
mobilenet_v2_quantized_qat  fail_to_run
resnet50_quantized_qat  fail_to_run
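
The listing above is a series of whitespace-separated `name status` pairs grouped under `=====` banners and `inference:` / `training:` subheadings. When splitting the issue into work items, it may help to tally statuses per scenario. A minimal Python sketch, assuming exactly that layout; the function name `tally_failures` and the shortened `SAMPLE` text are illustrative and not part of the test script:

```python
from collections import Counter, defaultdict


def tally_failures(report: str) -> dict:
    """Count statuses per (section, mode) in a failure listing."""
    counts = defaultdict(Counter)
    section, mode = None, "all"
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("="):
            # Banner line: strip the '=' padding to get the section title.
            section, mode = line.strip("=").strip(), "all"
            continue
        if line in ("inference:", "training:"):
            mode = line.rstrip(":")
            continue
        parts = line.split()
        # Skip the "name accuracy" column header and anything malformed.
        if len(parts) != 2 or parts == ["name", "accuracy"]:
            continue
        counts[(section, mode)][parts[1]] += 1
    return {key: dict(value) for key, value in counts.items()}


# Shortened excerpt of the float32 section, for illustration only.
SAMPLE = """\
=======================Failed models in float32=============================
inference:
name    accuracy
DALLE2_pytorch  fail_to_run
nanogpt fail_accuracy
hf_BigBird  eager_1st_run_fail

training:
name    accuracy
demucs  fail_accuracy
dlrm    fail_accuracy
resnet50_quantized_qat  fail_to_run
"""

if __name__ == "__main__":
    for key, statuses in tally_failures(SAMPLE).items():
        print(key, statuses)
```

Grouping by status this way separates environment problems (`infra_error`, `eager_1st_run_fail`) from genuine `fail_accuracy` cases.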

Reproduce (replace $dtype, $mode, and $model with real values):

cd /path/to/pytorch
wget -O inductor_xpu_test.sh https://raw.githubusercontent.com/intel/intel-xpu-backend-for-triton/main/.github/scripts/inductor_xpu_test.sh
pip install pandas
bash inductor_xpu_test.sh torchbench $dtype $mode accuracy xpu 0 static 1 0 $model
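
To cover all six scenarios for one model, the command line above can be generated for every dtype and mode combination. A minimal Python sketch: the positional arguments are copied verbatim from the reproduce command (their meaning is not described in this issue), and `build_commands` and the example model are illustrative names only:

```python
from itertools import product

# Values taken from the "Test mode" and "Test datatype" lines of this issue.
DTYPES = ("amp_bf16", "amp_fp16", "float32")
MODES = ("inference", "training")


def build_commands(model: str, script: str = "inductor_xpu_test.sh") -> list:
    """Return one reproduce command per (dtype, mode) scenario."""
    return [
        f"bash {script} torchbench {dtype} {mode} accuracy xpu 0 static 1 0 {model}"
        for dtype, mode in product(DTYPES, MODES)
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for command in build_commands("nanogpt"):
        print(command)
```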

Version:

root@a4bf01946f13:/home# python collect_env.py 
Collecting environment information...
PyTorch version: 2.1.0a0+git8a1575b
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             224
On-line CPU(s) list:                0-223
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8480+
CPU family:                         6
Model:                              143
Thread(s) per core:                 2
Core(s) per socket:                 56
Socket(s):                          2
Stepping:                           8
CPU max MHz:                        3800.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4000.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          5.3 MiB (112 instances)
L1i cache:                          3.5 MiB (112 instances)
L2 cache:                           224 MiB (112 instances)
L3 cache:                           210 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-55,112-167
NUMA node1 CPU(s):                  56-111,168-223
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] bert-pytorch==0.0.1a4
[pip3] clip-anytorch==2.6.0
[pip3] CoCa-pytorch==0.1.0
[pip3] dalle2-pytorch==1.14.2
[pip3] ema-pytorch==0.3.3
[pip3] flake8==7.0.0
[pip3] functorch==1.14.0a0+b71aa0b
[pip3] intel-extension-for-pytorch==2.1.10+git99b4297
[pip3] mypy==1.8.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] onnx==1.15.0
[pip3] open-clip-torch==2.24.0
[pip3] pytorch-warmup==0.1.1
[pip3] rotary-embedding-torch==0.3.3
[pip3] torch==2.1.0a0+git59f7c41
[pip3] torch-fidelity==0.3.0
[pip3] torch_geometric==2.4.0
[pip3] torchaudio==2.2.0a0+02586da
[pip3] torchbench==0.1
[pip3] torchdata==0.7.1
[pip3] torchmetrics==1.0.3
[pip3] torchmultimodal==0.1.0b0
[pip3] torchrec==0.6.0
[pip3] torchtext==0.17.0a0+2c5e344
[pip3] torchvision==0.18.0a0+806dba6
[pip3] triton==3.0.0
[pip3] vector_quantize_pytorch==1.12.17
[conda] bert-pytorch              0.0.1a4                   dev_0    <develop>
[conda] blas                      1.0                         mkl  
[conda] clip-anytorch             2.6.0                    pypi_0    pypi
[conda] coca-pytorch              0.1.0                    pypi_0    pypi
[conda] dalle2-pytorch            1.14.2                   pypi_0    pypi
[conda] ema-pytorch               0.3.3                    pypi_0    pypi
[conda] functorch                 1.14.0a0+b71aa0b          pypi_0    pypi
[conda] intel-extension-for-pytorch 2.1.10+git99b4297          pypi_0    pypi
[conda] mkl                       2023.1.0         h213fc3f_46344  
[conda] mkl-service               2.4.0           py310h5eee18b_1  
[conda] mkl_fft                   1.3.8           py310h5eee18b_0  
[conda] mkl_random                1.2.4           py310hdb19cb5_0  
[conda] numpy                     1.23.5                   pypi_0    pypi
[conda] open-clip-torch           2.24.0                   pypi_0    pypi
[conda] pytorch-warmup            0.1.1                    pypi_0    pypi
[conda] rotary-embedding-torch    0.3.3                    pypi_0    pypi
[conda] torch                     2.1.0a0+git59f7c41          pypi_0    pypi
[conda] torch-fidelity            0.3.0                    pypi_0    pypi
[conda] torch-geometric           2.4.0                    pypi_0    pypi
[conda] torchaudio                2.2.0a0+02586da          pypi_0    pypi
[conda] torchbench                0.1                       dev_0    <develop>
[conda] torchdata                 0.7.1                    pypi_0    pypi
[conda] torchmetrics              1.0.3                    pypi_0    pypi
[conda] torchmultimodal           0.1.0b0                  pypi_0    pypi
[conda] torchrec                  0.6.0                    pypi_0    pypi
[conda] torchtext                 0.17.0a0+2c5e344          pypi_0    pypi
[conda] torchvision               0.18.0a0+806dba6          pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
[conda] vector-quantize-pytorch   1.12.17                  pypi_0    pypi

triton: https://github.com/intel/intel-xpu-backend-for-triton/commit/97ac4f91d149a3392d6e14f5d39aa4953fb6c56e

alexbaden commented 7 months ago

See child issues

whitneywhtsang commented 7 months ago

The pinned commit of torchvision (47cd5ea8e21d7596a24907710411d6b4a43f628d, https://github.com/Stonepia/pytorch/blob/dev/triton-test-3.0/.github/ci_commit_pins/vision.txt) cannot be built successfully with the latest ffmpeg, because several deprecated features have been removed, including the flags AV_CODEC_CAP_TRUNCATED, AV_CODEC_CAP_AUTO_THREADS, AV_CODEC_CAP_INTRA_ONLY, AV_CODEC_CAP_LOSSLESS, and AVFMT_FLAG_PRIV_OPT.

/home/jovyan/vision/torchvision/csrc/io/decoder/stream.cpp: In member function ‘int ffmpeg::Stream::openCodec(std::vector<ffmpeg::DecoderMetadata>*, int)’:
/home/jovyan/vision/torchvision/csrc/io/decoder/stream.cpp:68:42: error: ‘AV_CODEC_CAP_INTRA_ONLY’ was not declared in this scope; did you mean ‘AV_CODEC_PROP_INTRA_ONLY’?
   68 |     if (codecCtx_->codec->capabilities & AV_CODEC_CAP_INTRA_ONLY) {
      |                                          ^~~~~~~~~~~~~~~~~~~~~~~
      |                                          AV_CODEC_PROP_INTRA_ONLY

conda install -c conda-forge 'ffmpeg<4.4' can be used to downgrade ffmpeg.
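
Before building torchvision, it may be worth checking whether the installed ffmpeg is already below the `4.4` bound used in the conda command. A minimal Python sketch, assuming the first line printed by `ffmpeg -version` starts with `ffmpeg version X.Y`; snapshot builds print other strings, in which case the parser returns `None`. The helper names are illustrative:

```python
import re


def parse_ffmpeg_version(banner: str):
    """Extract (major, minor) from an 'ffmpeg version X.Y...' banner, or None."""
    match = re.search(r"ffmpeg version n?(\d+)\.(\d+)", banner)
    return (int(match.group(1)), int(match.group(2))) if match else None


def needs_downgrade(banner: str, limit=(4, 4)) -> bool:
    """True when the parsed version is at or above the 'ffmpeg<4.4' bound."""
    version = parse_ffmpeg_version(banner)
    return version is not None and version >= limit


if __name__ == "__main__":
    import shutil
    import subprocess

    if shutil.which("ffmpeg") is None:
        print("ffmpeg not found on PATH")
    else:
        output = subprocess.run(
            ["ffmpeg", "-version"], capture_output=True, text=True
        ).stdout
        first_line = output.splitlines()[0] if output else ""
        print("downgrade needed:", needs_downgrade(first_line))
```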

whitneywhtsang commented 7 months ago

dlrm passes with the setup below:

Stonepia/pytorch    dev/triton-test-3.0 0f6d72ce16bd4b30402dcad97144d17cd7bc53ed
weishi-deng/benchmark   9371b9e13c826f3930e54346b4d619cb59182f68    
intel/intel-xpu-backend-for-triton  b6d3678483dbffa58f0470a46c0b512f223aabda    
intel-extension-for-pytorch 2.1.10+git99b4297       
torch                       2.1.0a0+git0f6d72c      
torchaudio                  2.0.0a0+a8f4e97     
torchtext                   0.16.0a0+b0ebddc        
torchvision                 0.18.0a0+a52607e

whitneywhtsang commented 7 months ago

To resolve the infra_error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory`, install the missing library:

sudo apt install libgl1-mesa-glx
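
To confirm that the library is visible to the dynamic loader before rerunning the benchmark, a quick check with Python's standard `ctypes` module can be used. This is a minimal sketch; `can_load_shared_library` is an illustrative helper name:

```python
import ctypes


def can_load_shared_library(soname: str) -> bool:
    """Return True if the dynamic loader can open the given shared object."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the file is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    print("libGL.so.1 loadable:", can_load_shared_library("libGL.so.1"))
```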

To resolve the error `TypeError: can't convert xpu:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.`, modify torchbenchmark/models/LearningToPaint/baseline/utils/util.py as follows:

-USE_CUDA = torch.cuda.is_available()
+USE_CUDA = torch.cuda.is_available() or torch.xpu.is_available()

With the above changes and setup described in https://github.com/intel/intel-xpu-backend-for-triton/issues/438#issuecomment-1937102129, Background_Matting and LearningToPaint can both pass.
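
The `USE_CUDA` change above calls `torch.xpu.is_available()` directly. A slightly more defensive variant is sketched below in Python, under the assumption that some PyTorch builds may not expose a `torch.xpu` attribute at all (for example when the XPU extension has not been imported). The helper `accelerator_available` is hypothetical and is not part of torchbench or PyTorch; it takes the module as an argument so it can be exercised without a GPU:

```python
from types import SimpleNamespace


def accelerator_available(torch_module) -> bool:
    """True if either the cuda or the xpu backend reports availability.

    Backends that are missing from the module are treated as unavailable
    instead of raising AttributeError.
    """
    for name in ("cuda", "xpu"):
        backend = getattr(torch_module, name, None)
        if backend is not None and backend.is_available():
            return True
    return False


if __name__ == "__main__":
    # Stand-in objects that mimic the shape of torch.cuda / torch.xpu.
    fake_torch = SimpleNamespace(
        cuda=SimpleNamespace(is_available=lambda: False),
        xpu=SimpleNamespace(is_available=lambda: True),
    )
    print(accelerator_available(fake_torch))
```

In util.py this would be used as `USE_CUDA = accelerator_available(torch)`, keeping the rest of the file unchanged.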

whitneywhtsang commented 7 months ago

nvidia_deeprecommender, pytorch_CycleGAN_and_pix2pix, torch_multimodal_clip and yolov3 pass with https://github.com/weishi-deng/benchmark/commit/02e383463fa954c49db2e8983e2c6441afc2ca5a.

etiotto commented 7 months ago

@whitneywhtsang so far it looks like you have found only benchmark or environment problems (for this benchmark). Correct?

whitneywhtsang commented 7 months ago

> @whitneywhtsang so far it looks like you have found only benchmark or environment problems (for this benchmark). Correct?

Correct, and no regressions were found compared to my v2.1 run.

whitneywhtsang commented 6 months ago

@vlad-penkin There are no regressions. Can we close this issue?