intel / torch-xpu-ops

Apache License 2.0

[Tracker][Windows] OneDNN upgrade introduces new failures when testing UT and E2E (huggingface models) #1080

Closed libohao1201 closed 2 days ago

libohao1201 commented 1 week ago

🐛 Describe the bug

@Stonepia Details of new failures are as follows:

| E2E Cases | Error Type | oneDNN (Failed model) / Baseline (Failed model) |
| --- | --- | --- |
| hf_n10_inference_eager_fp32 | Native API returns: -999 (Unknown PI error) | BlenderbotForCausalLM DebertaV2ForMaskedLM DebertaV2ForQuestionAnswering MBartForConditionalGeneration BlenderbotForCausalLM DebertaV2ForMaskedLM DebertaV2ForQuestionAnswering |
| | eager_two_runs_differ | PegasusForConditionalGeneration RobertaForQuestionAnswering |
| | Quit without pass | MBartForConditionalGeneration |
| hf_n10_inference_eager_fp16 | eager_two_runs_differ | CamemBert PegasusForConditionalGeneration |
| hf_n10_inference_eager_bf16 | eager_two_runs_differ | ConditionalGeneration M2M100ForConditionalGeneration MBartForConditionalGeneration XGLMForCausalLM |
| hf_n10_train_eager_fp32 | Native API returns: -999 (Unknown PI error) | BartForConditionalGeneration BlenderbotForCausalLM DebertaV2ForMaskedLM DebertaV2ForQuestionAnswering MBartForConditionalGeneration OPTForCausalLM XGLMForCausalLM BartForConditionalGeneration BlenderbotForCausalLM DebertaV2ForMaskedLM MBartForConditionalGeneration OPTForCausalLM XGLMForCausalLM |
| | Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) | DebertaV2ForQuestionAnswering |
| | eager_two_runs_differ | BertForMaskedLM BlenderbotSmallForConditionalGeneration |
| hf_n10_train_eager_fp16 | eager_two_runs_differ | MegatronBertForCausalLM PLBartForConditionalGeneration |
| hf_n10_train_eager_bf16 | eager_two_runs_differ | AllenaiLongformerBase MegatronBertForQuestionAnswering OPTForCausalLM |
| | fail_accuracy | M2M100ForConditionalGeneration BartForConditionalGeneration M2M100ForConditionalGeneration |
| | Quit without pass | XGLMForCausalLM |
| hf_n10_train_eager_amp_fp16 | eager_two_runs_differ | LayoutLMForSequenceClassification MBartForCausalLM MegatronBertForCausalLM PLBartForCausalLM |
| | eager_1st_run_fail | AlbertForMaskedLM |
| | eager_2nd_run_fail | PegasusForConditionalGeneration XLNetLMHeadModel MegatronBertForCausalLM PegasusForConditionalGeneration |
| | fail_accuracy | MT5ForConditionalGeneration MT5ForConditionalGeneration MegatronBertForQuestionAnswering |
| | Quit without pass | AlbertForQuestionAnswering AlbertForMaskedLM AlbertForQuestionAnswering XLNetLMHeadModel |
| hf_n10_train_eager_amp_bf16 | eager_two_runs_differ | BartForCausalLM BertForQuestionAnswering BlenderbotSmallForCausalLM GoogleFnet MobileBertForMaskedLM |
| | eager_2nd_run_fail | PegasusForConditionalGeneration MegatronBertForCausalLM |
| | fail_accuracy | MBartForCausalLM MT5ForConditionalGeneration MegatronBertForCausalLM MegatronBertForQuestionAnswering MBartForCausalLM MT5ForConditionalGeneration |
| | Quit without pass | AlbertForMaskedLM AlbertForQuestionAnswering M2M100ForConditionalGeneration XLNetLMHeadModel AlbertForMaskedLM AlbertForQuestionAnswering M2M100ForConditionalGeneration PegasusForConditionalGeneration XLNetLMHeadModel |

UT: xpu\run_test_with_skip
OneDNN New Failures:
FAILED test_autograd_xpu.py::TestAutograd::test_multi_grad_all_hooks - subpro...
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_addcdiv_fastpath_inplace_xpu_float16
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_addcdiv_fastpath_inplace_xpu_float32
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_addcdiv_slowpath_outplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_addcmul_fastpath_inplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_addcmul_fastpath_outplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_add_fastpath_outplace_xpu_int64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_add_slowpath_outplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_add_slowpath_outplace_xpu_int64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_div_fastpath_outplace_xpu_float32
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_expm1_slowpath_inplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_lerp_fastpath_outplace_xpu_float32
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_lerp_slowpath_inplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_lerp_slowpath_outplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_maximum_slowpath_inplace_xpu_int64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_mul_fastpath_inplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_mul_fastpath_outplace_xpu_int64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_mul_slowpath_outplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_sub_slowpath_inplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_xpu_float32
FAILED test_foreach_xpu.py::TestForeachXPU::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_xpu_float32
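
For triage, it can help to group the foreach failures by op and dtype rather than reading the list line by line. The following is a minimal sketch, not part of the original report: the regular expression assumes the test names follow the `...__foreach_<op>_<variant>_xpu_<dtype>` pattern visible in the list, and `SUMMARY`, `FOREACH_RE`, and `group_foreach_failures` are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# A handful of lines copied from the failure list above; paste the full
# pytest short summary here to analyze every failure.
SUMMARY = """\
FAILED test_autograd_xpu.py::TestAutograd::test_multi_grad_all_hooks - subpro...
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_addcdiv_fastpath_inplace_xpu_float16
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_addcdiv_fastpath_inplace_xpu_float32
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_add_slowpath_outplace_xpu_int64
FAILED test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_lerp_slowpath_inplace_xpu_complex64
FAILED test_foreach_xpu.py::TestForeachXPU::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_xpu_float32
"""

# Assumed naming scheme: "...__foreach_<op>_<variant>_xpu_<dtype>".
FOREACH_RE = re.compile(
    r"^FAILED \S+::\S*?__foreach_(?P<op>[a-z0-9]+)_\S*_xpu_(?P<dtype>\w+)$"
)


def group_foreach_failures(summary):
    """Count foreach failures per op and per dtype; collect unmatched lines."""
    by_op, by_dtype, other = Counter(), Counter(), []
    for line in summary.splitlines():
        line = line.strip()
        if not line.startswith("FAILED "):
            continue  # ignore anything that is not a failure line
        match = FOREACH_RE.match(line)
        if match is None:
            other.append(line)  # e.g. the autograd failure
            continue
        by_op[match["op"]] += 1
        by_dtype[match["dtype"]] += 1
    return by_op, by_dtype, other


by_op, by_dtype, other = group_foreach_failures(SUMMARY)
print("per op:   ", by_op.most_common())
print("per dtype:", by_dtype.most_common())
print("unmatched:", other)
```

Lines that do not fit the assumed pattern are kept in `other` rather than dropped, so a failure outside `test_foreach_xpu.py` is still visible after grouping.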

Versions

| Config | OneDNN acceptance test | Baseline |
| --- | --- | --- |
| device | MTL | MTL |
| torch | https://github.com/pytorch/pytorch/tree/gh/yanbing-j/28/orig commit: 4797f800ae337db5a10a0c1986b2e87293362224 | https://github.com/pytorch/pytorch commit: 6ad52db8c8d4704a545f9b4b4743f251c0ae2e8c |
| torch-xpu-ops | pinned - bdbda358bc3adc9c4b26aa1ec4003f9eb5d1685e | pinned - bdbda358bc3adc9c4b26aa1ec4003f9eb5d1685e |
| driver | 32.0.101.6130 | 32.0.101.6130 |
| bundle | 0.5.3.37 | 0.5.3.37 |
| os | Windows 11 Enterprise | Windows 11 Enterprise |

Stonepia commented 1 week ago

This issue will serve as a tracker.

@gaopengff, @LuFinch, and @Stonepia will start with the E2E models. @PenghuiCheng will work on the UT failures.

Stonepia commented 2 days ago

After triage, these failures do not appear to be related to the oneDNN upgrade; there is no regression. We will track them in other threads. Closing this issue.
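
One way to sanity-check a "no regression" conclusion is to compare the failing models of the two runs both per error type and overall. The sketch below is illustrative only and is not the triage method used in this thread. The data is taken from the `hf_n10_inference_eager_fp32` rows, but the split of each flattened row into its oneDNN and Baseline cells is an assumption, and `failing_models`, `onednn`, and `baseline` are hypothetical names.

```python
def failing_models(run):
    """Return every model that fails in a run, regardless of error type."""
    return set().union(*run.values())


# ASSUMED reading of the hf_n10_inference_eager_fp32 rows; the original
# table columns were flattened, so this split is a guess for illustration.
onednn = {
    "Native API returns: -999 (Unknown PI error)": [
        "BlenderbotForCausalLM",
        "DebertaV2ForMaskedLM",
        "DebertaV2ForQuestionAnswering",
        "MBartForConditionalGeneration",
    ],
}
baseline = {
    "Native API returns: -999 (Unknown PI error)": [
        "BlenderbotForCausalLM",
        "DebertaV2ForMaskedLM",
        "DebertaV2ForQuestionAnswering",
    ],
    "Quit without pass": ["MBartForConditionalGeneration"],
}

# Models that are new under a given error type in the oneDNN run.
new_by_type = {
    error: sorted(set(models) - set(baseline.get(error, [])))
    for error, models in onednn.items()
}

# Models that fail with oneDNN but pass entirely in the baseline.
new_overall = sorted(failing_models(onednn) - failing_models(baseline))

print("new per error type:", new_by_type)
print("new overall:       ", new_overall)
```

Under this assumed split, a model can look new for one error type while already failing in the baseline with a different error, which is why the overall comparison is the one that indicates a regression.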