intel / torch-xpu-ops


E2E test XPU out of memory #701

Open mengfei25 opened 1 month ago

mengfei25 commented 1 month ago

šŸ› Describe the bug

Out of memory in weekly test, https://github.com/intel/torch-xpu-ops/actions/runs/10218591763

Model list:

RuntimeError: XPU out of memory

| Suite | Dtype | Mode | Scenario | Model |
| --- | --- | --- | --- | --- |
| huggingface | amp_bf16 | inference | accuracy | GPTJForCausalLM |
| huggingface | amp_bf16 | inference | accuracy | GPTJForQuestionAnswering |
| huggingface | amp_bf16 | inference | performance | GPTJForQuestionAnswering |
| huggingface | amp_bf16 | training | accuracy | GPTJForCausalLM |
| huggingface | amp_bf16 | training | accuracy | GPTJForQuestionAnswering |
| huggingface | amp_bf16 | training | performance | GPTJForQuestionAnswering |
| huggingface | amp_fp16 | inference | accuracy | GPTJForQuestionAnswering |
| huggingface | amp_fp16 | inference | accuracy | GPTJForCausalLM |
| huggingface | amp_fp16 | inference | performance | GPTJForQuestionAnswering |
| huggingface | amp_fp16 | training | accuracy | GPTJForCausalLM |
| huggingface | amp_fp16 | training | accuracy | GPTJForQuestionAnswering |
| huggingface | amp_fp16 | training | performance | GPTJForQuestionAnswering |
| huggingface | bfloat16 | training | accuracy | GPTJForCausalLM |
| huggingface | bfloat16 | training | accuracy | GPTJForQuestionAnswering |
| huggingface | float16 | training | accuracy | GPTJForCausalLM |
| huggingface | float16 | training | accuracy | GPTJForQuestionAnswering |
| torchbench | amp_bf16 | inference | accuracy | hf_T5_base |
| torchbench | amp_bf16 | inference | accuracy | stable_diffusion_unet |
| torchbench | amp_bf16 | inference | accuracy | llava |
| torchbench | amp_bf16 | inference | performance | llava |
| torchbench | amp_bf16 | inference | performance | hf_distil_whisper |
| torchbench | amp_bf16 | inference | performance | stable_diffusion_unet |
| torchbench | amp_bf16 | training | accuracy | stable_diffusion_unet |
| torchbench | amp_bf16 | training | accuracy | llava |
| torchbench | amp_bf16 | training | performance | stable_diffusion_unet |
| torchbench | amp_fp16 | inference | accuracy | stable_diffusion_unet |
| torchbench | amp_fp16 | inference | accuracy | hf_T5_base |
| torchbench | amp_fp16 | inference | accuracy | llava |
| torchbench | amp_fp16 | inference | performance | hf_distil_whisper |
| torchbench | amp_fp16 | inference | performance | stable_diffusion_unet |
| torchbench | amp_fp16 | inference | performance | llava |
| torchbench | amp_fp16 | training | accuracy | stable_diffusion_unet |
| torchbench | amp_fp16 | training | accuracy | llava |
| torchbench | amp_fp16 | training | performance | stable_diffusion_unet |
| torchbench | bfloat16 | inference | accuracy | hf_T5_base |
| torchbench | bfloat16 | inference | accuracy | llava |
| torchbench | bfloat16 | inference | performance | llava |
| torchbench | bfloat16 | training | accuracy | llava |
| torchbench | bfloat16 | training | accuracy | stable_diffusion_unet |
| torchbench | bfloat16 | training | performance | stable_diffusion_unet |
| torchbench | float16 | inference | accuracy | llava |
| torchbench | float16 | inference | accuracy | hf_T5_base |
| torchbench | float16 | inference | performance | llava |
| torchbench | float16 | training | accuracy | stable_diffusion_unet |
| torchbench | float16 | training | accuracy | llava |
| torchbench | float16 | training | performance | stable_diffusion_unet |
| torchbench | float32 | inference | accuracy | stable_diffusion_unet |
| torchbench | float32 | inference | accuracy | llava |
| torchbench | float32 | inference | performance | hf_distil_whisper |
| torchbench | float32 | inference | performance | stable_diffusion_unet |
| torchbench | float32 | inference | performance | llava |
| torchbench | float32 | training | accuracy | stable_diffusion_unet |
| torchbench | float32 | training | accuracy | llava |
| torchbench | float32 | training | performance | stable_diffusion_unet |
RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)

| Suite | Dtype | Mode | Scenario | Model |
| --- | --- | --- | --- | --- |
| huggingface | amp_bf16 | inference | performance | GPTJForCausalLM |
| huggingface | amp_bf16 | training | accuracy | BlenderbotForConditionalGeneration |
| huggingface | amp_bf16 | training | performance | GPTJForCausalLM |
| huggingface | amp_fp16 | inference | performance | GPTJForCausalLM |
| huggingface | amp_fp16 | training | accuracy | BlenderbotForConditionalGeneration |
| huggingface | amp_fp16 | training | performance | GPTJForCausalLM |
| huggingface | float32 | training | accuracy | BlenderbotForConditionalGeneration |
| huggingface | float32 | training | accuracy | GPTJForCausalLM |
| huggingface | float32 | training | accuracy | GPTJForQuestionAnswering |
| huggingface | float32 | training | performance | GPTJForCausalLM |
| huggingface | float32 | training | performance | GPTJForQuestionAnswering |
| torchbench | bfloat16 | inference | performance | hf_distil_whisper |
| torchbench | float16 | inference | performance | hf_distil_whisper |
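Note that the two signatures above differ: `RuntimeError: XPU out of memory` is raised by PyTorch's XPU caching allocator, while `PI_ERROR_OUT_OF_RESOURCES` bubbles up from the underlying SYCL runtime when an allocation or kernel launch exhausts device resources. A minimal sketch for capturing the peak allocation of one benchmark step, assuming a PyTorch build where `torch.xpu` mirrors the `torch.cuda` memory-introspection helpers (that mirroring, and the helper name below, are assumptions):

```python
import torch

def run_with_memory_report(step_fn, device_type: str = "xpu"):
    """Run one step and print peak device memory (hypothetical helper).

    Assumes torch.xpu exposes the same memory helpers as torch.cuda
    (empty_cache / reset_peak_memory_stats / max_memory_allocated).
    """
    backend = torch.xpu if device_type == "xpu" else torch.cuda
    backend.empty_cache()
    backend.reset_peak_memory_stats()
    try:
        step_fn()
    except RuntimeError as err:
        # Both the allocator OOM and PI_ERROR_OUT_OF_RESOURCES surface
        # as RuntimeError; the message text tells the two apart.
        print(f"[{device_type}] failed: {err}")
    finally:
        peak_mib = backend.max_memory_allocated() / (1024 ** 2)
        print(f"[{device_type}] peak allocated: {peak_mib:.0f} MiB")
```

Logging the peak next to each failing row would show which entries sit just at the device limit and which blow past it.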

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1d70431c072db889d9a47ea4956049fe340a426d
pytorch: d224857b3af5c9d5a3c7a48401475c09d90db296
device: PVC 1100
bundle: 0.5.3
driver: 803.61
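For local triage it may be easier to re-run a single cell from the tables above instead of the whole weekly job, via the upstream dynamo benchmark runners in the PyTorch checkout. A sketch, with the caveat that the exact flag spelling (in particular how CI maps the `amp_bf16` column onto `--amp`/`--amp-dtype`) is an assumption to verify against `--help` on this branch:

```python
# Reproduce one failing combination (huggingface / amp_bf16 / inference /
# accuracy / GPTJForCausalLM) via the dynamo huggingface runner.
import subprocess

cmd = [
    "python", "benchmarks/dynamo/huggingface.py",
    "--accuracy",                 # scenario column (or --performance)
    "--inference",                # mode column (or --training)
    "--amp",                      # amp_bf16 column; may need an explicit
                                  # --amp-dtype bfloat16 on some branches
    "--inductor",                 # compile with inductor, as in the CI runs
    "--device", "xpu",
    "--only", "GPTJForCausalLM",  # model column
]
subprocess.run(cmd, check=True)
```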

mengfei25 commented 1 month ago

Looks like hf_distil_whisper is a regression:

./torchbench/amp_bf16/inductor_torchbench_amp_bf16_inference_xpu_performance_all.log: xpu eval hf_distil_whisper running benchmark: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 10/10 [00:00<00:00, 10.61it/s]
./torchbench/amp_bf16/inductor_torchbench_amp_bf16_inference_xpu_performance_all.log: 1.656x

pytorch: https://github.com/pytorch/pytorch/commit/dadc0ed
torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/45e55a3

chuanqi129 commented 3 weeks ago

> Looks like hf_distil_whisper is a regression:
>
> ./torchbench/amp_bf16/inductor_torchbench_amp_bf16_inference_xpu_performance_all.log: xpu eval hf_distil_whisper running benchmark: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 10/10 [00:00<00:00, 10.61it/s]
> ./torchbench/amp_bf16/inductor_torchbench_amp_bf16_inference_xpu_performance_all.log: 1.656x
>
> pytorch: pytorch/pytorch@dadc0ed
> torch-xpu-ops: 45e55a3

Hi @retonym, this is a regression issue, can we double-check it?

weishi-deng commented 3 weeks ago

I re-collected the model test with pytorch/pytorch@dadc0ed and torch-xpu-ops 45e55a3 on my local PVC 1100, and the issue still exists. Besides, this model also fails with out-of-memory on the CUDA backend.
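Since the failure reproduces on CUDA as well, it points at the model and batch configuration rather than at the XPU backend specifically. One way to confirm, under the same flag-name assumptions as the sketch earlier in this thread, is to run the identical invocation on both backends and compare:

```python
# Run the hf_distil_whisper inference performance case on both backends;
# an OOM on both supports the conclusion that the workload itself is too
# large for either device at this batch size.
import subprocess

for device in ("xpu", "cuda"):
    cmd = [
        "python", "benchmarks/dynamo/torchbench.py",
        "--performance", "--inference", "--amp", "--inductor",
        "--device", device,
        "--only", "hf_distil_whisper",
    ]
    result = subprocess.run(cmd)
    print(f"{device}: exit code {result.returncode}")
```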