Closed dvrogozh closed 3 months ago
As of pytorch commit https://github.com/pytorch/pytorch/commit/21144ce5704f5d95dff8d28e3a389c798b03afe3, accelerate commit https://github.com/huggingface/accelerate/commit/b7fa2fa956f40e0b6f650d5eb1764680bf3fd8f7 and transformers commit https://github.com/huggingface/transformers/commit/485d913dfbf61a1af4d78aa5038f0007805908cb, with the following PRs applied:
Below are my results from trying out the Hugging Face examples (https://github.com/huggingface/transformers/tree/main/examples/pytorch) with the XPU backend on ATS-M (which at the moment requires `export OverrideDefaultFP64Settings=1 && export IGC_EnableDPEmulation=1`). I tried all the samples except 2: contrastive-image-text and semantic-segmentation.
Overall, the Hugging Face examples can run on the XPU backend, though with low performance at the moment because a range of operations falls back to the CPU. One of the goals here was to identify these ops for future prioritization. The only example which failed due to a missing uAPI is speech-pretraining. See details below.
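As a concrete illustration, the runs above were set up roughly as follows (a sketch; the commented example invocation and its arguments are assumptions, not the literal commands used):

```shell
# ATS-M workarounds for FP64 emulation (see above)
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
# Make PyTorch print a warning whenever an aten op falls back to CPU
export PYTORCH_DEBUG_XPU_FALLBACK=1
# Hypothetical invocation of one of the transformers examples:
# python examples/pytorch/text-classification/run_glue.py \
#     --model_name_or_path bert-base-cased --task_name mrpc --do_train
echo "fallback debug: $PYTORCH_DEBUG_XPU_FALLBACK"
```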
- Some aten operations are not implemented for the XPU backend (functional impact): affects the image-classification, image-detection, translation, token-classification and text-classification examples (identified with `PYTORCH_DEBUG_XPU_FALLBACK=1`); marked `manual` in the table below
- Some aten operations are not implemented for the XPU backend (performance impact) (identified with `PYTORCH_DEBUG_XPU_FALLBACK=1`); marked `explicit` in the table below
- Support for `torch.xpu.<memory>` uAPIs is missing: affects the speech-pretraining example

op | fallback | image-classification | image-detection | translation | token-classification | text-classification | summarization | instance-segmentation | multiple-choice | question-answering |
---|---|---|---|---|---|---|---|---|---|---|
aten::_cdist_forward | explicit | DETR | ||||||||
aten::_foreach_addcdiv.ScalarList | manual | ViT | DETR | OPUS_MT | BERT | MRPC | ||||
aten::_foreach_addcmul.Scalar | manual | ViT | DETR | OPUS_MT | BERT | |||||
aten::_foreach_div.ScalarList | manual | ViT | DETR | OPUS_MT | BERT | MRPC | ||||
aten::_foreach_lerp.Scalar | manual | ViT | DETR | OPUS_MT | BERT | |||||
aten::_foreach_mul.Scalar | manual | ViT | DETR | OPUS_MT | BERT | MRPC | ||||
aten::_foreach_mul.Tensor | manual | ViT | DETR | OPUS_MT | BERT | MRPC | ||||
aten::_foreach_norm.Scalar | manual | ViT | DETR | OPUS_MT | BERT | MRPC | ||||
aten::_foreach_sqrt | manual | ViT | DETR | OPUS_MT | BERT | MRPC | ||||
aten::addcdiv.out | explicit | SWIN | ROBERTA | BERT-2 | ||||||
aten::addcmul.out | explicit | GOOGLE-T5 | SWIN | ROBERTA | BERT-2 | |||||
aten::all.all_out | explicit | DETR | BERT | MRPC | BERT-2 | |||||
aten::floor.out | explicit | SWIN | ||||||||
aten::grid_sampler_2d_backward | explicit | SWIN | ||||||||
aten::lerp.Scalar_out | explicit | GOOGLE-T5 | SWIN | ROBERTA | BERT-2 | |||||
aten::linalg_vector_norm.out | explicit | ViT | OPUS_MT | MRPC | GOOGLE-T5 | SWIN | ROBERTA | BERT-2 | ||
aten::linspace.out | explicit | SWIN | ||||||||
aten::native_batch_norm | explicit | SWIN | ||||||||
aten::native_group_norm_backward | explicit | SWIN | ||||||||
aten::nll_loss2d_backward | manual | DETR | SWIN | |||||||
aten::nll_loss2d_forward | manual | DETR | SWIN | |||||||
aten::max_pool2d_with_indices.out | explicit | DETR | ||||||||
aten::prod.int_out | explicit | SWIN | ||||||||
aten::roll | explicit | SWIN | ||||||||
aten::sgn.out | explicit | DETR | ||||||||
aten::sigmoid.out | explicit | DETR | OPUS_MT | SWIN | ||||||
aten::sigmoid_backward.grad_input | explicit | DETR | SWIN | |||||||
aten::silu.out | explicit | OPUS_MT | ||||||||
aten::topk.values | explicit | SWIN | ||||||||
aten::upsample_bilinear2d.out | explicit | SWIN | ||||||||
aten::upsample_bilinear2d_backward.grad_input | explicit | SWIN | ||||||||
aten::upsample_nearest2d.out | explicit | DETR |
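For completeness, here is a minimal sketch of how a script can opportunistically pick the XPU device while tolerating the fallbacks listed above (the `pick_device` helper is hypothetical, not part of any library; the env-var behavior matches the `PYTORCH_DEBUG_XPU_FALLBACK=1` usage above):

```python
import os

# Ask PyTorch to warn (rather than fail) when an op falls back to CPU;
# this must be set before ops are dispatched.
os.environ.setdefault("PYTORCH_DEBUG_XPU_FALLBACK", "1")

import torch


def pick_device() -> torch.device:
    """Prefer XPU when available, else CUDA, else CPU (hypothetical helper)."""
    # hasattr guard keeps this working on PyTorch builds without torch.xpu
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


device = pick_device()
x = torch.randn(4, 4, device=device)
y = torch.sigmoid(x)  # aten::sigmoid.out is listed above as an explicit CPU fallback
print(device.type, tuple(y.shape))
```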
@dvrogozh Thank you for such an extensive write up, diving into how it affects the library functionality and opening up draft PRs for enabling this ❤️
It's OK if there isn't full coverage of operations - we support the mps backend despite there not being full coverage yet. It's great that you've investigated and we have an idea how much the fallback can slow things down.
Overall, I don't see any reason why this shouldn't be something we enable. Similar to mps, though, it's probably not something we'll test on our side at the moment
cc @ydshieh @muellerzr
Yep agreed :) In terms of which PRs to merge when, we are working towards getting this into accelerate first, then the Trainer
I filed one more issue affecting some (not all) examples and tests: the CUDA path is sometimes wrongly hit on loss.backward():
The XPU backend is a new backend in PyTorch which aims to enable hardware acceleration on Intel GPUs via SYCL. It is being actively worked on at the moment, with the first set of patches landed in PyTorch upstream and support disclosed in the documentation [1]. An initial version should be available starting from PyTorch 2.4, with the 2.5 release as the target point of maturity. The current focus of the effort is on the functional side: identifying and closing API gaps, if any, and populating the set of offloadable aten operations. Some models and scenarios can already be tried out, with the caveat of low performance due to CPU fallbacks on some operations. Overall, [2] outlines the upstreaming process for the XPU backend. Note also some relevant XPU-related issues opened on the PyTorch side [3].
Previously, Intel GPU support in PyTorch was only available via the Intel Extension for PyTorch (IPEX). Effectively, it is this support that is now being upstreamed to stock PyTorch.
Here I would like to request that Hugging Face enable the stock PyTorch XPU backend. Considering that IPEX is already enabled in the Hugging Face repos, it should be fairly trivial to extend that support to cover the XPU backend, since the latter reuses the XPU device and operation naming from the IPEX era.
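To illustrate why the extension is nearly mechanical: code written against the IPEX-era "xpu" device string keeps working with the stock backend, since both expose the device under `torch.xpu` (a hedged sketch; `is_xpu_available` is a hypothetical helper written here for illustration, not a library API):

```python
import torch


def is_xpu_available() -> bool:
    # Hypothetical helper: true for both stock PyTorch >= 2.4 and IPEX,
    # since both expose the device under torch.xpu.
    return hasattr(torch, "xpu") and torch.xpu.is_available()


# The same "xpu" device string works in both worlds, so existing
# IPEX-aware code paths need no renaming.
device = "xpu" if is_xpu_available() else "cpu"
t = torch.ones(2, 2, device=device)
print(device, t.device.type)
```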
I prototyped XPU backend support in Hugging Face. Please check these PRs:
[1] https://github.com/pytorch/pytorch?tab=readme-ov-file#intel-gpu-support
[2] https://github.com/pytorch/pytorch/issues/114842
[3] https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3Aopen+xpu+in%3Atitle
CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 @sywangyi @kding1