yeonbok opened 1 week ago
@yeonbok thank you for your PR, but I'm not sure that it is right place to do that:
@eaidova Yes, it might be beneficial for CPU as well (since internally we convert i64 to i32 anyway). I would say it sounds very logical to have such logic aligned across all devices.
@eaidova Thanks for the review, and @dmitry-gorokhov thanks for the input on the CPU plugin. It sounds like this can be applied generally, so I will remove the GPU condition.
Have you validated this approach on a wide range of models? Potentially, changing the precision of input_ids can lead to overflow if the model has too large a vocabulary. If it is stable and these changes are suitable for CPU, we can consider changing the input precision at the model conversion step.
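For reference, a quick sanity check that the vocabulary sizes cited later in this thread fit comfortably within i32 range (a sketch; the model names and vocab sizes are the ones mentioned in the discussion, not an exhaustive survey):

```python
# i32 can represent token ids up to 2**31 - 1; compare against known vocab sizes.
I32_MAX = 2**31 - 1  # 2_147_483_647

vocab_sizes = {
    "Gemma2": 256_000,
    "GPT-4 tokenizer": 200_000,
    "hypothetical 1M-token vocab": 1_000_000,
}

for name, size in vocab_sizes.items():
    # Every id in [0, size) fits in i32 with large headroom.
    assert size < I32_MAX
    print(f"{name}: {size} ids, ~{I32_MAX // size}x below i32 max")
```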
Actually, the current GPU (and CPU) plugins always convert to i32; even when a model requires i64, we don't support it natively. I will add a comment regarding this.
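For illustration, this is roughly the downcast that currently happens at runtime when an i64 tensor reaches a plugin that only supports i32 (a NumPy sketch of the extra copy the PR aims to avoid; the real conversion happens inside the plugin's C++ code):

```python
import numpy as np

# input_ids as produced by most tokenizers / torch tensors: int64.
input_ids_i64 = np.array([[1, 15043, 29892, 3186]], dtype=np.int64)

# The plugin-side conversion: an extra element-wise copy to int32.
input_ids_i32 = input_ids_i64.astype(np.int32)

# Lossless for realistic token ids, which are far below 2**31 - 1.
assert input_ids_i32.dtype == np.int32
assert (input_ids_i32 == input_ids_i64).all()
```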
there was a discussion about this some time ago. Maybe @slyalin remembers the context.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What's the expected gain from this change? Do you have numbers?
By doing this, to get the desired effect of "avoiding runtime conversion" you need to align the other parts of the pipeline that supply these data: here in optimum-intel and in the deployment scenarios in C++ code. I don't think that in optimum you can avoid this runtime conversion; it will just happen in another place, where the int64 torch tensor will be converted to our updated i32 input tensor. Right?
As for C++ code, for example, the code here https://github.com/openvinotoolkit/openvino.genai/blob/master/src/cpp/src/greedy_decoding.cpp directly relies on i64. You need to propagate this change to the other parts, including tokenizers and detokenizers, to understand the impact (and value) of merely "avoiding runtime conversion".
Sure, in practice we won't see real sizes above 2G, I believe, so the numbers could be represented as i32. But I don't see how this PR alone is enough to leverage this possibility in a consistent manner.
OpenVINO Tokenizer uses i32 internally too but sets the output/input types to i64 by default for compatibility. One can set the type to i32 in the CLI and in the convert_tokenizer function, and we can easily switch the default value to i32 if needed.
I don't remember any model with more than a 1,000,000-token vocabulary; the largest vocabs are used by multilingual encoder models. The Gemma2 model has a 256k vocab, GPT-4 a 200k vocab, and other LLMs have smaller vocabs. Extrapolating the TinyLlama vocab to 1M tokens with a hidden state size of 2048 results in a 1,000,000 * 2048 * 2 bytes = 4.096 GB embedding matrix (using fp16/bf16), which is more than the 2.2 GB of the full model (32k vocab).
I think it is safe to use i32 for token indices.
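The arithmetic above can be checked directly (a sketch restating the numbers from this comment):

```python
# Verify the embedding-matrix size estimate for a hypothetical 1M-token vocab.
vocab_size = 1_000_000      # extrapolated vocabulary size
hidden_size = 2048          # TinyLlama hidden state size
bytes_per_param = 2         # fp16 / bf16

embedding_bytes = vocab_size * hidden_size * bytes_per_param
embedding_gb = embedding_bytes / 1e9
print(f"Embedding matrix: {embedding_gb} GB")  # 4.096 GB, as stated above
```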
What does this PR do?
The OpenVINO GPU plugin does not support int64 natively, so i64 inputs are always converted to i32. To avoid this runtime conversion, this PR updates the IO tensor precision to i32.
Fixes # (issue)
Before submitting