yeonbok opened 1 week ago
@yeonbok thank you for your PR, but I'm not sure that it is right place to do that:
@eaidova Yes, it might be beneficial for CPU as well (since internally we convert i64 to i32 anyway). I would say it sounds very logical to have such logic aligned across all devices.
@eaidova Thanks for the review, and @dmitry-gorokhov thanks for the input on the CPU plugin. It sounds like this can be applied generally, so I will remove the GPU condition.
Have you validated this approach on a wide range of models? Potentially, changing the precision of input_ids can lead to overflow if the model has too large a vocabulary. If it is stable and these changes are suitable for CPU, we can consider changing the input precision at the model conversion step.
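For reference, a quick sanity check that the vocabulary sizes cited later in this thread fit comfortably within i32 range (a sketch; the model names and vocab sizes are the ones mentioned in the discussion, not an exhaustive survey):

```python
# i32 can represent token ids up to 2**31 - 1; compare against known vocab sizes.
I32_MAX = 2**31 - 1  # 2_147_483_647

vocab_sizes = {
    "Gemma2": 256_000,
    "GPT-4 tokenizer": 200_000,
    "hypothetical 1M-token vocab": 1_000_000,
}

for name, size in vocab_sizes.items():
    # Every id in [0, size) fits in i32 with large headroom.
    assert size < I32_MAX
    print(f"{name}: {size} ids, ~{I32_MAX // size}x below i32 max")
```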
Actually, the current GPU (and CPU) plugins always convert to i32; even when a model requires i64, we don't support it natively. I will add a comment regarding this.
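For illustration, this is roughly the downcast that currently happens at runtime when an i64 tensor reaches a plugin that only supports i32 (a NumPy sketch of the extra copy the PR aims to avoid; the real conversion happens inside the plugin's C++ code):

```python
import numpy as np

# input_ids as produced by most tokenizers / torch tensors: int64.
input_ids_i64 = np.array([[1, 15043, 29892, 3186]], dtype=np.int64)

# The plugin-side conversion: an extra element-wise copy to int32.
input_ids_i32 = input_ids_i64.astype(np.int32)

# Lossless for realistic token ids, which are far below 2**31 - 1.
assert input_ids_i32.dtype == np.int32
assert (input_ids_i32 == input_ids_i64).all()
```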
there was a discussion about this some time ago. Maybe @slyalin remembers the context.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What's the expected gain from this change? Do you have numbers?
By doing this, to get the desired effect of "avoiding runtime conversion" you need to align the other parts of the pipeline that supply these data: here in optimum-intel and in the deployment scenarios in C++ code. I don't think that in optimum you can avoid this runtime conversion; it will just happen in another place, where the int64 torch tensor will be converted to our updated i32 input tensor. Right?
As for C++ code, for example, the code here https://github.com/openvinotoolkit/openvino.genai/blob/master/src/cpp/src/greedy_decoding.cpp directly relies on i64. You need to propagate this change to the other parts, including tokenizers and detokenizers, to understand the impact (and value) of merely "avoiding runtime conversion".
Sure, in practice we won't see real sizes above 2G, I believe, so the numbers could be represented as i32. But I don't see how this PR alone is enough to leverage this possibility in a consistent manner.
OpenVINO Tokenizer uses i32 internally too but sets the output/input types to i64 by default for compatibility. One can set the type to i32 in the CLI and in the convert_tokenizer function, and we can easily switch the default value to i32 if needed.
I don't remember any model with more than a 1,000,000-token vocabulary; the largest vocabs are used by multilingual encoder models. The Gemma2 model has a 256k vocab, GPT-4 a 200k vocab, and other LLMs have smaller vocabs. Extrapolating the TinyLlama vocab to 1M tokens with a hidden state size of 2048 results in a 1,000,000 * 2048 * 2 bytes = 4.096 GB embedding matrix (using fp16/bf16), which is more than the 2.2 GB of the full model (32k vocab).
I think it is safe to use i32 for token indices.
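The arithmetic above can be checked directly (a sketch restating the numbers from this comment):

```python
# Verify the embedding-matrix size estimate for a hypothetical 1M-token vocab.
vocab_size = 1_000_000      # extrapolated vocabulary size
hidden_size = 2048          # TinyLlama hidden state size
bytes_per_param = 2         # fp16 / bf16

embedding_bytes = vocab_size * hidden_size * bytes_per_param
embedding_gb = embedding_bytes / 1e9
print(f"Embedding matrix: {embedding_gb} GB")  # 4.096 GB, as stated above
```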
What does this PR do?
The OpenVINO GPU plugin does not support int64 natively, so i64 inputs are always converted to i32. To avoid this runtime conversion, this PR updates the IO tensor precision to i32.
Fixes # (issue)
Before submitting