huggingface / optimum-intel

🤗 Optimum Intel: Accelerate inference with Intel optimization tools
https://huggingface.co/docs/optimum/main/en/intel/index
Apache License 2.0

Enable IPEXModel on XPU #663

Open · jiqing-feng opened 5 months ago

jiqing-feng commented 5 months ago

Hi @echarlaix. I want to enable all the model utils in ipex (modeling_utils) on XPU. It may need some changes, including another if-branch in forward or two forward functions (one for CPU and one for GPU); the KV cache is also different.
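
For illustration only, here is a minimal sketch of this device-dispatch idea (a single forward that branches on the tensor device and calls per-device implementations); the class and method names are hypothetical, not existing optimum-intel code:

    import torch
    from torch import nn

    class DeviceDispatchedAttention(nn.Module):
        """Hypothetical attention wrapper: one public forward, two device-specific paths."""

        def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
            # Single forward with an if-branch on the tensor device.
            if hidden_states.device.type == "xpu":
                return self._forward_xpu(hidden_states, **kwargs)
            return self._forward_cpu(hidden_states, **kwargs)

        def _forward_cpu(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
            # CPU path: would call the fused ops from intel_extension_for_pytorch here.
            return hidden_states

        def _forward_xpu(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
            # XPU path: would use XPU-specific kernels and its own KV-cache layout here.
            return hidden_states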

Are there any XPU-related issues on optimum-intel that might block our work, like the required XPU version and CI tests? I would also appreciate your advice on the integration. Thanks!

echarlaix commented 5 months ago

Hi @jiqing-feng, would it be a similar integration to the one done in ipex-llm?

jiqing-feng commented 5 months ago

> Hi @jiqing-feng, would it be a similar integration to the one done in ipex-llm?

Not exactly the same; we plan to keep only one attention forward but split it into different parts and let the tensor device decide which op should be used, like:

llama_attn_forward:
    key_cache, value_cache = preprocess_for_optimize(hidden_states, past_key_value, kwargs)
    query, key, value = self.qkv_gemm(hidden_states, key_cache, value_cache, kwargs)
    key, value = self.rope(key, value, position_ids, past_key_value, kwargs)
    present = get_present(key, value, past_key_value)
    attn_output, attn_weights, past_key_value = self.sdpa(query, key, value, attention_mask, past_key_value, kwargs)
    attn_output = attn_output.transpose(1, 2)
    attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
    if not output_attentions:
        attn_weights = None
    return attn_output, attn_weights, past_key_value

self.sdpa:
    if cpu:
        sdpa = self.ipex_scale_dot_product
    elif xpu:
        sdpa = self.sdpa_xpu

    attn_output, attn_weights, past_key_value = sdpa(
        query,
        key,
        value,
        math.sqrt(self.head_dim),
        past_key_value,
        None,
        attention_mask,
    )

    return attn_output, attn_weights, past_key_value
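
For illustration, a runnable sketch of this kind of device-based dispatch for the sdpa step; both branches fall back to PyTorch's scaled_dot_product_attention here, whereas the real integration would call ipex_scale_dot_product on CPU and an XPU-specific kernel on XPU:

    import torch
    import torch.nn.functional as F

    def dispatch_sdpa(query, key, value, attention_mask=None):
        # Pick the attention kernel based on the tensor device.
        if query.device.type == "xpu":
            # XPU branch: stand-in for an XPU-specific fused kernel.
            return F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)
        # CPU branch: stand-in for ipex's fused scale-dot-product.
        return F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)

    q = k = v = torch.randn(1, 8, 16, 64)  # (batch, num_heads, seq_len, head_dim)
    print(dispatch_sdpa(q, k, v).shape)  # torch.Size([1, 8, 16, 64])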

echarlaix commented 5 months ago

For me it would make sense to keep this integration in ipex-llm and to only enable loading of exported models in optimum-intel (through IPEXModel). What do you think?
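
For reference, loading through IPEXModel currently looks along these lines on CPU (illustrative only; the checkpoint is arbitrary, exact arguments may differ across versions, and the XPU placement discussed in this issue is not yet available):

    from transformers import AutoTokenizer
    from optimum.intel import IPEXModelForCausalLM

    model_id = "gpt2"  # any supported causal-LM checkpoint
    # export=True applies the ipex optimizations while loading a vanilla transformers model
    model = IPEXModelForCausalLM.from_pretrained(model_id, export=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))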

echarlaix commented 4 months ago

Hi @jiqing-feng, I see that different llama modeling (and other additional architectures) were introduced in both ipex and ipex-llm to add the ipex optimizations. I think redefining the transformers modeling (for different architectures and different optimizations) is not something we want to introduce in optimum-intel: it would result in significant code additions that will be difficult to maintain, and more importantly it might cause issues with future transformers releases (this happened for example after the transformers v4.40.0 release for the openvino export, as the model is patched before export, see https://github.com/huggingface/optimum-intel/pull/682). Such additions could also result in much more constrained transformers or even torch version requirements.

For these reasons I'd be in favor of keeping modeling_utils only for the changes that are required for the export (like done in https://github.com/huggingface/optimum-intel/blob/main/optimum/intel/utils/modeling_utils.py#L25) and moving the rest to another repo (itrex or ipex-llm could be good candidates, for example). That repo could then be used by optimum-intel: we would check for a specific compatible transformers version and, only in that case, overwrite the modeling. What do you think?
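
For illustration, a minimal sketch of such a version guard (the version bounds and the patching helper are hypothetical; optimum-intel's own version-check utilities could be used instead of packaging directly):

    import transformers
    from packaging import version

    # Hypothetical bounds: the transformers releases the external modeling is validated against.
    _MIN_SUPPORTED = version.parse("4.38.0")
    _MAX_SUPPORTED = version.parse("4.41.0")

    def maybe_overwrite_modeling(model):
        """Overwrite the transformers modeling only when the installed version is known compatible."""
        current = version.parse(transformers.__version__)
        if _MIN_SUPPORTED <= current < _MAX_SUPPORTED:
            # Here the optimized attention forward maintained in the external repo
            # (itrex / ipex-llm) would be swapped in, layer by layer.
            pass
        return model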