huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Make Cache statically configurable at model construction time #32500

Closed · guangy10 closed this issue 2 days ago

guangy10 commented 1 month ago

Feature request

Be able to construct and load a model with a statically configured cache, like this (placeholder values are used for the names the request leaves undefined):

from transformers import AutoModelForCausalLM, GenerationConfig

# Placeholder values for illustration; any decoder-only repo that supports a
# static cache, and any fixed batch size / cache length, would work here.
hf_model_repo = "google/gemma-2b"
cache_implementation = "static"
batch_size = 1
max_cache_len = 128

model = AutoModelForCausalLM.from_pretrained(
    hf_model_repo,
    attn_implementation="sdpa",
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation=cache_implementation,
        max_length=max_cache_len,
        cache_config={
            "batch_size": batch_size,
            "max_cache_len": max_cache_len,
        },
    ),
)
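
Once the model is constructed this way, generation should be able to rely on the statically sized cache without any per-call cache arguments; a hypothetical usage sketch (the tokenizer and prompt are placeholders, not part of this issue):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(hf_model_repo)
inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Assumption: the static cache settings come from the GenerationConfig passed at
# construction time, so no cache arguments need to be passed here.
output_ids = model.generate(**inputs)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))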

See additional context in #32253

Motivation

This feature request is to support torch.export(): the model should be exportable in a way that can be further lowered and run in ExecuTorch with good performance out of the box.
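
A minimal sketch of the export step this would enable, assuming the construction call above succeeds and that the model's forward accepts input_ids and cache_position; the example inputs, shapes, and lowering path are illustrative assumptions, not part of this issue:

import torch

# Hypothetical export step: trace the statically configured model with torch.export.
# The (1, 1)-shaped decode-style inputs below are placeholders.
example_input_ids = torch.zeros((1, 1), dtype=torch.long)
example_cache_position = torch.tensor([0], dtype=torch.long)

with torch.no_grad():
    exported_program = torch.export.export(
        model,
        args=(),
        kwargs={
            "input_ids": example_input_ids,
            "cache_position": example_cache_position,
        },
    )

# The resulting ExportedProgram could then be lowered further
# (e.g., via ExecuTorch's to_edge / to_executorch pipeline).
print(exported_program.graph_module)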

Your contribution

TBD

guangy10 commented 4 weeks ago

PR is published: https://github.com/huggingface/transformers/pull/32830