intel / xFasterTransformer

[KVCache] Add interface and register for kvcache. #336

Closed. Duyi-Wang closed this 4 months ago.

Duyi-Wang commented 4 months ago

The KV cache now uses FP16 by default. Baichuan, Qwen, and YaRNLlama previously defaulted to FP32; they now default to FP16 as well.

Usage:

C++ example:

example -t /data/llama-2-7b-hf/tokenizer.model -m /data/llama-2-7b-xft/ -d bf16 --kv_cache_dtype fp16
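The same flag accepts the other supported KV cache dtypes; for example, to keep the previous FP32 behavior with the same model and paths as above:

example -t /data/llama-2-7b-hf/tokenizer.model -m /data/llama-2-7b-xft/ -d bf16 --kv_cache_dtype fp32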

Python API:

model = xfastertransformer.AutoModel.from_pretrained(
    args.model_path, dtype=args.dtype, kv_cache_dtype=args.kv_cache_dtype
)
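For a concrete call, here is a minimal sketch that mirrors the C++ example above with literal values; the model directory is a placeholder taken from that example:

import xfastertransformer

# Placeholder path from the example above; point this at your own converted model.
MODEL_PATH = "/data/llama-2-7b-xft/"

# bf16 compute weights, with the (now default) fp16 KV cache made explicit.
model = xfastertransformer.AutoModel.from_pretrained(
    MODEL_PATH, dtype="bf16", kv_cache_dtype="fp16"
)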

Python demo.py:

python demo.py --kv_cache_dtype {fp32,fp16,int8}
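For reference, a minimal sketch of how such a flag can be wired through argparse into from_pretrained; the --model_path and --dtype definitions and all defaults here are illustrative assumptions, not necessarily demo.py's exact ones:

import argparse

import xfastertransformer

parser = argparse.ArgumentParser()
# Assumed flags for illustration; only --kv_cache_dtype's choices come from above.
parser.add_argument("--model_path", type=str, default="/data/llama-2-7b-xft/")
parser.add_argument("--dtype", type=str, default="bf16")
parser.add_argument(
    "--kv_cache_dtype", type=str, choices=["fp32", "fp16", "int8"], default="fp16"
)
args = parser.parse_args()

model = xfastertransformer.AutoModel.from_pretrained(
    args.model_path, dtype=args.dtype, kv_cache_dtype=args.kv_cache_dtype
)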