turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

InternLM2 Support #283

Closed: brucethemoose closed this issue 3 weeks ago

brucethemoose commented 5 months ago

It uses custom modeling/tokenizer code like Yi used to:

https://huggingface.co/internlm/internlm2-chat-20b

It may or may not already work as-is, or it may work with a simple repacking hack to "llamafy" it. Consider this a WIP tracking issue; I am downloading it to test right now.

intervitens commented 5 months ago

There are llamafied versions, and they do seem to work as-is with ExllamaV2. https://huggingface.co/chargoddard/internlm2-20b-llama

turboderp commented 5 months ago

Does anyone know what the llamafication entails? It looks like the models just need tensors renamed and the QKV tensor split into separate Q, K and V tensors. But the safetensors files are substantially larger.
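If that is all the repacking amounts to, it could be sketched roughly like this. Everything here is illustrative: the function name and the assumption that the fused weight is a plain `[Q; K; V]` concatenation along the output dimension are mine; the actual InternLM2 layout may interleave heads per KV group, so this is a sketch of the idea, not the exact transform.

```python
import numpy as np

def split_qkv(wqkv: np.ndarray, q_rows: int, kv_rows: int):
    """Split a fused (q_rows + 2*kv_rows, hidden) weight into Q, K, V.

    Assumes the fused tensor is a simple row-wise concatenation
    [Q; K; V] -- a hypothetical layout for illustration only.
    """
    q = wqkv[:q_rows]
    k = wqkv[q_rows:q_rows + kv_rows]
    v = wqkv[q_rows + kv_rows:]
    return q, k, v

# Toy shapes: 8 query rows, 4 key/value rows (grouped-query attention),
# hidden size 8.
wqkv = np.arange(16 * 8, dtype=np.float32).reshape(16, 8)
q, k, v = split_qkv(wqkv, q_rows=8, kv_rows=4)
```

Renaming the tensors on top of a split like this should not change the total payload size, which is why the size discrepancy seemed odd.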

intervitens commented 5 months ago

Are you sure you're looking at the right files? I've compared the original and llamafied models, and llamafied safetensors in total are about 60KB smaller than the original bin files.

brucethemoose commented 5 months ago

This has kinda dropped off my radar, but one thing the custom code implements is its own rope scaling.

brucethemoose commented 5 months ago

See: https://huggingface.co/internlm/internlm2-chat-20b/blob/main/modeling_internlm2.py#L169-L214
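For context, the linked code looks like a dynamic NTK-style rescaling of the rotary base when the sequence grows past the trained context. A rough sketch of that idea follows; the function name and parameter names are mine, not the model's API, so treat this as an approximation of what the custom code does rather than a faithful port.

```python
import numpy as np

def ntk_scaled_inv_freq(dim, seq_len, max_pos=2048, base=10000.0, factor=1.0):
    """Compute rotary inverse frequencies with dynamic NTK-style scaling.

    When seq_len exceeds the trained context (max_pos), grow the rotary
    base so low frequencies stretch to cover the longer sequence.
    Illustrative sketch only.
    """
    if seq_len > max_pos:
        # Standard dynamic-NTK rescaling of the base frequency.
        base = base * ((factor * seq_len / max_pos) - (factor - 1)) ** (dim / (dim - 2))
    return 1.0 / (base ** (np.arange(0, dim, 2, dtype=np.float64) / dim))

# Within the trained context nothing changes; beyond it, the base grows
# and all but the first frequency shrink.
short = ntk_scaled_inv_freq(dim=128, seq_len=1024)
long = ntk_scaled_inv_freq(dim=128, seq_len=8192)
```

The practical point for ExllamaV2 is that a llamafied checkpoint loaded with plain rope parameters would silently lose this behavior at long context.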