mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Model] Add support for Olmo architecture #3045

Closed · tlopex closed 1 day ago

tlopex commented 2 days ago

This PR adds support for the OLMo architecture.
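The demo below assumes the weights have already been converted and the model library compiled. For reference, a typical MLC LLM flow for producing those artifacts is sketched here; the Hugging Face repo path and the `LM` conversation template are illustrative assumptions, not taken from this PR:

# Fetch the HF checkpoint (path assumed; any local copy of the model works)
git clone https://huggingface.co/allenai/OLMo-1B-0724-hf ./dist/models/OLMo-1B-0724-hf

# Quantize and convert the weights to MLC format
mlc_llm convert_weight ./dist/models/OLMo-1B-0724-hf/ \
    --quantization q4f16_1 \
    -o dist/OLMo-1B-0724-hf-q4f16_1-MLC

# Generate the chat config; the conversation template name is an assumption
mlc_llm gen_config ./dist/models/OLMo-1B-0724-hf/ \
    --quantization q4f16_1 --conv-template LM \
    -o dist/OLMo-1B-0724-hf-q4f16_1-MLC

# Compile the model library for CUDA
mlc_llm compile dist/OLMo-1B-0724-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda -o dist/libs/OLMo-1B-0724-hf-q4f16_1-cuda.so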

A sample chat with the compiled model is shown below:

tlopex@tlopex-OMEN-by-HP-Laptop-17-ck1xxx:~/mlc-llm$ mlc_llm chat dist/OLMo-1B-0724-hf-q4f16_1-MLC \
    --device "cuda:0" --overrides context_window_size=2048 \
    --model ./dist/libs/OLMo-1B-0724-hf-q4f16_1-cuda.so
[2024-11-23 23:39:13] INFO auto_device.py:79: Found device: cuda:0
[2024-11-23 23:39:13] INFO engine_base.py:143: Using library model: ./dist/libs/OLMo-1B-0724-hf-q4f16_1-cuda.so
[23:39:13] /home/tlopex/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 2048, prefill chunk size will be set to 2048. 
[23:39:13] /home/tlopex/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 2048, prefill chunk size will be set to 2048. 
[23:39:13] /home/tlopex/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 85475, prefill chunk size will be set to 4096. 
[23:39:13] /home/tlopex/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 2048, prefill chunk size is 2048.
[23:39:13] /home/tlopex/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 3171.858 MB (Parameters: 686.531 MB. KVCache: 368.534 MB. Temporary buffer: 2116.793 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out stats of last request (token/sec)
  /metrics            print out full engine metrics
  /reset              restart a fresh chat
  /set [overrides]    override settings in the generation config. For example,
                      `/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
                      Note: Separate stop words in the `stop` option with commas (,).
  Multi-line input: Use escape+enter to start a new line.

>>> Hello!
Hi
>>> Who are you?
I'm an AI Assistant who may (and most likely will) one day help you find what you're looking for in Birmingham — Technology->People
>>> Can you give me a joke?
I never know when to laugh, do I?
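Beyond the interactive CLI, the same compiled artifacts can be exposed through MLC LLM's OpenAI-compatible REST server; the log above already lists the "server" engine mode this uses. This is a minimal sketch following the current MLC LLM CLI docs (the `--model-lib` flag spelling and the default port 8000 are assumptions, not part of this PR):

# Launch the OpenAI-compatible server in "server" mode
mlc_llm serve dist/OLMo-1B-0724-hf-q4f16_1-MLC \
    --model-lib dist/libs/OLMo-1B-0724-hf-q4f16_1-cuda.so \
    --mode server

# Query it with a standard chat-completion request
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "dist/OLMo-1B-0724-hf-q4f16_1-MLC",
          "messages": [{"role": "user", "content": "Hello!"}]
        }'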