tlopex@tlopex-OMEN-by-HP-Laptop-17-ck1xxx:~/mlc-llm$ mlc_llm chat dist/OLMo-1B-0724-hf-q4f16_1-MLC --device "cuda:0" --overrides context_window_size=2048 --model ./dist/libs/OLMo-1B-0724-hf-q4f16_1-cuda.so
[2024-11-23 23:39:13] INFO auto_device.py:79: Found device: cuda:0
[2024-11-23 23:39:13] INFO engine_base.py:143: Using library model: ./dist/libs/OLMo-1B-0724-hf-q4f16_1-cuda.so
[23:39:13] /home/tlopex/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 2048, prefill chunk size will be set to 2048.
[23:39:13] /home/tlopex/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 2048, prefill chunk size will be set to 2048.
[23:39:13] /home/tlopex/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 85475, prefill chunk size will be set to 4096.
[23:39:13] /home/tlopex/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 2048, prefill chunk size is 2048.
[23:39:13] /home/tlopex/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 3171.858 MB (Parameters: 686.531 MB. KVCache: 368.534 MB. Temporary buffer: 2116.793 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out stats of last request (token/sec)
/metrics print out full engine metrics
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
`/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.
>>> Hello!
Hi
>>> Who are you?
I'm an AI Assistant who may (and most likely will) one day help you find what you're looking for in Birmingham — Technology->People
>>> Can you give me a joke?
I never know when to laugh, do I?
This PR adds support for the OLMo model architecture. The conversation above demonstrates the converted OLMo-1B-0724-hf model running through `mlc_llm chat`.
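For reference, below is a sketch of the standard mlc_llm workflow that would produce the artifacts used in the demo above (the quantized weight directory and the compiled CUDA library). The local HuggingFace checkpoint path and the `--conv-template` value are assumptions, not part of this PR's verified commands; adjust them as needed.

```shell
# 1. Convert and quantize the HuggingFace weights (q4f16_1).
#    ./dist/models/OLMo-1B-0724-hf is an assumed local checkpoint path.
mlc_llm convert_weight ./dist/models/OLMo-1B-0724-hf \
    --quantization q4f16_1 \
    -o ./dist/OLMo-1B-0724-hf-q4f16_1-MLC

# 2. Generate the chat config; the correct --conv-template for OLMo
#    depends on this PR, so "LM" here is only a placeholder.
mlc_llm gen_config ./dist/models/OLMo-1B-0724-hf \
    --quantization q4f16_1 \
    --conv-template LM \
    -o ./dist/OLMo-1B-0724-hf-q4f16_1-MLC

# 3. Compile the model library for CUDA, matching the path passed
#    via --model in the chat command above.
mlc_llm compile ./dist/OLMo-1B-0724-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device cuda \
    -o ./dist/libs/OLMo-1B-0724-hf-q4f16_1-cuda.so
```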