This PR adds support for the OLMo architecture.
Additional support: clip_qkv.
Test: already tested on Android (Pixel 4) and on CUDA (with tensor_parallel_shards=2).
Test models: amd/AMD-OLMo-1B (without clip_qkv) and allenai/OLMo-1B-0724-hf (with clip_qkv).
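For reference, the CUDA model library used below was built with the standard MLC-LLM flow. This is just a sketch of my local commands; the paths are from my setup and the conv template is an assumption:

    # Convert weights, generate the chat config, and compile the CUDA library.
    # Paths are local; --conv-template chatml is an assumption, not part of this PR.
    mlc_llm convert_weight ./dist/models/AMD-OLMo-1B-SFT --quantization q4f16_1 \
        -o ./dist/AMD-OLMo-1B-SFT-q4f16_1-MLC
    mlc_llm gen_config ./dist/models/AMD-OLMo-1B-SFT --quantization q4f16_1 \
        --conv-template chatml -o ./dist/AMD-OLMo-1B-SFT-q4f16_1-MLC
    mlc_llm compile ./dist/AMD-OLMo-1B-SFT-q4f16_1-MLC/mlc-chat-config.json \
        --device cuda -o ./dist/libs/AMD-OLMo-1B-SFT-q4f16_1-cuda.so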
However, the generation quality of the latter is not as good as expected, even though I've tried different implementations of the clip_qkv mechanism, e.g. te.compute and nn.maximum/nn.minimum.
In the end I checked the docs, and the following is the simplest:
    if self.clip_qkv is not None:
        qkv = qkv.maximum(-self.clip_qkv).minimum(self.clip_qkv)
But still the result isn't good enough.
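For reference, the te.compute variant I tried looked roughly like this (only a sketch; clip_qkv_te is an illustrative helper name, not code from this PR):

    from tvm import te, tir
    from tvm.relax.frontend.nn import Tensor, op

    def clip_qkv_te(x: Tensor, limit: float) -> Tensor:
        """Clamp every element of x into [-limit, limit] via te.compute."""
        def _clip(t: te.Tensor) -> te.Tensor:
            return te.compute(
                t.shape,
                lambda *idx: tir.max(
                    tir.min(t(*idx), tir.const(limit, t.dtype)),
                    tir.const(-limit, t.dtype),
                ),
                name="clip_qkv",
            )
        return op.tensor_expr_op(_clip, "clip_qkv", [x])

Both variants should be numerically equivalent, so the quality gap is probably not caused by the clipping itself.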
This is the output from the CLI:
/AMD-OLMo-1B-SFT-q4f16_1-cuda.so --device cuda --overrides "tensor_parallel_shards=2"
[2024-11-24 09:24:04] INFO auto_device.py:79: Found device: cuda:0
[2024-11-24 09:24:04] INFO engine_base.py:143: Using library model: ./dist/libs/AMD-OLMo-1B-SFT-q4f16_1-cuda.so
[09:24:04] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 2048, prefill chunk size will be set to 2048.
[09:24:04] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 2048, prefill chunk size will be set to 2048.
[09:24:04] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 11593, prefill chunk size will be set to 2048.
[09:24:04] /workspace/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 2048, prefill chunk size is 2048.
[09:24:04] /workspace/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 2026.319 MB (Parameters: 631.266 MB. KVCache: 336.268 MB. Temporary buffer: 1058.785 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out stats of last request (token/sec)
/metrics print out full engine metrics
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
`/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.
>>> what is the result of 1 + 1?
The result is 2.
>>>
And this is the output from Android (Pixel 4):
Please note that this is my first PR. If I've missed something, please point it out. Thanks!