This PR adds support for the OLMo architecture.
Additional support: clip_qkv.
Test: already tested on Android (Pixel 4) and on CUDA (with tensor_parallel_shards=2).
Test models: amd/AMD-OLMo-1B (without clip_qkv) and allenai/OLMo-1B-0724-hf (with clip_qkv).
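For reference, the CUDA model library used below was built with the standard MLC-LLM flow. This is just a sketch of my local commands; the paths are from my setup and the conv template is an assumption:

    # Convert weights, generate the chat config, and compile the CUDA library.
    # Paths are local; --conv-template chatml is an assumption, not part of this PR.
    mlc_llm convert_weight ./dist/models/AMD-OLMo-1B-SFT --quantization q4f16_1 \
        -o ./dist/AMD-OLMo-1B-SFT-q4f16_1-MLC
    mlc_llm gen_config ./dist/models/AMD-OLMo-1B-SFT --quantization q4f16_1 \
        --conv-template chatml -o ./dist/AMD-OLMo-1B-SFT-q4f16_1-MLC
    mlc_llm compile ./dist/AMD-OLMo-1B-SFT-q4f16_1-MLC/mlc-chat-config.json \
        --device cuda -o ./dist/libs/AMD-OLMo-1B-SFT-q4f16_1-cuda.so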
However, the generation quality of the latter is not as good as expected, even though I've tried different implementations of the clip_qkv mechanism, e.g. te.compute and nn.maximum/nn.minimum.
In the end I checked the docs, and the following is the simplest:
    if self.clip_qkv is not None:
        qkv = qkv.maximum(-self.clip_qkv).minimum(self.clip_qkv)
But still the result isn't good enough.
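For reference, the te.compute variant I tried looked roughly like this (only a sketch; clip_qkv_te is an illustrative helper name, not code from this PR):

    from tvm import te, tir
    from tvm.relax.frontend.nn import Tensor, op

    def clip_qkv_te(x: Tensor, limit: float) -> Tensor:
        """Clamp every element of x into [-limit, limit] via te.compute."""
        def _clip(t: te.Tensor) -> te.Tensor:
            return te.compute(
                t.shape,
                lambda *idx: tir.max(
                    tir.min(t(*idx), tir.const(limit, t.dtype)),
                    tir.const(-limit, t.dtype),
                ),
                name="clip_qkv",
            )
        return op.tensor_expr_op(_clip, "clip_qkv", [x])

Both variants should be numerically equivalent, so the quality gap is probably not caused by the clipping itself.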
This is the output from the CLI:
/AMD-OLMo-1B-SFT-q4f16_1-cuda.so --device cuda --overrides "tensor_parallel_shards=2"
[2024-11-24 09:24:04] INFO auto_device.py:79: Found device: cuda:0
[2024-11-24 09:24:04] INFO engine_base.py:143: Using library model: ./dist/libs/AMD-OLMo-1B-SFT-q4f16_1-cuda.so
[09:24:04] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 2048, prefill chunk size will be set to 2048.
[09:24:04] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 2048, prefill chunk size will be set to 2048.
[09:24:04] /workspace/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 11593, prefill chunk size will be set to 2048.
[09:24:04] /workspace/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 2048, prefill chunk size is 2048.
[09:24:04] /workspace/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 2026.319 MB (Parameters: 631.266 MB. KVCache: 336.268 MB. Temporary buffer: 1058.785 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out stats of last request (token/sec)
/metrics print out full engine metrics
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
`/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.
>>> what is the result of 1 + 1?
The result is 2.
>>>
And this is the output from Android (Pixel 4):
Please note that this is my first PR. If I've missed something, please point it out. Thanks!