kimborgen closed this issue 1 year ago
Ahhhhhh, huggingface/transformers has already ported the model: https://github.com/huggingface/transformers/blob/main/src/transformers/models/falcon/modeling_falcon.py
This is not yet reflected in the falcon-7b/40b HF repos.
A quick comparison gives the following results:
With:

```python
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")
```

nvidia-smi for the HF model:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:3E:00.0 Off | Off |
| 75% 89C P2 298W / 300W | 28846MiB / 49140MiB | 92% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Timing: the average time over 10 runs was 27.02 s for 1930 tokens, i.e. ~7.14 tokens/s.
With:

```python
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto", torch_dtype=torch.float16)
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:3E:00.0 Off | Off |
| 52% 78C P2 223W / 300W | 15046MiB / 49140MiB | 56% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
The average time over 10 runs was 9.71 s for 1930 tokens, i.e. ~19.87 tokens/s.
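The memory drop between the two nvidia-smi tables is consistent with simple parameter-size arithmetic (assuming ~7B parameters and that the weights dominate GPU memory; activations and CUDA context account for the remainder):

```python
# Back-of-envelope check of the nvidia-smi numbers:
# ~7e9 parameters at 4 bytes each (fp32 default) vs 2 bytes each (fp16).
params = 7_000_000_000

fp32_gib = params * 4 / 2**30  # weights only, in GiB
fp16_gib = params * 2 / 2**30

print(round(fp32_gib, 1))  # ~26 GiB of weights, vs 28846 MiB observed
print(round(fp16_gib, 1))  # ~13 GiB of weights, vs 15046 MiB observed
```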
Using the local model in this repo produces:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:3E:00.0 Off | Off |
| 46% 76C P2 230W / 300W | 15044MiB / 49140MiB | 59% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
and the average time over 10 runs was 9.58 s for 1930 tokens, i.e. ~20.15 tokens/s.
About the same; the small difference could be noise, but it could also be due to the model_rotary extraction in #11.
But does this mean that the HF repo did not fix past_key_values?
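Why past_key_values matters so much for throughput: without the KV cache, every generation step re-runs attention over the whole prefix, so total work grows quadratically with sequence length. A toy illustration (not Falcon code) counting prefix-token evaluations:

```python
# Toy model of generation cost with and without a KV cache.
# Without past_key_values, step i must re-process all i prefix tokens;
# with the cache, each step processes only the single new token.

def steps_without_cache(n_tokens: int) -> int:
    return sum(range(1, n_tokens + 1))  # 1 + 2 + ... + n = n(n+1)/2

def steps_with_cache(n_tokens: int) -> int:
    return n_tokens

# For the 1930-token generations benchmarked above:
print(steps_without_cache(1930))  # 1,863,415 prefix-token evaluations
print(steps_with_cache(1930))     # 1,930
```

This roughly matches the ~3x throughput gap observed between the broken and fixed code paths (the gap is smaller than the raw operation count suggests because attention is only part of each forward pass).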
To recap:
Ah, it seems the HF port has not made it into transformers==4.31.0; forking the code from transformers main gives much better results.
HF current (4.31.0): [wops I cleared the console, but it was like 600 tokens at 7-8 tokens/s ish]
HF port directly from main: generated 2034 tokens in 92.56 seconds, i.e. ~21.98 tokens/s.
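To pick up the in-library Falcon port before it ships in a release, transformers can be installed from the main branch (a standard pip VCS install; assumes git is available):

```shell
# Install transformers from the main branch instead of the 4.31.0 release
pip install git+https://github.com/huggingface/transformers.git
```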
Fixed in PR #16.
Refs:
Discussions:
PR:
Issue: Model does not utilize past_key_values