kimborgen closed this issue 1 year ago
Ahhhhhh, huggingface/transformers has already ported the model: https://github.com/huggingface/transformers/blob/main/src/transformers/models/falcon/modeling_falcon.py
This is not yet reflected in the falcon-7b/40b HF repos.
A quick comparison gives the following results:
With:

```python
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")
```

nvidia-smi for the HF model:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:3E:00.0 Off | Off |
| 75% 89C P2 298W / 300W | 28846MiB / 49140MiB | 92% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Timing: the average time over 10 runs was 27.02 s for 1930 tokens, i.e. ~7.14 tokens/s.
With:

```python
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto", torch_dtype=torch.float16)
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:3E:00.0 Off | Off |
| 52% 78C P2 223W / 300W | 15046MiB / 49140MiB | 56% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
The average time over 10 runs was 9.71 s for 1930 tokens, i.e. ~19.87 tokens/s.
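The memory drop between the two nvidia-smi tables is consistent with simple parameter-size arithmetic (assuming ~7B parameters and that the weights dominate GPU memory; activations and CUDA context account for the remainder):

```python
# Back-of-envelope check of the nvidia-smi numbers:
# ~7e9 parameters at 4 bytes each (fp32 default) vs 2 bytes each (fp16).
params = 7_000_000_000

fp32_gib = params * 4 / 2**30  # weights only, in GiB
fp16_gib = params * 2 / 2**30

print(round(fp32_gib, 1))  # ~26 GiB of weights, vs 28846 MiB observed
print(round(fp16_gib, 1))  # ~13 GiB of weights, vs 15046 MiB observed
```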
Using the local model in this repo produces:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:3E:00.0 Off | Off |
| 46% 76C P2 230W / 300W | 15044MiB / 49140MiB | 59% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
and the average time over 10 runs was 9.58 s for 1930 tokens, i.e. ~20.15 tokens/s.
About the same; the small difference could be noise, but it could also be due to the model_rotary extraction in #11.
But does this mean that the HF repo did not fix past_key_values?
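Why past_key_values matters so much for throughput: without the KV cache, every generation step re-runs attention over the whole prefix, so total work grows quadratically with sequence length. A toy illustration (not Falcon code) counting prefix-token evaluations:

```python
# Toy model of generation cost with and without a KV cache.
# Without past_key_values, step i must re-process all i prefix tokens;
# with the cache, each step processes only the single new token.

def steps_without_cache(n_tokens: int) -> int:
    return sum(range(1, n_tokens + 1))  # 1 + 2 + ... + n = n(n+1)/2

def steps_with_cache(n_tokens: int) -> int:
    return n_tokens

# For the 1930-token generations benchmarked above:
print(steps_without_cache(1930))  # 1,863,415 prefix-token evaluations
print(steps_with_cache(1930))     # 1,930
```

This roughly matches the ~3x throughput gap observed between the broken and fixed code paths (the gap is smaller than the raw operation count suggests because attention is only part of each forward pass).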
To recap:
Ah, it seems the HF port has not made it into transformers==4.31.0; forking the code from transformers main gives much better results.
HF current (4.31.0): [wops I cleared the console, but it was like 600 tokens at 7-8 tokens/s ish]
HF port directly from main: generated 2034 tokens in 92.56 seconds, i.e. ~21.98 tokens/s.
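To pick up the in-library Falcon port before it ships in a release, transformers can be installed from the main branch (a standard pip VCS install; assumes git is available):

```shell
# Install transformers from the main branch instead of the 4.31.0 release
pip install git+https://github.com/huggingface/transformers.git
```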
Fixed in PR #16.
Refs:
Discussions:
PR:
Issue: Model does not utilize past_key_values