huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Codellama-7b is 3x slower in 4.38.2 compared to 4.34.0 with CPU backend #29412

Closed amdrajeevp closed 5 months ago

amdrajeevp commented 6 months ago

System Info

With transformers 4.34.0 the latency is ~370 ms/token, while with 4.38.2 it is ~990 ms/token. The model size is also slightly larger in 4.38.2.

Who can help?

No response

Information

Tasks

Reproduction

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model is codellama/CodeLlama-7b-hf, run on the CPU backend
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

inputs = tokenizer(" def fibonacci_recursive(n):", return_tensors="pt")
attention_mask = torch.ones(inputs.input_ids.shape)
start = time.perf_counter()
generate_ids = model.generate(input_ids=inputs.input_ids, attention_mask=attention_mask, max_new_tokens=100)
end = time.perf_counter()
print(f"{end - start}")

Expected behavior

I expect the model/weights to be the same and to produce the same latency. This discrepancy is not observed with Llama-2-7B, though.

cajukev commented 6 months ago

I've seen a similar slowdown with NousResearch/Nous-Hermes-2-SOLAR-10.7B when upgrading from 4.37.2 to 4.38.2. Would love to know what's happening here.

ArthurZucker commented 6 months ago

A few things could be at play: we added support for static compilation and cached the causal_mask (which should be faster), but forced RoPE to be computed in float32. cc @gante

gante commented 5 months ago

@amdrajeevp @cajukev @ArthurZucker

Codellama is an interesting case, as it has quite a large base maximum sequence length (16k tokens). @amdrajeevp the larger model size is due to the new cached causal mask, although we are reconsidering how best to do this without impacting the memory footprint (see https://github.com/huggingface/transformers/issues/29484)
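For intuition, here is a back-of-the-envelope estimate of what a fully materialized [16k, 16k] mask costs. The bytes-per-entry values are assumptions for illustration; the exact storage dtype depends on the transformers version.

```py
# Rough cost of caching a full causal mask for CodeLlama's 16k base context.
seq_len = 16_384
entries = seq_len * seq_len                                              # 268,435,456 mask entries
print(f"{entries * 1 / 2**20:.0f} MiB at 1 byte/entry (e.g. bool)")      # ~256 MiB
print(f"{entries * 2 / 2**20:.0f} MiB at 2 bytes/entry (e.g. float16)")  # ~512 MiB
```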

If we measure the time separately for generate and forward, we can see that the additional time in generate is explained by the additional time in forward. If we go down a level and profile forward (v4.37.2 vs main) on a GPU, we can conclude the following:

  1. The new causal_mask update at the start of forward does take significant GPU time, but it does not slow down inference. There are many gaps between subsequent kernel launches, and those gaps are the bottleneck;
  2. The changes in LlamaRotaryEmbedding.forward do negatively impact the eager forward pass [v4.37.2: use cached tensors; main: compute the tensors from scratch, as that is more efficient in the compiled code path]. On my measurements, this accounts for ~80% of the slowdown (0.02 ms/layer -> 0.2 ms/layer). A sketch of the two code paths follows the profiling script below;
  3. For some reason, the SDPA computation is also slower. I am not sure why 🤔 This accounts for ~20% of the slowdown on my measurements (0.05 ms/layer -> 0.09 ms/layer)
Script to profile `forward`:

```py
# Run this. Then, on your terminal, run `tensorboard --logdir ./tb_logs/`.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

max_new_tokens = 100
model_name = "codellama/CodeLlama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)
inputs = tokenizer(" def fibonacci_recursive(n):", return_tensors="pt").to(model.device)

model_out = model(**inputs)
past_kv = model_out.past_key_values
new_inputs = torch.argmax(model_out.logits[:, -1, :], dim=-1).unsqueeze(0)

fwd_times = []
profile_dir = "./tb_logs"
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
    on_trace_ready=tensorboard_trace_handler(dir_name=profile_dir),
):
    for i in range(3):
        start = time.perf_counter()
        model_out = model(new_inputs, past_key_values=past_kv)
        end = time.perf_counter()
        fwd_times.append((end - start) * 1000)
        past_kv = model_out.past_key_values
        new_inputs = torch.argmax(model_out.logits[:, -1, :], dim=-1).unsqueeze(0)
```
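To make point 2 concrete, here is a minimal sketch (not the library code) of the rough difference between the two RoPE strategies; function and argument names are illustrative:

```py
import torch

def rope_cached(positions, cos_cache, sin_cache):
    # pre-v4.38 style: index into sin/cos tensors precomputed at init time
    return cos_cache[positions], sin_cache[positions]

def rope_recomputed(positions, inv_freq):
    # v4.38 style: rebuild sin/cos on every forward, in float32 for precision
    freqs = torch.outer(positions.float(), inv_freq.float())
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()

# Example: head_dim=128 as in CodeLlama-7b; positions kept 1D for simplicity
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, 128, 2, dtype=torch.float32) / 128))
cos, sin = rope_recomputed(torch.arange(8), inv_freq)
```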

@ArthurZucker

  1. While you were measuring latency with torch.compile, using a cached sin/cos was worse, correct? If so, we should bite the bullet here, and accept slowdowns in some cases while in eager mode;
  2. Any idea why SDPA became slower after v4.37.2? The only difference I see is the attention mask argument. If you have no clue, I can dive deeper to find the cause :)
ArthurZucker commented 5 months ago

> For some reason, the SDPA computation is also slower. I am not sure why 🤔 This accounts for ~20% of the slowdown on my measurements (0.05 ms/layer -> 0.09 ms/layer)

I believe this is due to whether or not the is_causal flag is being used properly.
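For reference, a small sketch of the two SDPA call patterns in question (tensor shapes are illustrative; on GPU, the is_causal path can dispatch to the flash/memory-efficient kernels, while a materialized mask may force a slower fallback):

```py
import torch
import torch.nn.functional as F

# [batch, heads, seq, head_dim]; 32 heads x 128 dims as in a 7B Llama-style model
q = k = v = torch.randn(1, 32, 128, 128)

# Let the kernel handle causality itself
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Pass an explicit, materialized causal mask instead
mask = torch.tril(torch.ones(128, 128, dtype=torch.bool))
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```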

> While you were measuring latency with torch.compile, using a cached sin/cos was worse, correct?

Not just worse: it was impossible to launch cudagraphs because you access cached values, and, as we later discovered, you also get precision issues. Computing in float32 is also slower, so that is something to take into account.

If the 3x slowdown only happens on CPU, IMO that is something we can try to combat and we should dig a bit deeper, but CPU slowdowns can come from the float32 RoPE. We can also look into LRU caching for RoPE. Note that we deactivated gradients, so it should be even faster. All of this should be taken into account when checking performance.
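As an illustration of the LRU-caching idea, here is a sketch of one possible approach (not the library implementation; names are made up):

```py
from functools import lru_cache
import torch

@lru_cache(maxsize=8)
def rope_tables(seq_len: int, dim: int, base: float = 10000.0):
    # Computed once per (seq_len, dim, base) in float32, then reused on later calls
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    t = torch.arange(seq_len, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()

cos, sin = rope_tables(4096, 128)  # cache miss: compute
cos, sin = rope_tables(4096, 128)  # cache hit: reuse
```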

SDPA should not be hard to tackle: if you take the memory-efficient path you are slower but use less memory; if you take the fast path you use more memory.
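One way to compare the two paths explicitly, as a sketch assuming a CUDA device and the torch.backends.cuda.sdp_kernel context manager from PyTorch 2.x (deprecated in favor of torch.nn.attention.sdpa_kernel in newer releases):

```py
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 32, 1024, 128, dtype=torch.float16, device="cuda")

# Flash path: fastest fused kernel
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_mem_efficient=False, enable_math=False):
    out_fast = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Memory-efficient path: slower fused kernel that also avoids materializing the full attention matrix
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=True, enable_math=False):
    out_mem = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```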

gante commented 5 months ago

@amdrajeevp @cajukev I'm assuming your inputs are much shorter than the model's max_position_embeddings. In that case, could you try initializing the model with a smaller max_position_embeddings? I suspect it may help in your case

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
print(model.model.causal_mask.shape) # torch.Size([4096, 4096])

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", max_position_embeddings=1024)
print(model.model.causal_mask.shape) # torch.Size([1024, 1024])
ArthurZucker commented 5 months ago

Actually, it seems to be an accelerate issue for me. If I don't use device_map="auto" on main, I get a faster forward

gante commented 5 months ago

@ArthurZucker I have the same speed/memory consumption with and without accelerate on an RTX 3090, i.e.

model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto", torch_dtype=torch.float16)

vs

model = AutoModelForCausalLM.from_pretrained(repo_id).to(device="cuda", dtype=torch.float16)
ArthurZucker commented 5 months ago

The linked PR will address this