profitgrowinginnovator opened this issue 4 days ago
Moved this to the models repo since this is an implementation issue.
The current implementation is not fully optimized yet 😅 the KV cache, sampling, and flash attention come to mind.
We should be working on that pretty soon 👀
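For anyone wondering why the missing KV cache in particular costs so much, here is a toy, framework-agnostic sketch (plain NumPy, not Burn code; all names and sizes are illustrative): without a cache, every decoding step re-runs the key/value projections over the entire prefix, so the total work grows quadratically with the number of generated tokens, while a cache keeps the per-step work roughly constant.

```python
# Toy illustration (NumPy, not Burn): why a KV cache speeds up decoding.
import numpy as np

d = 64  # head dimension (illustrative)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_no_cache(tokens):
    # Re-projects K and V for the whole prefix at every step,
    # so total projection work grows quadratically with length.
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        K, V = prefix @ Wk, prefix @ Wv
        q = prefix[-1] @ Wq
        _ = attend(q, K, V)

def decode_with_cache(tokens):
    # Projects each token's K/V once and appends it to a cache,
    # so each step only processes the newest token.
    K_cache, V_cache = [], []
    for x in tokens:
        K_cache.append(x @ Wk)
        V_cache.append(x @ Wv)
        q = x @ Wq
        _ = attend(q, np.stack(K_cache), np.stack(V_cache))

tokens = np.random.randn(200, d)  # stand-in for embedded tokens
decode_no_cache(tokens)
decode_with_cache(tokens)
```

Flash attention and faster sampling reduce the per-step cost further, but the cache is usually the single biggest win for autoregressive decoding.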
That is good to know, because I very much like how Burn works in general. Please let me know when the performance is optimized so I can update my blog post: https://mectors.medium.com/ai-explained-llm-performance-slow-python-transformers-fast-golang-rust-but-not-always-e3895f03c760
Describe the bug
Running the same TinyLlama 1.1B model with Python Transformers on a CUDA Tesla P100 takes 4.59 seconds to load the model and 2.46 seconds to generate, producing 50.47 tokens per second. With Burn, however, it takes 9 seconds to load the model, 10 seconds to generate, and only around 6.4 tokens are generated per second. What is the point of "fast Rust" if it is almost 8 times slower, or is there a mistake?
To Reproduce
Here is the Python code:

```python
import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
user_prompt = "How many helicopters can a human eat in one sitting?"
system_prompt = (
    "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate\n"
    "<|user|>\n{prompt}\n<|assistant|>\n"
).format(prompt=user_prompt)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

start_time = time.time()
print("Loading tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})
    model.resize_token_embeddings(len(tokenizer))

load_time = time.time() - start_time
print(f"Model loaded in {load_time:.2f} seconds.")

print("Tokenizing input...")
inputs = tokenizer(
    system_prompt,
    return_tensors="pt",
    padding=True,
    truncation=True,
).to(device)

print("Generating response...")
start_time = time.time()

outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=200,   # Limit response length
    temperature=0.7,  # Adjusts randomness in output
    top_p=0.9,        # Nucleus sampling
    do_sample=True,
)

generation_time = time.time() - start_time
num_tokens = outputs.shape[1]  # Number of tokens generated
tokens_per_second = num_tokens / generation_time
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Response:\n", response)
print(f"Generation completed in {generation_time:.2f} seconds.")
print(f"Tokens generated per second: {tokens_per_second:.2f}")
```
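One measurement caveat when comparing the two setups: `outputs.shape[1]` includes the prompt tokens as well as the newly generated ones, so the reported tokens-per-second slightly overstates pure generation throughput. A possible adjustment (a suggested tweak, not part of the original report):

```python
# Count only tokens produced by generate(), excluding the prompt tokens.
prompt_len = inputs.input_ids.shape[1]
new_tokens = outputs.shape[1] - prompt_len
tokens_per_second = new_tokens / generation_time
print(f"Newly generated tokens per second: {tokens_per_second:.2f}")
```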
And here is the Burn solution: download the llama-burn code from https://github.com/tracel-ai/models.git, then build and run the chat example:

```sh
cargo build --release --features tiny,cuda --example chat
./target/release/examples/chat --top-p 0.9 --temperature=0.7 --max-seq-len=200
```
Expected behavior
Rust should be many times faster, not 8 times slower.