Hey! Thank you for sharing your numbers!
Quick sanity checks for running on GPUs, after looking at the script you shared here:

1. Move the model to the GPU in bfloat16 with `model.to(device='cuda:0', dtype=torch.bfloat16)`.
2. Pass the `attn_impl='triton'` kwarg to `AutoModelForCausalLM.from_pretrained` to use the Triton Flash Attention implementation, which is faster. Note that for this you would need to install the following dependencies: `flash-attn==0.2.8` and `triton==2.0.0.dev20221202`. Installing `flash-attn` is sometimes hairy, so install `torch` first and install `flash-attn` after. Our README on the Hub also has the exact snippets to use.
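For reference, a minimal sketch combining both checks (the model id is a placeholder, and `trust_remote_code=True` is assumed here for checkpoints that ship custom attention code):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder model id; substitute the actual checkpoint from the Hub.
name = "your-org/your-model"

# attn_impl='triton' selects the Triton Flash Attention path.
# Install torch first, then flash-attn==0.2.8 and triton==2.0.0.dev20221202.
model = AutoModelForCausalLM.from_pretrained(
    name,
    attn_impl="triton",
    trust_remote_code=True,
)

# Run inference on the GPU in bfloat16.
model.to(device="cuda:0", dtype=torch.bfloat16)
```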
Keep me posted if this doesn't work. Happy to help you out!
I'm on Windows so no Triton for me :(
Don't know how bfloat16 might affect performance, but I can try it later.
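Something like this is what I'd try (model id is a placeholder; default attention, since Triton isn't available on Windows):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder model id; no attn_impl kwarg, so the default torch attention is used.
model = AutoModelForCausalLM.from_pretrained("your-org/your-model", trust_remote_code=True)

# bfloat16 halves memory traffic vs. float32, so it may help even without Triton.
model.to(device="cuda:0", dtype=torch.bfloat16)
```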
That should help with inference.
Closing this issue. Let me know if you run into blockers or if I can help with anything else!
Got it running on my laptop with an i5-1135G7, 16GB RAM, and an RTX 3060 connected over Thunderbolt in an eGPU enclosure, running Windows 10.
The model loads in about 45s from my SSD.
When generating 100 tokens from `class AVeryLongClass:` (which is 8 tokens long), I'm getting around 8.47s for those tokens, so about 11.8 tok/s. It gets slower with a bigger context, of course. You can see my script here: https://gist.github.com/elikoga/c300b9bf6b090fda9187644766347348

Just wanted to share some numbers and where I got it running :D I like the generation results I'm seeing so far.
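In case it's useful, here's a rough sketch of the kind of timing loop behind those numbers (the gist above is the actual script; the model id here is a placeholder):

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder model id; the real script is in the gist linked above.
name = "your-org/your-model"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
model.to(device="cuda:0", dtype=torch.bfloat16)

prompt = "class AVeryLongClass:"  # 8 tokens with this tokenizer
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()  # make sure generation has actually finished
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed:.2f}s for {new_tokens} tokens -> {new_tokens / elapsed:.1f} tok/s")
```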
Maybe you can share some of your numbers too