Hey! Thank you for sharing your numbers!
Quick sanity checks for running on GPUs, after looking at the script you shared here:

1. Move the model to the GPU in bfloat16 with `model.to(device='cuda:0', dtype=torch.bfloat16)`.
2. Pass the `attn_impl='triton'` kwarg to `AutoModelForCausalLM.from_pretrained` to use the Triton Flash Attention implementation, which is faster. Note that for this you would need to install the following dependencies: `flash-attn==0.2.8` and `triton==2.0.0.dev20221202`. Installing `flash-attn` is sometimes hairy, so install `torch` first and install `flash-attn` after. Our README on the Hub also has the exact snippets to use.
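For reference, a minimal sketch combining both checks (the model id is a placeholder, and `trust_remote_code=True` is assumed here for checkpoints that ship custom attention code):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder model id; substitute the actual checkpoint from the Hub.
name = "your-org/your-model"

# attn_impl='triton' selects the Triton Flash Attention path.
# Install torch first, then flash-attn==0.2.8 and triton==2.0.0.dev20221202.
model = AutoModelForCausalLM.from_pretrained(
    name,
    attn_impl="triton",
    trust_remote_code=True,
)

# Run inference on the GPU in bfloat16.
model.to(device="cuda:0", dtype=torch.bfloat16)
```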
Keep me posted if this doesn't work. Happy to help you out!
I'm on Windows so no Triton for me :(
Don't know how bfloat16 might affect performance, but I can try it later.
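Something like this is what I'd try (model id is a placeholder; default attention, since Triton isn't available on Windows):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder model id; no attn_impl kwarg, so the default torch attention is used.
model = AutoModelForCausalLM.from_pretrained("your-org/your-model", trust_remote_code=True)

# bfloat16 halves memory traffic vs. float32, so it may help even without Triton.
model.to(device="cuda:0", dtype=torch.bfloat16)
```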
That should help with inference.
Closing this issue. Let me know if you run into blockers or if I can help with anything else!
Got it running on my laptop with an i5-1135G7, 16GB RAM, and an RTX 3060 connected over Thunderbolt in an eGPU enclosure, running Windows 10.
The model loads in about 45s from my SSD.
When generating 100 tokens from `class AVeryLongClass:` (which is 8 tokens long), I'm getting around 8.47s for those tokens, so about 11.8 tok/s. It gets slower with a bigger context, of course. You can see my script here: https://gist.github.com/elikoga/c300b9bf6b090fda9187644766347348

Just wanted to share some numbers and where I got it running :D I like the generation results I'm seeing so far.
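In case it's useful, here's a rough sketch of the kind of timing loop behind those numbers (the gist above is the actual script; the model id here is a placeholder):

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder model id; the real script is in the gist linked above.
name = "your-org/your-model"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
model.to(device="cuda:0", dtype=torch.bfloat16)

prompt = "class AVeryLongClass:"  # 8 tokens with this tokenizer
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()  # make sure generation has actually finished
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed:.2f}s for {new_tokens} tokens -> {new_tokens / elapsed:.1f} tok/s")
```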
Maybe you can share some of your numbers too