Oh I see, I tried the 13B model and generations took 3-4 seconds, and then I realised it was producing one token at a time instead of an entire generation. Yikes.
Does 8 minutes per token seem correct? 3090 Ti 24 GB, 64 GB RAM + 150 GB NVMe swap
@chrisbward That's too long; maybe your RAM is busy handling something else. The Python process alone uses 70 GB for the 30B model. It will probably get faster for subsequent tokens, once the system figures out it can offload fewer layers to swap.
Typically, if a model fits completely into RAM (128 GB), the 30B model returns me a single token in about 7 seconds, the 65B model in about 2 minutes after warm-up with swap, and the 13B model in 2-3 seconds.
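A quick way to tell whether generation is swap-bound rather than compute-bound is to watch RAM and swap usage while a token is being generated. A minimal sketch using psutil (the polling interval is an arbitrary choice, and this is not part of this repo, just a diagnostic you'd run alongside it):

```python
import time

import psutil  # third-party: pip install psutil


def report_memory_pressure(interval_s: float = 5.0) -> None:
    """Print RAM/swap usage periodically; heavy swap use while a
    token is being generated suggests the model does not fit in RAM."""
    while True:
        vm = psutil.virtual_memory()
        sm = psutil.swap_memory()
        print(
            f"RAM used: {vm.used / 2**30:.1f} GiB ({vm.percent}%), "
            f"swap used: {sm.used / 2**30:.1f} GiB ({sm.percent}%)"
        )
        time.sleep(interval_s)


if __name__ == "__main__":
    report_memory_pressure()
```

If swap usage climbs steadily during a single token, the model's layers are being paged in and out, which matches the 8-minutes-per-token symptom.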
thanks @randaller - I'm seeing 1 second per token for 7B, 3-4 seconds for 13B, and 8 minutes per token for 30B - guess I need more RAM then!
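For anyone wanting to reproduce these per-token numbers, a simple timing harness like the sketch below works; `generate_next_token` is a hypothetical stand-in for whatever single-token step your inference loop exposes, not this repo's actual API:

```python
import time


def time_tokens(generate_next_token, prompt_ids, max_new_tokens=16):
    """Time each generated token individually.

    generate_next_token: hypothetical callable taking the token ids
    so far and returning the next token id.
    """
    ids = list(prompt_ids)
    for i in range(max_new_tokens):
        t0 = time.perf_counter()
        next_id = generate_next_token(ids)
        dt = time.perf_counter() - t0
        print(f"token {i}: {dt:.2f} s")
        ids.append(next_id)
    return ids
```

The first token is usually the slowest (warm-up, paging the weights in), so it helps to report a few tokens rather than just one.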
User: tell me about london city
Takes about 8 minutes, and the reply is:
Then it immediately kicks off another generation; the reply is:
And again, it just repeats.