Oh I see, I tried the 13B model and generations took 3-4 seconds, and then I realised it was producing one token at a time instead of an entire generation. Yikes.
Does 8 minutes per token seem correct? 3090 Ti 24 GB, 64 GB RAM + 150 GB NVMe swap
@chrisbward That's too long; maybe your RAM is busy handling something else. The Python process alone uses 70 GB for the 30B model. It will probably get faster for subsequent tokens, once the system figures out it can offload fewer layers to swap.
Typically, if a model fits completely into RAM (128 GB), the 30B model returns me a single token in about 7 seconds, the 65B model in about 2 minutes after warm-up with swap, and the 13B model in 2-3 seconds.
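A quick way to tell whether generation is swap-bound rather than compute-bound is to watch RAM and swap usage while a token is being generated. A minimal sketch using psutil (the polling interval is an arbitrary choice, and this is not part of this repo, just a diagnostic you'd run alongside it):

```python
import time

import psutil  # third-party: pip install psutil


def report_memory_pressure(interval_s: float = 5.0) -> None:
    """Print RAM/swap usage periodically; heavy swap use while a
    token is being generated suggests the model does not fit in RAM."""
    while True:
        vm = psutil.virtual_memory()
        sm = psutil.swap_memory()
        print(
            f"RAM used: {vm.used / 2**30:.1f} GiB ({vm.percent}%), "
            f"swap used: {sm.used / 2**30:.1f} GiB ({sm.percent}%)"
        )
        time.sleep(interval_s)


if __name__ == "__main__":
    report_memory_pressure()
```

If swap usage climbs steadily during a single token, the model's layers are being paged in and out, which matches the 8-minutes-per-token symptom.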
thanks @randaller - I'm seeing 1 second per token for 7B, 3-4 seconds for 13B, and 8 minutes per token for 30B - guess I need more RAM then!
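For anyone wanting to reproduce these per-token numbers, a simple timing harness like the sketch below works; `generate_next_token` is a hypothetical stand-in for whatever single-token step your inference loop exposes, not this repo's actual API:

```python
import time


def time_tokens(generate_next_token, prompt_ids, max_new_tokens=16):
    """Time each generated token individually.

    generate_next_token: hypothetical callable taking the token ids
    so far and returning the next token id.
    """
    ids = list(prompt_ids)
    for i in range(max_new_tokens):
        t0 = time.perf_counter()
        next_id = generate_next_token(ids)
        dt = time.perf_counter() - t0
        print(f"token {i}: {dt:.2f} s")
        ids.append(next_id)
    return ids
```

The first token is usually the slowest (warm-up, paging the weights in), so it helps to report a few tokens rather than just one.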
User: tell me about london city
Takes about 8 minutes, and the reply is:
Then it immediately kicks off another generation; the reply is:
And again, it just repeats.