irthomasthomas opened 6 months ago
There is no technical reason why we can't use a second speculative decoding model, making three models in total: the smallest model drafts tokens for the mid-size model, and the mid-size model's approved tokens in turn serve as the draft for the target model.
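A minimal sketch of what such a two-level cascade could look like. The "models" are toy stand-ins (all names and agreement rates here are hypothetical, not from any real implementation), and the greedy-match acceptance rule is a simplification of full speculative sampling:

```python
import random

# Toy stand-ins for the three models in the proposed cascade.
# Each "model" maps a token sequence to a greedy next token; the bigger
# the model, the more often it agrees with the "true" continuation.

def _true_next(tokens):
    return (sum(tokens) * 31 + 7) % 100

def tiny(tokens):    # ~164M-class: fast, agrees ~60% of the time
    return _true_next(tokens) if random.random() < 0.6 else random.randrange(100)

def mid(tokens):     # ~1.6B-class: agrees ~85% of the time
    return _true_next(tokens) if random.random() < 0.85 else random.randrange(100)

def target(tokens):  # ~16B-class: defines correctness in this toy setup
    return _true_next(tokens)

def draft(model, tokens, n):
    """Autoregressively draft n tokens with a cheap model."""
    out, ctx = [], list(tokens)
    for _ in range(n):
        t = model(ctx)
        out.append(t)
        ctx.append(t)
    return out

def verify(model, tokens, drafted):
    """Keep the longest prefix of `drafted` that `model` would also have
    produced (greedy-match acceptance, a simplification of the full
    speculative-sampling accept/reject rule)."""
    kept, ctx = [], list(tokens)
    for t in drafted:
        if model(ctx) != t:
            break
        kept.append(t)
        ctx.append(t)
    return kept

def cascade_step(tokens):
    """One decode step of the two-level cascade: tiny drafts for mid,
    and the mid-approved prefix becomes the draft for the target."""
    stage1 = verify(mid, tokens, draft(tiny, tokens, 8))
    if not stage1:                      # mid rejected everything:
        stage1 = draft(mid, tokens, 1)  # fall back to one mid-model token
    accepted = verify(target, tokens, stage1)
    if not accepted:                    # target rejected everything:
        accepted = [target(tokens)]     # still make progress each step
    return accepted

if __name__ == "__main__":
    seq = [1, 2, 3]
    for _ in range(5):
        seq += cascade_step(seq)
    print(seq)
```

The point of the cascade is that the expensive target model is only ever invoked to verify tokens the mid-size model has already filtered, so its cost is amortized over longer accepted runs.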
The current model is 164M parameters. An order of magnitude larger would be 1.6B, and an order of magnitude beyond that would still only be 16B parameters. At 8-bit, all three together come to about 18GB, plus the context window. I have a 12GB RTX 3060 and an older, slower 8GB GTX 1080, so this fits in VRAM, but I'm not sure whether that 1080 would actually be faster than running that part on the CPU.
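A quick sanity check of that arithmetic (a sketch only: 1 byte per parameter at 8-bit, ignoring KV cache and activation overhead; the parameter counts are the rough figures above):

```python
# 8-bit quantization => bytes == parameters, overhead ignored
models = {"draft-1": 164e6, "draft-2": 1.6e9, "target": 16e9}
total_bytes = sum(models.values())
print(f"{total_bytes / 1e9:.1f} GB")  # ~17.8 GB, i.e. about 18GB + context
```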