Closed · tbogdala closed this 3 weeks ago
Yeah, the change in the example part indeed seems a bit complex. Maybe we should just have the model change in this PR so that users of the `candle-transformers` crate can benefit from it, and we don't need to adapt the example for now.
This PR is discussed in #2108 and handles mask creation for the Llama model so that a user-supplied prompt can be processed in token batches instead of all at once. The key change was to `Cache::mask()`, adding a second `usize` parameter and creating an appropriately sized vector to turn into a `Tensor` there (see the sketch below).

The code in `candle-examples/examples/llama/main.rs` in this PR may need smoothing, but other than that, I've tested the example with and without the new `--prompt-batch-size` CLI parameter and at a variety of batch sizes.
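For readers who haven't opened the diff, here is a minimal sketch of what a two-argument mask construction could look like. The parameter name `seqlen_offset`, the `(usize, usize)` cache key, and the pared-down `Cache` struct are illustrative assumptions, not the exact code in the PR:

```rust
use std::collections::HashMap;

// Follows the in-repo `candle` alias for the candle-core crate.
use candle::{Device, Result, Tensor};

// Hypothetical, pared-down stand-in for the model's Cache; the real struct
// in candle-transformers also carries the KV tensors and other state.
struct Cache {
    masks: HashMap<(usize, usize), Tensor>,
    device: Device,
}

impl Cache {
    // `t` is the number of tokens in the current batch; `seqlen_offset` is
    // the number of prompt tokens already processed and sitting in the cache.
    fn mask(&mut self, t: usize, seqlen_offset: usize) -> Result<Tensor> {
        if let Some(mask) = self.masks.get(&(t, seqlen_offset)) {
            Ok(mask.clone())
        } else {
            // Row i of the batch may attend to every cached position plus
            // the first i + 1 positions of the batch itself; a 1 marks a
            // masked slot. With seqlen_offset == 0 this reduces to the
            // usual square causal mask.
            let mask: Vec<_> = (0..t)
                .flat_map(|i| {
                    (0..t + seqlen_offset).map(move |j| u8::from(j > i + seqlen_offset))
                })
                .collect();
            let mask = Tensor::from_slice(&mask, (t, t + seqlen_offset), &self.device)?;
            self.masks.insert((t, seqlen_offset), mask.clone());
            Ok(mask)
        }
    }
}
```

With a mask shaped `(t, t + seqlen_offset)`, each batch of prompt tokens can attend to everything already in the KV cache while staying causal within the batch, which is what lets the example feed the prompt in `--prompt-batch-size` chunks.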