Closed: akashicMarga closed this issue 2 weeks ago
I think it should actually work out of the box, e.g. with:
```shell
cargo run --features cuda --profile=release-with-debug --example llama -- --model-id HuggingFaceTB/SmolLM2-1.7B --which v32-1b
```
(note that it's an unquantised version though)
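For context, here is a minimal Rust sketch of roughly what that example does to load the model, using candle's stock llama implementation. It assumes the SmolLM2-1.7B repo ships a Llama-style `config.json` and a single `model.safetensors` file, and that the API matches recent candle versions; both are assumptions, not verified here:

```rust
use anyhow::Result;
use candle_core::{DType, Device};
use candle_nn::VarBuilder;
use candle_transformers::models::llama::{Cache, Llama, LlamaConfig};
use hf_hub::api::sync::Api;

fn main() -> Result<()> {
    let device = Device::cuda_if_available(0)?;
    let api = Api::new()?;
    let repo = api.model("HuggingFaceTB/SmolLM2-1.7B".to_string());

    // If SmolLM2 ships a Llama-style config.json, candle's LlamaConfig
    // should deserialize it directly.
    let config: LlamaConfig =
        serde_json::from_slice(&std::fs::read(repo.get("config.json")?)?)?;
    let config = config.into_config(false); // false = no flash-attn

    // Assumes a single-file safetensors checkpoint.
    let weights = repo.get("model.safetensors")?;
    let vb =
        unsafe { VarBuilder::from_mmaped_safetensors(&[weights], DType::F32, &device)? };

    let _cache = Cache::new(true, DType::F32, &config, &device)?;
    let _model = Llama::load(vb, &config)?;
    println!("loaded SmolLM2-1.7B with the stock llama implementation");
    Ok(())
}
```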
Hugging Face SmolLM2 would be a great addition to Candle. From its config, it looks similar to other Llama-based models. Can we use it directly with Candle's quantized Llama/TinyLlama example? If not, I can give it a try.
It's also a good candidate for Wasm/WebGPU client-side inference in the browser.
https://huggingface.co/collections/HuggingFaceTB/smollm2-6723884218bcda64b34d7db9
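On the quantization question: since SmolLM2 follows the Llama architecture, a GGUF export should in principle load through candle's `quantized_llama` module, which is what the quantized/TinyLlama examples build on. A minimal sketch follows; the GGUF file name is hypothetical (no official SmolLM2 GGUF is assumed here), but any Llama-architecture GGUF file should load the same way:

```rust
use anyhow::Result;
use candle_core::quantized::gguf_file;
use candle_core::Device;
use candle_transformers::models::quantized_llama::ModelWeights;

fn main() -> Result<()> {
    let device = Device::cuda_if_available(0)?;
    // Hypothetical GGUF conversion of SmolLM2.
    let path = "smollm2-1.7b-q4k.gguf";
    let mut file = std::fs::File::open(path)?;
    // Parse the GGUF header/metadata, then load the quantized weights.
    let content = gguf_file::Content::read(&mut file)?;
    let _model = ModelWeights::from_gguf(content, &mut file, &device)?;
    println!("loaded {path} with candle's quantized_llama");
    Ok(())
}
```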