ngxson closed this 4 months ago
The quantized model is not usable (it seems Flan requires a lot of precision).
FP16 (the answer is a bit more correct, but in French we never use "être" when asking someone's age):
INT8 (the answer is wrong):
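A minimal sketch of how such an FP16 vs. INT8 comparison can be run with wllama's JS API (the GGUF URLs, the test prompt, and the WASM asset paths below are illustrative placeholders, not the exact ones used in this test):

```ts
import { Wllama } from '@wllama/wllama';

// Paths to the wllama WASM binaries; adjust to wherever your bundler serves them.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': './esm/multi-thread/wllama.wasm',
};

// Hypothetical GGUF URLs for the FP16 and INT8 (Q8_0) quants of the Flan model.
const MODELS = {
  fp16: 'https://example.com/flan-t5-small-f16.gguf',
  int8: 'https://example.com/flan-t5-small-q8_0.gguf',
};

async function runOnce(modelUrl: string, prompt: string): Promise<string> {
  const wllama = new Wllama(CONFIG_PATHS);
  await wllama.loadModelFromUrl(modelUrl);
  // Temperature 0 so the two quants are compared on (near) deterministic output.
  const output = await wllama.createCompletion(prompt, {
    nPredict: 64,
    sampling: { temp: 0.0 },
  });
  await wllama.exit(); // release the current model before loading the next one
  return output;
}

// Illustrative prompt, matching the kind of French translation test described above.
const prompt = 'Translate to French: How old are you?';
console.log('FP16:', await runOnce(MODELS.fp16, prompt));
console.log('INT8:', await runOnce(MODELS.int8, prompt));
```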
@felladrin I still can't get reliable results, but it seems the problem comes from llama.cpp rather than wllama.
This PR will be merged now.
Thank you for implementing it, @ngxson! I tested it with https://huggingface.co/Felladrin/gguf-MaxMini-Instruct-248M and it worked great! Inference was considerably slower than with a 248M decoder-only model, but encoder-decoder models still have their uses!
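In case anyone wants to reproduce the speed comparison, here is a minimal sketch using wllama (the model URLs, prompt, and WASM asset paths are placeholders, and this is plain wall-clock timing in the browser, not a proper benchmark):

```ts
import { Wllama } from '@wllama/wllama';

// WASM asset locations; adjust to wherever your bundler serves the wllama files.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': './esm/multi-thread/wllama.wasm',
};

// Returns wall-clock milliseconds for a single completion.
async function timeCompletion(modelUrl: string, prompt: string): Promise<number> {
  const wllama = new Wllama(CONFIG_PATHS);
  await wllama.loadModelFromUrl(modelUrl);
  const start = performance.now();
  await wllama.createCompletion(prompt, {
    nPredict: 64,
    sampling: { temp: 0.0 }, // deterministic-ish so both runs generate comparable text
  });
  await wllama.exit();
  return performance.now() - start;
}

const prompt = 'Write a short greeting.';
// Placeholder URLs: a quant from the repo linked above vs. a ~248M decoder-only GGUF.
const encoderDecoderGguf = 'https://example.com/MaxMini-Instruct-248M.Q8_0.gguf';
const decoderOnlyGguf = 'https://example.com/some-248m-decoder-only.Q8_0.gguf';

console.log('encoder-decoder (ms):', await timeCompletion(encoderDecoderGguf, prompt));
console.log('decoder-only    (ms):', await timeCompletion(decoderOnlyGguf, prompt));
```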
@felladrin Thanks for the info. I'm not sure why it's significantly slower; it's probably something to be optimized upstream.
And yeah, I agree that encoder-decoder models are still useful. Personally, I've found that for more deterministic tasks like translation, they hallucinate less than decoder-only models.