ngxson / wllama

WebAssembly binding for llama.cpp - Enabling in-browser LLM inference
https://huggingface.co/spaces/ngxson/wllama
MIT License

Support llama_encode (WIP) #91

Closed: ngxson closed this 4 months ago

ngxson commented 4 months ago
// Load a heavily quantized (Q2_K) Flan-T5 encoder-decoder model
await wllama.loadModelFromUrl("https://huggingface.co/Felladrin/gguf-flan-t5-large/resolve/main/flan-t5-large.Q2_K.gguf", {
  n_ctx: 1024,
});

// Greedy sampling (temp: 0) for a deterministic translation
const output = await wllama.createCompletion("translate English to French: How old are you?", {
  nPredict: 20,
  sampling: { temp: 0 },
});

// output:   Les âges de vous êtes-vous?
// expected: Vous avez quel âge ?
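
(Note: the snippet above assumes `wllama` is an already-initialized instance. A minimal setup sketch follows; the import path and asset-path keys are assumptions that may vary by wllama version and bundler.)

import { Wllama } from '@wllama/wllama';

// Assumed asset-path keys; adjust to your wllama version / bundler setup
const wllama = new Wllama({
  'single-thread/wllama.js':   '/assets/single-thread/wllama.js',
  'single-thread/wllama.wasm': '/assets/single-thread/wllama.wasm',
});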
ngxson commented 4 months ago

The quantized model is not usable (it seems Flan-T5 requires a lot of precision).

FP16 (the answer is a bit closer, but in French "être" is never used to ask someone's age):

[image: FP16 output]

INT8 (the answer is wrong):

[image: INT8 output]
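
(For reference, the comparisons above just swap which GGUF file gets loaded. The FP16 filename below is an assumption for illustration, not a file confirmed to exist in that repo.)

// Hypothetical: point at a higher-precision (FP16) GGUF instead of Q2_K.
// The exact filename is assumed; check the Hugging Face repo for the real one.
await wllama.loadModelFromUrl(
  "https://huggingface.co/Felladrin/gguf-flan-t5-large/resolve/main/flan-t5-large.F16.gguf",
  { n_ctx: 1024 }
);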

ngxson commented 4 months ago

@felladrin I still can't get reliable results, but it seems the problem comes from llama.cpp rather than from wllama.

This PR will be merged now.

felladrin commented 4 months ago

Thank you for implementing it, @ngxson! I tested it with https://huggingface.co/Felladrin/gguf-MaxMini-Instruct-248M and it worked great! Inference was considerably slower than with a 248M decoder-only model, but encoder-decoder models still have their uses!

ngxson commented 4 months ago

@felladrin Thanks for the info. I'm not sure why it's significantly slower; it's probably something to be optimized upstream.

And yeah, I agree that encoder-decoder models are still useful. Personally, I've found that for more deterministic tasks like translation, they hallucinate less than decoder-only models.
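
(As a rough way to quantify the slowdown mentioned above: timing the call in the browser gives a ballpark number. A sketch, assuming `wllama` already has the encoder-decoder model loaded; it measures encode and decode together, not separately.)

// Rough timing sketch: wraps a single createCompletion call with performance.now().
// Assumes `wllama` already has the encoder-decoder model loaded.
const t0 = performance.now();
const translation = await wllama.createCompletion("translate English to French: How old are you?", {
  nPredict: 20,
  sampling: { temp: 0 },
});
console.log(`completion took ${(performance.now() - t0).toFixed(0)} ms:`, translation);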