Open fahadh4ilyas opened 3 months ago
It's hard to guarantee a particular output because it's a dynamic process and the output can vary wildly with small differences in initial conditions or numerical precision. The response you're getting isn't wrong, it's just not in the language you seem to be expecting. Since you haven't instructed the model to respond in Indonesian, and the system prompt is in English, it's likely going to be somewhat random which path the sampler chooses to go down.
Now, I'm not sure which of the examples you're referring to, but the default sampling settings for most of them are:
- repetition penalty: 1.025 (should be equivalent to the HF implementation, but I'm not 100% on that)
- temperature: 0.8
- top-K: 50
- top-P: 0.8
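For reference, here's roughly how those values get set on exllamav2's sampler settings object (a minimal sketch assuming the current `ExLlamaV2Sampler.Settings` attributes; names may differ between versions):

```python
from exllamav2.generator import ExLlamaV2Sampler

# Sampler settings roughly matching the example defaults listed above
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.025  # multiplicative penalty on previously seen tokens
```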
Since you're not supplying any settings to the HF generator, you'd want to check what the defaults are. I believe it defaults to greedy sampling? If so it could be that the English response is simply more likely (given the English system prompt, for instance), meaning my examples would also give you an English response most of the time, but the randomness allows it to choose differently sometimes.
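If it helps, one quick way to see the defaults on the HF side (a sketch, assuming a standard `transformers` setup; the model path is a placeholder):

```python
from transformers import GenerationConfig

# generate() falls back to the model's generation config when no settings are passed.
# do_sample defaults to False, i.e. greedy decoding, unless the config says otherwise.
config = GenerationConfig.from_pretrained("your-model")  # placeholder model path
print(config)
```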
Anyway, try to match those settings in `model.generate`. For consistency, though, I would look at the system prompt. Either write it in Indonesian or add an instruction to respond in the language of the question being asked.
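For what it's worth, a sketch of what matching those settings could look like with the standard `transformers` sampling arguments (the model path and prompt are placeholders, and I'm not promising the repetition penalty behaves identically):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model")               # placeholder path
model = AutoModelForCausalLM.from_pretrained("your-model", device_map="auto")

prompt = "Jelaskan apa itu machine learning."                         # example Indonesian prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    do_sample=True,            # sample instead of greedy decoding
    temperature=0.8,
    top_k=50,
    top_p=0.8,
    repetition_penalty=1.025,  # may not be exactly equivalent to exllamav2's penalty
    max_new_tokens=256,
)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

And if the system prompt stays in English, adding something like "Always respond in the same language as the user's question" (or writing the whole system prompt in Indonesian) is the more reliable way to pin down the output language.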
My model is already fine-tuned to understand Indonesian and English prompts and answer them accordingly. I already tested it in non-quantized mode and the response is what I want. But somehow my HF implementation of exllamav2 is forcing the model to answer in English. Even after I ask it to answer in Indonesian, it answers the first half in English and then suddenly switches to Indonesian.
But using your generation implementation, that's not the case. It always answers in Indonesian (albeit not as good an answer as the non-quantized version, but it still answers in Indonesian). That's what surprises me and makes me think there may be something wrong with the way I implemented it.
> Now, I'm not sure which of the examples you're referring to, but the default sampling settings for most of them are:
> - repetition penalty: 1.025 (should be equivalent to the HF implementation, but I'm not 100% on that)
> - temperature: 0.8
> - top-K: 50
> - top-P: 0.8
I tested this and the result is even weirder and more nonsensical. Sometimes it repeats the question, sometimes it wanders around like a drunk AI.
So, I've been testing text generation with exllamav2, using a config that follows the Hugging Face generator. Here is my script:
Here is what's inside my directory:
Here is how I generate text:
Here is the result from my code:
But using your generation example, my result is here:
And what I want is your generator's answer. Is there something wrong with my implementation? Could you help me find it? I actually want to keep using the HF implementation because I want to unify the logits processors across all kinds of models. That's why I made that implementation.
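To illustrate what I mean by unifying the logits processors: HF processors only need the token ids generated so far and a logits tensor, so in principle the same `LogitsProcessorList` could post-process logits coming from exllamav2 or any other backend. A rough sketch of that idea (not my actual code; the ids and logits here are dummy values):

```python
import torch
from transformers import (
    LogitsProcessorList,
    RepetitionPenaltyLogitsProcessor,
    TemperatureLogitsWarper,
    TopKLogitsWarper,
    TopPLogitsWarper,
)

# A backend-agnostic processor stack: it only sees token ids and a logits tensor,
# so the same list can be applied to logits from exllamav2, HF, or anything else.
processors = LogitsProcessorList([
    RepetitionPenaltyLogitsProcessor(penalty=1.025),
    TemperatureLogitsWarper(temperature=0.8),
    TopKLogitsWarper(top_k=50),
    TopPLogitsWarper(top_p=0.8),
])

input_ids = torch.tensor([[1, 2, 3]])   # tokens generated so far (dummy values)
logits = torch.randn(1, 32000)          # raw next-token logits from any backend (dummy values)

filtered = processors(input_ids, logits)
next_token = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
```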