Maybe the issue arises because of TabbyAPI. I created my own FastAPI server and I am not facing such issues.
I've been working on this for most of the day. I have HF tokenizers working more or less; there are just a few kinks to iron out, because the EXL2 tokenizer does more than just wrap around SentencePiece. There are also apparently bugs in the HF Tokenizer implementation that I have to work around. But getting there.
Great to hear that! Thank you! :)
It should be working now. I've mainly focused on deepseek, and HF tokenization is a deep, deep rabbit hole, so I'll probably need to test a lot more models to fix various edge cases.
It could be. How are you using the model?
I know the system prompt is inconsistent across the deepseek models. You could try with ExUI which I've confirmed works well, at least with my own conversions of 67B-chat and the built-in deepseek prompt format.
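For reference, this is roughly the prompt format I believe the deepseek chat models expect; the helper below is just my own sketch (the name and structure are mine), so verify against the model card, especially the system prompt handling:

```python
def build_deepseek_prompt(system, turns):
    # turns: list of (user_text, assistant_text_or_None); leave the final
    # turn's assistant text as None so generation continues from "Assistant:".
    parts = [system.strip()] if system else []
    for user, assistant in turns:
        parts.append(f"User: {user}")
        parts.append("Assistant:" if assistant is None else f"Assistant: {assistant}")
    return "\n\n".join(parts)

print(build_deepseek_prompt("You are a helpful assistant.",
                            [("Explain RoPE scaling briefly.", None)]))
```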
Okay, so I was able to test that model, and it seems to be fine. The issue you're having is probably that the model was finetuned with a RoPE scaling factor of 4, and ExLlamaV2 doesn't (yet?) automatically read that from the config. But if you run the chat example with `-rs 4` it should work. It also seems to be a bit sensitive to repetition penalty, so I would lower it from the default (1.15) to something like `-repp 1.05`.
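If you're using the Python API directly instead of the chat example, the equivalent of `-rs 4` should be setting the linear RoPE scale on the config before loading. A minimal sketch, assuming the current `ExLlamaV2Config` attribute names (check your version, and the model path is a placeholder):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/path/to/deepseek-model"  # hypothetical path
config.prepare()
config.scale_pos_emb = 4.0  # linear RoPE scaling factor, same effect as -rs 4

model = ExLlamaV2(config)
model.load()
cache = ExLlamaV2Cache(model)
```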
As for running multiple queries on an empty context, I guess it would be a simple feature to add, but the chatbot isn't really meant to be cluttered with too many funky features. ExUI is more advanced with sessions, model loading/unloading, notepad mode etc.
Well, like I said, it's a simple thing, so I just added it, cause why not. Run the chatbot with `--amnesia` and it will forget the context after each response.
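Conceptually, all the flag does is keep no history between turns. A toy sketch of the idea, not the actual chatbot code:

```python
def chat_loop(generate_fn, amnesia=False):
    # generate_fn: any callable that takes a prompt string and returns text
    history = []
    while True:
        user = input("> ")
        history.append(f"User: {user}")
        reply = generate_fn("\n".join(history) + "\nAssistant:")
        print(reply)
        if amnesia:
            history.clear()  # forget the context after each response
        else:
            history.append(f"Assistant: {reply}")
```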
Yes, there are two buttons I need to click. Ahem.. try it again. (:
If you give it some programming task where it needs to generate two long functions, the output starts glitching after 50 lines of code or so:
This could be related to the RoPE scaling. If the model was converted without that setting, the calibration is going to be very off.
From what experiments I and others have done, the calibration dataset doesn't ultimately matter that much. But it's probably a good idea to have some code in there, if nothing else then to make sure all the "coding tokens" and their embeddings are accounted for.
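If you do want code represented in the calibration data, the converter takes a Parquet dataset; a hypothetical sketch of building a mixed one, assuming it reads a `text` column like the standard wikitext Parquet files (the file names here are mine):

```python
# Mix prose and code rows into one calibration Parquet file so the
# "coding tokens" show up during quantization calibration.
import pandas as pd

prose = pd.read_parquet("wikitext-test.parquet")["text"].tolist()
code = [open(p, encoding="utf-8").read() for p in ["sample1.py", "sample2.cpp"]]

pd.DataFrame({"text": prose + code}).to_parquet("calibration-mixed.parquet")
```

Then point the conversion script at the resulting file with its calibration dataset option, if your version exposes one.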
As for presets, no. I don't really believe in samplers as a way to fix bad predictions from language models. So the default is just top-K+top-P, with a slight repetition penalty (which is probably a bit too high by default), and everything else is just there because people have requested it. Locally typical sampling has some good theory behind it, I guess?
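For what it's worth, that default pipeline is nothing exotic. Here is a plain NumPy sketch of top-K + top-P sampling with a repetition penalty, just the textbook algorithm rather than ExLlamaV2's actual implementation:

```python
import numpy as np

def sample(logits, prev_ids=(), top_k=50, top_p=0.9, rep_penalty=1.05):
    logits = np.asarray(logits, dtype=np.float64).copy()
    # Repetition penalty: push down logits of tokens already generated.
    for t in set(prev_ids):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    # Top-K: drop everything below the K-th largest logit.
    if 0 < top_k < len(logits):
        logits[logits < np.sort(logits)[-top_k]] = -np.inf
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-P (nucleus): keep the smallest prefix of tokens whose mass reaches top_p.
    order = np.argsort(-probs)
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))
```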
Cleaning up some issues, and this is technically completed.
Hello! I found some Llama2 models using the `FastTokenizer` provided by the Hugging Face tokenizers library, not the `SentencePiece` package used by regular Llama models, for instance `beomi/llama-2-ko-7b`. It seems the current `ExLlamaV2Tokenizer` only supports the SentencePiece tokenizer, which requires `tokenizer.model`. Can you please add support for Hugging Face tokenizers as well? I tried changing the file `tokenizer.py` to accomplish it, and it worked well in exllama, but in exllamav2 it works well sometimes and sometimes it gives the following errors. It is interesting that sometimes the inference is successful, while sometimes it is not. So, I want to request official support for the Hugging Face Fast Tokenizers.
Thank you! :)
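For anyone else hitting this before official support lands, this is roughly the kind of fallback I mean: load `tokenizer.json` with the `tokenizers` library when `tokenizer.model` is absent. The function and structure below are my own sketch, not exllamav2's actual code:

```python
import os

def load_any_tokenizer(model_dir):
    sp_path = os.path.join(model_dir, "tokenizer.model")
    if os.path.exists(sp_path):
        # Regular Llama models ship a SentencePiece model file.
        from sentencepiece import SentencePieceProcessor
        return SentencePieceProcessor(model_file=sp_path)
    # Models like beomi/llama-2-ko-7b only ship tokenizer.json.
    from tokenizers import Tokenizer
    return Tokenizer.from_file(os.path.join(model_dir, "tokenizer.json"))
```

Note that the two objects expose different APIs (SentencePiece's `encode` returns ids directly, while the HF `Tokenizer.encode` returns an `Encoding` whose ids live in `.ids`), which is part of why wrapping both behind one tokenizer interface is fiddly.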