spuuntries / rp-cogs

Replicate Cogs

FlatDolphinMaid-8x7B replying to itself #1

Open ylhan opened 7 months ago

ylhan commented 7 months ago

Hey spuun (@spuuntries), I've been using your model and it seems to work great, but I've tried to add history/context by adding a ### History: section to the prompt with langchain:

### Instruction:
{system_prompt}

### History:
{history}

### Input:
{prompt}

### Response:

The history of the conversation is injected with AI: and Human: tags. The issue is that this causes the model to begin replying to itself, outputting its own ### Input: and ### Response: sections and stopping only when it reaches the max token limit.
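For reference, the injection looks roughly like this (plain Python for illustration; the real code goes through langchain, and the variable names are mine):

# Illustrative sketch only; the actual code uses langchain.
turns = [("Human", "hi there"), ("AI", "Hello! How can I help?")]
history = "\n".join(f"{role}: {text}" for role, text in turns)

template = (
    "### Instruction:\n{system_prompt}\n\n"
    "### History:\n{history}\n\n"
    "### Input:\n{prompt}\n\n"
    "### Response:\n"
)
full_prompt = template.format(
    system_prompt="You are a helpful assistant.",
    history=history,
    prompt="What did I just say?",
)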

Would be super grateful for some guidance here. Do I need to wrap the input with <s> or [INST]? Or is there a better way to pass in history/context?

ylhan commented 7 months ago

Found this: https://huggingface.co/Undi95/FlatDolphinMaid-8x7B/blob/main/special_tokens_map.json (Undi95/FlatDolphinMaid-8x7B). I tried to wrap my history in <s> and </s>, but that didn't seem to help.

ylhan commented 7 months ago

[screenshot: replicate com_p_e7uoqstb6gnx4nbd75l7ed7o44] Also tried it this way without any luck.

spuuntries commented 7 months ago

Ello ello, sorry for the late resp, just saw this. You can't use a custom insertion point like {history} in the prompt_template atm (and this is a won't-fix afaict for now).

As for the model replying to itself, right now there really isn't a 100% fool-proof way to go about it. What I've been doing is formatting it like this (with the default prompt_template): [image] and then slicing the output with a regex, but that does still break at times.
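The slicing is roughly this kind of thing (Python; the exact pattern I use is a bit different, this is just the idea):

import re

# Cut the generation off at the first "###" header it starts hallucinating,
# then strip whitespace. Illustrative only.
def slice_reply(generated: str) -> str:
    return re.split(r"\n?###", generated, maxsplit=1)[0].strip()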

What we can do, and afaik what a lot of other implementations do, is add a stop parameter as supported by llama_cpp_python (this'll stop generation early the moment it encounters the specified strings, so e.g. have it stop at ###) then trim the excess.
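With llama-cpp-python that'd look roughly like this (the model path and numbers are placeholders, not what the cog actually ships with):

from llama_cpp import Llama

# Placeholder path and params; the point is the stop parameter.
llm = Llama(model_path="flatdolphinmaid-8x7b.Q4_K_M.gguf", n_ctx=4096)

prompt = "### Instruction:\n...\n### Response:\n"  # the formatted Alpaca-style prompt
out = llm(
    prompt,
    max_tokens=512,
    stop=["###"],  # stop generation the moment "###" shows up
)
text = out["choices"][0]["text"].strip()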

But that's not in the cog atm, as when I made the cog I just wanted a drop-in interop with my codebase. I'll add it when I have the time; these model files are a bit big, and the way I'm formatting my cogs isn't great for continuous development (i.e., I have to re-upload the entire model file every time I make a change).

spuuntries commented 7 months ago

Oh, you can also just limit the response's max tokens to reduce the possibility of "overflowing", I guess. Most responses only take around 200-256 tokens max if you're doing RP, at least.
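If you're calling it through the Replicate Python client, that's just capping the max-tokens input, something like this (the model ref and the input field name are placeholders; check the cog's actual input schema):

import replicate

# Placeholder model ref and input names; the real field name depends on
# the cog's predict() signature.
output = replicate.run(
    "spuuntries/<model>:<version>",
    input={"prompt": "### Instruction: ...", "max_tokens": 256},
)
print("".join(output))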

ylhan commented 7 months ago

Thank you so much Spuun! I'll try using regex to parse out the first response instead. I'm a bit confused by this part tho

What we can do, and afaik what a lot of other implementations do, is add a stop parameter as supported by llama_cpp_python (this'll stop generation early the moment it encounters the specified strings, so e.g. have it stop at ###) then trim the excess.

How is this different than just using regex to extract everything up to the first ###?

Also I'm a bit curious, what would fixing this look like? I'm a bit surprised this is an issue at all because I was able to include history and have the base model (mistralai/mixtral-8x7b-instruct-v0.1:5d78bcd7a992c4b793465bcdcf551dc2ab9668d12bb7aa714557a21c1e77041c) output just the response with the following prompt:

<s>[INST]
Answer any human prompt with this context:
{history}
</s>
[/INST]
<s>
[INST]
{input}
[/INST]
spuuntries commented 7 months ago

How is this different than just using regex to extract everything up to the first ###?

Not much really, just that stopping early would cut down on costs, since you wouldn't be "overflowing" in the first place.

As for the other question,

output just the response with the following prompt

the reason it does this is that </s> is the EOS token for mixtral-8x7b-instruct-v0.1 (and FlatDolphinMaid-8x7B) and it's essentially parsed the same way as the stop parameter. When generation encounters it, it stops early.

The reason we're encountering the issue atm is that ### doesn't count as an EOS by default, so generation doesn't stop early when encountering it.

You don't even need to include the duplicate <s> and </s> afaik; technically you can just go:

<s>[INST]
Answer any human prompt with this context:
# insert history here
[/INST]
[INST]
{prompt}
[/INST]
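In Python terms that'd be built roughly like this (variable names are mine, purely illustrative):

# Flatten the conversation into the context block, then append the new
# user message. Adapt to however you store history.
turns = [("Human", "hi there"), ("AI", "Hello! How can I help?")]
user_message = "What did I just say?"

history_lines = "\n".join(f"{role}: {text}" for role, text in turns)
full_prompt = (
    "<s>[INST]\n"
    "Answer any human prompt with this context:\n"
    f"{history_lines}\n"
    "[/INST]\n"
    "[INST]\n"
    f"{user_message}\n"
    "[/INST]"
)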

I'm surprised your template worked at all tbh, since it has multiple BOS and EOS tokens; I was expecting the model to break. BOS and EOS are usually supposed to be handled by the preprocessor, not manually inputted.

Looking at how mistralai/mixtral-8x7b-instruct-v0.1 is implemented, they're using vLLM as the inference engine (my cogs use llama-cpp-python), so maybe it preprocesses things differently.

ylhan commented 7 months ago

Thanks for the pointers here! I'll try including the dialogue in the prompt itself.

ylhan commented 7 months ago

Might be a dumb question, but why was the prompt format changed from the original Mistral prompt at all? Does it offer some advantage I'm not aware of?

spuuntries commented 7 months ago

Whoops, just saw this.

Ngl, not entirely sure, but that's what Undi recommended on his original model release page. Most likely it's because NeverSleep/Noromaid-v0.1-mixtral-8x7b-Instruct-v3, the base of this model merge, was finetuned on the Alpaca format.

You can try using the original format and get back to me ig lol, cuz I've never tried it either tbh, but I feel like it'd break a good bit of the model's capabilities.