Closed: vlbosch closed this issue 1 month ago
@vlbosch Thanks for the fix. For the "<|im" issue, I think a better approach is to treat the EOS token just like any other stop word. Unlike the original mlx-lm implementation, my stopping-condition approach is string-based rather than token-based, which I believe is more robust against tokenization effects: the same string can be encoded as different token sequences depending on its position in the text. So it is safer to handle stop words in string form rather than token form.
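For illustration, here is a minimal sketch of what a string-based stop check could look like. This is not the actual mlx_textgen code; the function name `check_stop_strings` and the example stop words are assumptions.

```python
# Minimal sketch of a string-based stopping check (illustrative only;
# `check_stop_strings` is a hypothetical name, not mlx_textgen's API).
from typing import List, Optional, Tuple

def check_stop_strings(generated: str, stop_strings: List[str]) -> Optional[Tuple[str, int]]:
    """Return (matched stop string, index where it starts), or None if no match.

    Working on the decoded text instead of token ids avoids the problem that
    the same stop string can be produced by different token sequences
    depending on where it appears in the text.
    """
    for stop in stop_strings:
        idx = generated.find(stop)
        if idx != -1:
            return stop, idx
    return None

# Usage: treat the EOS string (e.g. "<|im_end|>") like any other stop word.
stop_words = ["<|im_end|>", "\nUser:"]
text = "The answer is 42.<|im_end|>"
match = check_stop_strings(text, stop_words)
if match is not None:
    stop, idx = match
    text = text[:idx]  # drop the stop word and anything after it
```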
@vlbosch The problem you mentioned with "<im_" should be fixed now; it turned out that the wrong number of characters was being trimmed off after the detokenizer was finalised. Instead of trimming, the text is now split on the stop string, so the final chunk of text returned should be correct.
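A rough sketch of the split-based idea, assuming the stop string is known; this is illustrative only and not the actual fix in mlx_textgen:

```python
# Illustrative sketch: return the final chunk by splitting on the stop string
# instead of trimming a computed number of characters.
def final_chunk(detokenized: str, stop_string: str) -> str:
    """Return the text up to (and excluding) the first occurrence of the stop string.

    Splitting avoids having to compute how many characters to trim after the
    detokenizer is finalised, which is where the "<|im" artefact came from.
    """
    return detokenized.split(stop_string)[0]

print(final_chunk("Hello there!<|im_end|>", "<|im_end|>"))  # -> "Hello there!"
```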
@nath1295 Thanks for the quick fixes! I can confirm that it now works correctly. The prompt caching in particular works great, so props for the hard work!
I'm running into some stream errors when stopping the generation manually, but I'll create a new issue for that.
```
…/python3.12/site-packages/mlx_textgen/utils.py", line 139, in stopping_criteria
    return StopCondition(stop_met=True, trim_length=len(text.split(eos_tuple)[-1]) + eos_tuple[1])
TypeError: must be str or None, not list
```
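For context, the error can be reproduced outside the library: `str.split()` only accepts a single string (or None) as the separator, so it raises if a list reaches it. The structure of `eos_tuple` below is an assumption for illustration, not taken from the actual code.

```python
# Minimal reproduction of the TypeError above (eos_tuple's structure is assumed).
text = "Hello there!<|im_end|>"
eos_tuple = ["<|im_end|>", 10]  # a list separator makes str.split() raise

try:
    text.split(eos_tuple)
except TypeError as err:
    print(err)  # -> must be str or None, not list

# One way to avoid it is to make sure the separator passed to split() is a plain string:
separator = eos_tuple[0] if isinstance(eos_tuple, (list, tuple)) else eos_tuple
trailing = len(text.split(separator)[-1])
```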
TODO: When streaming with Qwen, the stream still ends with "<|im" -> check with other models whether their EOS token is trimmed correctly.