noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Using llama-cpp-python logits escapes RegEx constraints. Negative Regex Charset Matches don't work? #70

Open michael-newsrx opened 4 months ago

michael-newsrx commented 4 months ago

I'm using patterns such as ([^"][^\n]+[^"]) and ([^"]([^>/;:\n]+)( > [^>/;:\n]+)+[^"]), which contain negated character classes, but I'm still getting the unwanted characters in the output.

Is this a bug, or does the regex library not support negated classes? I looked at the regex library's site, and it was unclear to me what it does or does not support.

michael-newsrx commented 4 months ago

Here is an example that produces unwanted commas and none of the required semicolons:

Pattern: - Keywords: [^;:,/\n\r]+; [^;:,/\n\r]+; [^;:,/\n\r]+; [^;:,/\n\r]+; [^;:,/\n\r]+

LLM Output: - Keywords: intranasal vaccine, long-lasting immunity, mucosal antibody response, T cells, adjuvants, IgG antibodies
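The mismatch can be confirmed with plain Python re, independently of lm-format-enforcer (a stdlib check of the pattern itself, not of the library's enforcement): the pattern rejects the comma-separated output, so the constraint is evidently not being applied during generation.

```python
import re

# The keyword-list pattern from the report: five fields separated by "; ",
# each field forbidding ; : , / and newlines.
pattern = r"- Keywords: [^;:,/\n\r]+; [^;:,/\n\r]+; [^;:,/\n\r]+; [^;:,/\n\r]+; [^;:,/\n\r]+"

# The output the LLM actually produced (comma-separated, no semicolons).
llm_output = ("- Keywords: intranasal vaccine, long-lasting immunity, "
              "mucosal antibody response, T cells, adjuvants, IgG antibodies")

# A string that does conform to the pattern, for comparison.
conforming = "- Keywords: vaccines; immunity; antibodies; T cells; adjuvants"

print(re.fullmatch(pattern, llm_output) is None)      # True: output violates the pattern
print(re.fullmatch(pattern, conforming) is not None)  # True: the pattern itself is fine
```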

noamgat commented 4 months ago

Very weird. I added a unit test to check for this exact case here: https://github.com/noamgat/lm-format-enforcer/commit/7cfb693495e9ba3305e230cc62e05b383b6c717a, and it passes (a good test to have anyway, so I'm keeping it). Can you share a reproducing sample? Maybe there's a UTF character very similar to a comma that the LLM is outputting? For this specific case, maybe a positive charset (a-zA-Z etc.) would work better.
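Both suggestions above can be sketched with the stdlib (these checks are illustrative and not part of lm-format-enforcer's API): scan the output for Unicode comma lookalikes, and try a positive-charset variant of the pattern.

```python
import re
import unicodedata

llm_output = ("- Keywords: intranasal vaccine, long-lasting immunity, "
              "mucosal antibody response, T cells, adjuvants, IgG antibodies")

# 1) Look for Unicode lookalikes of the comma (e.g. U+FF0C FULLWIDTH COMMA,
#    U+060C ARABIC COMMA) that a negated ASCII class would not exclude.
suspects = [ch for ch in set(llm_output)
            if "COMMA" in unicodedata.name(ch, "") and ch != ","]
print(suspects)  # [] here: the pasted output contains only plain ASCII commas

# 2) Positive-charset variant: allow only letters, digits, spaces and hyphens
#    in each keyword, instead of blacklisting characters.
positive = r"- Keywords: [A-Za-z0-9 \-]+(; [A-Za-z0-9 \-]+){4}"
print(re.fullmatch(positive, llm_output) is None)  # True: comma output still rejected
```

Check 1 only rules out lookalikes in the text as pasted into the issue; a reproducing sample would still be needed to inspect the raw token bytes the model emitted.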

michael-newsrx commented 4 months ago

Attached is an example Jupyter notebook that exhibits the LLM escaping the regex, along with the Python environment.yml for the test.

This uses the llama-cpp-python GGUF build of Mixtral-8x7B-Instruct-v0.1. The code assumes a 24 GB GPU; see the function load_mixtral_8x7b_Q4 to adjust (the offload_kqv and layers variables). You may also need to adjust the n_threads parameter.

bug_reproduce.ipynb.tar.gz

environment.yml.tar.gz

michael-newsrx commented 4 months ago

Anyone looking into this?

noamgat commented 4 months ago

I have limited time to work on LMFE, and the last release was focused around TensorRT-LLM. I hope to get to this for the next release.