Open michael-newsrx opened 4 months ago
Example: the output contains unwanted commas and none of the semicolons that the pattern requires.
Pattern: - Keywords: [^;:,/\n\r]+; [^;:,/\n\r]+; [^;:,/\n\r]+; [^;:,/\n\r]+; [^;:,/\n\r]+
LLM Output: - Keywords: intranasal vaccine, long-lasting immunity, mucosal antibody response, T cells, adjuvants, IgG antibodies
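A quick way to confirm that the output above really violates the pattern is to check it offline with Python's standard `re` module. This is only a sketch: `re` is not necessarily the engine the enforcer uses internally, and the helper name `conforms` is made up for illustration. It shows the pattern's intent, not how the enforcer applies it.

```python
import re

# Pattern and output copied verbatim from the report above.
PATTERN = (
    r"- Keywords: [^;:,/\n\r]+; [^;:,/\n\r]+; [^;:,/\n\r]+; "
    r"[^;:,/\n\r]+; [^;:,/\n\r]+"
)
OUTPUT = (
    "- Keywords: intranasal vaccine, long-lasting immunity, "
    "mucosal antibody response, T cells, adjuvants, IgG antibodies"
)


def conforms(pattern: str, text: str) -> bool:
    """Return True only if the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    # The reported output uses commas, which every item class excludes,
    # and has none of the "; " separators, so it should not conform.
    print(conforms(PATTERN, OUTPUT))
    # A string with five semicolon-separated items should conform.
    print(conforms(PATTERN, "- Keywords: a; b; c; d; e"))
```

If the first check reports a non-match, the text was not produced under this pattern, which points at the enforcement step rather than at the pattern itself.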
Very weird. I added a unit test to check for this exact case here: https://github.com/noamgat/lm-format-enforcer/commit/7cfb693495e9ba3305e230cc62e05b383b6c717a It passes (it's a good test to have anyway, so I'm keeping it). Can you share a reproducing sample? Maybe the LLM is outputting a Unicode character that looks very similar to a comma? For this specific case, maybe a positive character set (a-zA-Z, etc.) would work better?
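Both suggestions can be tried offline. The sketch below uses only the standard library. The helper `describe_suspicious` and the constants `NEGATED` and `POSITIVE` are hypothetical names, and the allowed characters in `POSITIVE` are an assumption about what a keyword may contain. It lists every non-ASCII or forbidden character in a string, and it shows that a negated class admits a look-alike such as the fullwidth comma (U+FF0C) while a positive class does not.

```python
import re
import unicodedata

# Negated-class pattern from the report.
NEGATED = "- Keywords: " + "; ".join([r"[^;:,/\n\r]+"] * 5)
# Assumed positive alternative: ASCII letters, digits, spaces and hyphens only.
POSITIVE = "- Keywords: " + "; ".join([r"[A-Za-z0-9 -]+"] * 5)


def describe_suspicious(text, forbidden=";:,/"):
    """List (index, char, code point, Unicode name) for forbidden or non-ASCII chars."""
    report = []
    for i, ch in enumerate(text):
        if ch in forbidden or ord(ch) > 127:
            name = unicodedata.name(ch, "UNKNOWN")
            report.append((i, ch, f"U+{ord(ch):04X}", name))
    return report


if __name__ == "__main__":
    # Same shape as a valid answer, but with a fullwidth comma inside one item.
    lookalike = "- Keywords: a\uff0cb; c; d; e; f"
    for entry in describe_suspicious(lookalike):
        print(entry)
    print(re.fullmatch(NEGATED, lookalike) is not None)   # negated class accepts it
    print(re.fullmatch(POSITIVE, lookalike) is not None)  # positive class rejects it
```

Running `describe_suspicious` on the raw model output would show whether the commas are plain U+002C or something else.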
Attached is an example Jupyter notebook in which the LLM output escapes the regex. I've also attached the Python environment.yml for the test.
This is for the Llama-cpp-python GGUF Mixtral-8x7b-Instruct-0.1. The code assumes a 24GB GPU. See function load_mixtral_8x7b_Q4 to adjust (variables offload_kqv and layers). You also might need to adjust the parameter n_threads.
Anyone looking into this?
I have limited time to work on LMFE, and the last release was focused around TensorRT-LLM. I hope to get to this for the next release.
I'm using patterns such as ([^"][^\n]+[^"]) and ([^"]([^>/;:\n]+)( > [^>/;:\n]+)+[^"]), which contain negated character classes, but I'm still getting the unwanted characters. Is this a bug, or does the regex library not support negated classes? I looked at the regex library's site, and it was unclear to me what it does and does not support.
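Separately from what the enforcer does, it may be worth checking what these two patterns allow on their own. The sketch below uses Python's standard `re` module, which is an assumption and not necessarily the engine the enforcer uses. It illustrates that `[^"]` only excludes the double quote, so the first and last characters may be any other character, including `;`, `>` or a newline, and that the middle `[^\n]+` of the first pattern still admits quotes. The sample strings are invented for illustration.

```python
import re

# Patterns copied verbatim from the comment above.
P1 = r'([^"][^\n]+[^"])'
P2 = r'([^"]([^>/;:\n]+)( > [^>/;:\n]+)+[^"])'


def accepts(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    # P1: quotes are excluded only at the two ends, not in the middle.
    print(accepts(P1, 'a"b"c'))        # quotes in the middle are allowed
    print(accepts(P1, '"abc'))         # a leading quote is rejected

    # P2: the leading and trailing [^"] may match ';' and even '\n'.
    print(accepts(P2, ";a > b\n"))     # ';' first and newline last are allowed
    print(accepts(P2, "xa;b > cy"))    # ';' inside an item is rejected
```

If the unwanted characters show up at the start or end of the output, or are quotes in the middle, the patterns themselves permit them. If they show up inside an item of the second pattern, that would not be explained by this.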