support for Unicode regular expressions

skunkwerk commented 1 month ago

What behavior of the library made you think about the improvement?

I'm trying to restrict the output of a multi-lingual LLM to a single language (Korean), as it was trained in multiple languages and sometimes mixes them in the output.

With this regular expression:

([0-9\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uD7B0-\uD7FF\n,\s\.\']*)

I get the error:

Interegular: regex module unicode properties are not supported

How would you like it to behave?

There should be a way to restrict output to a specific language's character set.

plaunezkiy commented 4 days ago

The grammar is parsed via Lark. Have a look at their docs, import the Unicode functionality, and try again: https://lark-parser.readthedocs.io/en/stable/grammar.html#import

lapp0 commented 4 days ago

It's not obvious to me why your expression fails, but generator = generate.regex(model, r'[😨]+') works. Maybe we need to update Outlines so it allows escaped Unicode along with literal Unicode?

Could you leave the issue open so we can address this at some point, but for now, try the literals instead? e.g. instead of \uAC00 use 가
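
A minimal sketch of that workaround, assuming the failure comes from the \uXXXX escapes and not from the ranges themselves: build the character class from literal characters with chr(), so the pattern string contains no \u sequences. The names RANGES and literal_class are hypothetical helpers for illustration, not part of Outlines, and whether Outlines accepts the resulting pattern has not been verified here. The check at the end uses only Python's built-in re module.

```python
import re

# Code point ranges copied from the pattern in the original report.
RANGES = [
    (0xAC00, 0xD7AF),
    (0x1100, 0x11FF),
    (0x3130, 0x318F),
    (0xA960, 0xA97F),
    (0xD7B0, 0xD7FF),
]


def literal_class(ranges, extra=""):
    """Return a character class whose range endpoints are literal characters.

    Hypothetical helper: chr() turns each code point into the character
    itself, so no \\uXXXX escape ends up in the pattern string.
    """
    body = "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in ranges)
    return f"[{extra}{body}]*"


# Same extra characters as the original pattern: digits, newline, comma,
# whitespace, period, and apostrophe.
pattern = "(" + literal_class(RANGES, extra=r"0-9\n,\s\.'") + ")"

# Sanity check with the standard library only.
korean_ok = re.fullmatch(pattern, "안녕하세요, 123") is not None
english_ok = re.fullmatch(pattern, "hello") is not None
```

If the literal form is accepted, the pattern would then be passed the same way as in the emoji example, i.e. generate.regex(model, pattern).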

outlines-dev / outlines

support for Unicode regular expressions #937
