Two regexes that should be equivalent, aren't

wjn0 commented 4 months ago

The bug

I've got two regexes that I think should be equivalent, based on what I know about regexes in Python: '[\s\S]+' and '[.]+'. They produce different output.

To Reproduce

from transformers import AutoModelForCausalLM, AutoTokenizer

import guidance

model_name = "gpt2"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
lm = guidance.models.Transformers(model=model, tokenizer=tokenizer)

lm += 'A phone number: "'
# The following _should be_ equivalent IMO (raw strings keep the backslashes intact):
# lm += guidance.gen(max_tokens=10, regex=r'[\s\S]+', stop_regex='"')
lm += guidance.gen(max_tokens=10, regex=r'[.]+', stop_regex='"')
print(lm)
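
As a baseline that does not involve guidance at all, the two patterns can be compared with Python's built-in `re` module. This is only a sketch of how the standard library treats them, and the sample strings are made up for illustration. One thing worth keeping in mind: inside a character class, `re` treats `.` as a literal dot rather than a wildcard, so the two patterns may not be equivalent even in plain Python.

```python
import re

any_char = re.compile(r'[\s\S]+')   # whitespace or non-whitespace, i.e. any character
dot_class = re.compile(r'[.]+')     # inside [...], "." is a literal dot in Python's re

samples = ['555-1234', '...', 'a.b']  # made-up sample strings

for s in samples:
    # Print whether each pattern matches the entire sample string.
    print(repr(s), bool(any_char.fullmatch(s)), bool(dot_class.fullmatch(s)))
```

If the goal is "any character", `[\s\S]` (or `.` outside a character class, with `re.DOTALL`) is the usual spelling in Python's `re`; whether guidance's regex support accepts the same forms is a separate question.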

System info

OS: RHEL 9
Guidance Version (guidance.__version__): 0.1.15

wjn0 commented 4 months ago

There's a decent chance this isn't actually a bug. If so, it would be super cool to have a reference for which kinds of regexes/grammars are valid, or to know whether this is some edge case or I'm missing something super obvious :) Cheers, and thanks for a great library.

hudson-ai commented 4 months ago

This is in fact a bug; thanks for reporting it! The underlying library we use to parse regular expressions isn't fully aligned with Python's regex engine. What you're seeing is that it's parsing \S as a literal S rather than as [^\s]. I'm currently working on a PR to fix these regex issues (#854), just FYI :)
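
The effect of that mis-parse can be illustrated with Python's `re` module alone. This is a rough sketch, not guidance's actual parser: the pattern `[\sS]+` is a hypothetical stand-in for what a parser would end up with if it read `\S` as a literal `S`, and the sample text is made up.

```python
import re

text = '555-1234'  # made-up sample

# In Python's re, \S and [^\s] accept the same characters.
assert re.fullmatch(r'\S+', text) and re.fullmatch(r'[^\s]+', text)

# Intended pattern: whitespace or non-whitespace, i.e. any character.
intended = re.compile(r'[\s\S]+')

# Hypothetical stand-in for the mis-parse: \S read as a literal "S",
# leaving a class that accepts only whitespace and the letter S.
misparsed = re.compile(r'[\sS]+')

print(bool(intended.fullmatch(text)))    # True
print(bool(misparsed.fullmatch(text)))   # False
print(bool(misparsed.fullmatch('S S')))  # True
```

Under this reading, a generation constrained by the mis-parsed class could only emit whitespace and `S` characters, which would look very different from the intended "anything goes" pattern.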

wjn0 commented 4 months ago

Thanks very much for the info! I'll follow that PR.

guidance-ai / guidance

Two regexes that should be equivalent, aren't #872