outlines-dev / outlines

Structured Text Generation
https://outlines-dev.github.io/outlines/
Apache License 2.0
8.13k stars 410 forks source link

Make `CFGFSM` LALR(1) #588

Open lapp0 opened 7 months ago

lapp0 commented 7 months ago

What behavior of the library made you think about the improvement?

Currently CFGFSM is LR(0).

Before @benlipkin's changes in https://github.com/outlines-dev/outlines/pull/544, terminals would always terminate if they legally could. Now they only terminate if the next generated token isn't valid in the current RegexFSM.

This is a substantial improvement, and brings us closer now terminal evaluation is greedy, but the parser still isn't LALR(1).

Consider the grammar

?start: (COMMENT | REGEXP)*

COMMENT: "/\s*/" "//" /[^\n]/*
REGEXP: "/aaa/"

%import common.WS_INLINE
%ignore WS_INLINE

The first RegexFSM will be a union of COMMENT | REGEXP | WS_INLINE

with a generation of / (note the white-space prefix), we will traverse the FSM. It will not terminate because COMMENT is still valid. But now we cannot legally terminate the FSM, and the only legal next character is another / - a continuation of COMMENT.

How would you like it to behave?

The next character, a should be legal because the generation /aaa/ follows the grammars production rules and would lex as (WS_INLINE, REGEXP).

Being an iterative parser, Outlines cannot actually "look ahead", so we need a fallback mechanism that guarantees all production rules are considered at each step in order to achieve LALR(1).

brandonwillard commented 6 months ago

See the code in the parsing module.