noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License
994 stars, 45 forks

[REQUEST] EBNF parsing #78

Open bdashore3 opened 4 months ago

bdashore3 commented 4 months ago

This library is amazing, and I'm currently using it in TabbyAPI to constrain JSON generation. However, it would also be great if this library could parse the EBNF format. That would make it possible to specify an abstract grammar, which allows more flexibility than JSON schemas alone.

Currently, I'm using Outlines to accomplish this (it uses the Lark parser under the hood), but it's extremely heavy and slow (and requires another dependency).

Relevant information:
- Lark: Grammar documentation (with examples)
- Outlines: CFG FSM
- Transformers: Pull Request

noamgat commented 4 months ago

This could indeed be a great contribution. If someone creates a PR, I will review and accept it. Until then, I am leaving this open to see how much demand there is for it. Vote with your emojis!

bdashore3 commented 2 months ago

@noamgat it's been a couple of months, and it looks like there's a lot of demand for this feature. I took a closer look at the implementations I linked in the first comment, and it turns out that the author of the Transformers PR has published a standalone library here. It would be best to port that library's approach to LMFE, since it parses the GBNF format (which is what most grammars use anyway).

Even though grammars aren't used much (I personally recommend migrating to JSON schemas), implementing GBNF-style parsing based on that library rather than Outlines' EBNF would make most existing grammar files and tools compatible without rewrites.
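
For readers unfamiliar with the notation, GBNF grammars are written as `name ::= body` rules, with `#` starting a comment. Below is a minimal sketch of the first step any such parser needs, splitting a grammar file into named rules. The sample grammar and the `split_rules` helper are hypothetical illustrations, not code from LMFE or from the linked library, and the sketch assumes one rule per line with no inline comments.

```python
import re

# Hypothetical sample grammar in GBNF-style notation (illustration only).
SAMPLE_GRAMMAR = '''
# the answer is "yes" or "no", followed by a newline
root   ::= answer "\\n"
answer ::= "yes" | "no"
'''

# A rule name, the "::=" separator, then the rule body up to the end of the line.
RULE_RE = re.compile(r'^\s*([A-Za-z][A-Za-z0-9-]*)\s*::=\s*(.+?)\s*$')


def split_rules(grammar_text):
    """Map each rule name to its unparsed body string.

    Simplifying assumptions: one rule per line, and comments only on
    their own lines. Rule bodies are returned as raw text, not parsed.
    """
    rules = {}
    for line in grammar_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # skip blank lines and full-line comments
        match = RULE_RE.match(line)
        if match is None:
            raise ValueError(f"not a rule definition: {line!r}")
        name, body = match.groups()
        rules[name] = body
    return rules


rules = split_rules(SAMPLE_GRAMMAR)
```

Turning each rule body into alternatives, sequences and repetitions is the larger part of the work and is left out here.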

noamgat commented 2 months ago

Thanks for looking into it. From what I can see, the library you posted is token-centric, while LMFE's CharacterLevelParser API is character-centric. I don't know how easy it would be to convert between the two. However, if someone finds a way to do it and implements it with the CharacterLevelParser API, I will happily approve the PR.
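
To make the token-centric versus character-centric distinction concrete, here is a minimal, standalone sketch. It does not import LMFE. The method names `add_character`, `get_allowed_characters` and `can_end` are assumed to mirror the CharacterLevelParser interface and should be checked against the library. `LiteralChoiceParser`, `token_is_allowed`, `allowed_token_ids` and the toy vocabulary are hypothetical. The idea shown is that a character-level parser can answer a token-level question by walking each candidate token one character at a time.

```python
class LiteralChoiceParser:
    """Toy character-level parser that accepts exactly one of a fixed set of strings.

    Hypothetical example class. Method names are assumed to mirror LMFE's
    CharacterLevelParser interface. The parser is immutable: add_character
    returns a new parser instead of mutating this one.
    """

    def __init__(self, choices, prefix=""):
        self.choices = tuple(choices)
        self.prefix = prefix

    def add_character(self, new_character):
        # Advance the parser state by one character.
        return LiteralChoiceParser(self.choices, self.prefix + new_character)

    def get_allowed_characters(self):
        # Characters that extend the current prefix toward at least one choice.
        n = len(self.prefix)
        allowed = {
            choice[n]
            for choice in self.choices
            if choice.startswith(self.prefix) and len(choice) > n
        }
        return "".join(sorted(allowed))

    def can_end(self):
        # Generation may stop only when the prefix is a complete choice.
        return self.prefix in self.choices


def token_is_allowed(parser, token):
    """Check a multi-character token by feeding it one character at a time."""
    for character in token:
        if character not in parser.get_allowed_characters():
            return False
        parser = parser.add_character(character)
    return True


def allowed_token_ids(parser, vocabulary):
    """Token-level view: ids of all tokens the character-level parser accepts next."""
    return [
        token_id
        for token_id, token in vocabulary.items()
        if token and token_is_allowed(parser, token)
    ]


parser = LiteralChoiceParser(["yes", "no"])
# Hypothetical toy vocabulary mapping token ids to token strings.
vocabulary = {0: "y", 1: "ye", 2: "no", 3: "maybe", 4: "nope"}
next_ids = allowed_token_ids(parser, vocabulary)
```

This direction (character-level parser driving token masks) is the easy one. Going the other way, adapting a grammar engine that tracks state per token into a per-character interface, is the open question raised in the comment above.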