outlines-dev / outlines

Structured Text Generation
https://outlines-dev.github.io/outlines/
Apache License 2.0

Outlines OOM w/ constrained json schema #658

Open fmmoret opened 6 months ago

fmmoret commented 6 months ago

Describe the issue as clearly as possible:

This uses up a huge amount of RAM and hangs for a long time -- about 1600 s on my machine, a current-generation gaming laptop with 32 GB of RAM. Relaxing some of the key constraints, such as `constr(max_length=100)` in `d`, reduces memory usage and initial computation time dramatically; the cost appears to grow roughly exponentially with these constraint sizes.

What are the recommended workarounds? Is there a cacheless / lazy-evaluation mode? Could we split the JSON FSM into multiple FSMs and concatenate them, to reduce the blow-up at all of the control-flow points in the regex?

Steps/code to reproduce the bug:

from pydantic import BaseModel, Field, conlist, constr
import outlines

class Example(BaseModel):
    a: str = Field(
        ...,
        max_length=60,
    )
    b: conlist(
        constr(max_length=20),
        max_length=5,  # type: ignore
    ) = Field(
        ...,
    )
    c: str = Field(
        ...,
        max_length=200,
    )
    d: conlist(
        constr(max_length=100),
        max_length=5,  # type: ignore
    ) = Field(
        ...,
    )
    e: conlist(
        constr(max_length=20),
        max_length=15,  # type: ignore
    ) = Field(...)
    f: conlist(
        constr(max_length=30),
        max_length=15,  # type: ignore
    ) = Field(
        ...,
    )

model = outlines.models.transformers(
    "gradientai/gradient-tinystories-15m"
)  # llama tokenizer + tiny model
# Construct structured sequence generator
from time import time

start = time()
generator = outlines.generate.json(model, Example)
end = time()
print(f"Time to construct generator: {end - start:.2f}s")

Expected result:

The generator should be constructed without running out of memory and without taking this long.

Error message:

No response

Outlines/Python version information:

Version information

```
0.0.27
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
```

Context for the issue:

No response

lapp0 commented 6 months ago
outlines.fsm.json_schema.build_regex_from_object(json.dumps(Example.model_json_schema()))
'\\{[\\n ]*"a"[\\n ]*:[\\n ]*"(?:[^"\\\\\\x00-\\x1f\\x7f-\\x9f]|\\\\.){,60}"[\\n ]*,[\\n ]*"b"[\\n ]*:[\\n ]*\\[[\\n ]*(("(?:[^"\\\\\\x00-\\x1f\\x7f-\\x9f]|\\\\.){,20}")(,[\\n ]*("(?:[^"\\\\\\x00-\\x1f\\x7f-\\x9f]|\\\\.){,20}")){0,4})?[\\n ]*\\][\\n ]*,[\\n ]*"c"[\\n ]*:[\\n ]*"(?:[^"\\\\\\x00-\\x1f\\x7f-\\x9f]|\\\\.){,200}"[\\n ]*,[\\n ]*"d"[\\n ]*:[\\n ]*\\[[\\n ]*(("(?:[^"\\\\\\x00-\\x1f\\x7f-\\x9f]|\\\\.){,100}")(,[\\n ]*("(?:[^"\\\\\\x00-\\x1f\\x7f-\\x9f]|\\\\.){,100}")){0,4})?[\\n ]*\\][\\n ]*,[\\n ]*"e"[\\n ]*:[\\n ]*\\[[\\n ]*(("(?:[^"\\\\\\x00-\\x1f\\x7f-\\x9f]|\\\\.){,20}")(,[\\n ]*("(?:[^"\\\\\\x00-\\x1f\\x7f-\\x9f]|\\\\.){,20}")){0,14})?[\\n ]*\\][\\n ]*,[\\n ]*"f"[\\n ]*:[\\n ]*\\[[\\n ]*(("(?:[^"\\\\\\x00-\\x1f\\x7f-\\x9f]|\\\\.){,30}")(,[\\n ]*("(?:[^"\\\\\\x00-\\x1f\\x7f-\\x9f]|\\\\.){,30}")){0,14})?[\\n ]*\\][\\n ]*\\}'

Yes, this is a very complex pattern and results in over 5,000 FSM states, each of which contains a set of valid tokens generated by traversing the FSM.
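
For reference, here is a rough sketch of how one could inspect that state count for the pattern above (using the `Example` model from the repro); it assumes the `interegular.parse_pattern(...).to_fsm()` API that Outlines relies on internally, so attribute names may differ between versions:

```python
import json

import interegular  # regex-to-FSM library used by Outlines under the hood
from outlines.fsm.json_schema import build_regex_from_object

# Build the same pattern as above and compile it to a raw FSM,
# then count its states (before any token-vocabulary indexing).
pattern = build_regex_from_object(json.dumps(Example.model_json_schema()))
fsm = interegular.parse_pattern(pattern).to_fsm()
print(len(fsm.states))
```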

I agree 100% with the idea of splitting unions into multiple FSMs. For concatenated patterns it might become more complicated.
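
A very rough, hypothetical sketch of the union-splitting idea (none of these classes or methods are the real Outlines API): compile each branch of an alternation to its own small FSM and merge the allowed-token sets at generation time.

```python
class UnionFSM:
    """Hypothetical wrapper: run the FSM of every union branch in parallel."""

    def __init__(self, branches):
        self.branches = branches                      # one small compiled FSM per branch
        self.states = [b.initial for b in branches]   # current state per branch (None = dead)

    def allowed_tokens(self):
        # The union's allowed tokens are the union of each live branch's allowed tokens.
        allowed = set()
        for fsm, state in zip(self.branches, self.states):
            if state is not None:
                allowed |= fsm.allowed_tokens(state)
        return allowed

    def advance(self, token):
        # Step every branch; branches that cannot accept the token die off.
        self.states = [
            fsm.next_state(state, token) if state is not None else None
            for fsm, state in zip(self.branches, self.states)
        ]
```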

Related: https://github.com/outlines-dev/outlines/discussions/640

fmmoret commented 6 months ago

Is there an elegant way you can picture doing so while still being able to use the vLLM & pydantic integrations?

lapp0 commented 6 months ago

It's quite complex; I've been thinking about this problem for a while. I'll ping you when I post the write-up I've been working on. (Concatenation and union operations between RegexFSMs are also critical to the performance of CFGFSM, but they are relevant here as well.)

rlouf commented 6 months ago

I see a few ways we could dramatically reduce the memory that these FSMs take:

  1. Make sure that the FSM cannot be further reduced (no redundant states)
  2. Compress the largest lists of tokens leading to a state into ranges of token ids (see the sketch below).

Which solution we should prioritize is very much an empirical question at this point.
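
A minimal sketch of what (2) could look like (plain Python, hypothetical; token ids assumed to be integers): collapse the token ids that all lead to the same next state into contiguous `(start, end)` ranges.

```python
def compress_to_ranges(token_ids):
    """Turn e.g. [3, 4, 5, 9, 10] into [(3, 5), (9, 10)]."""
    ranges = []
    for tid in sorted(token_ids):
        if ranges and tid <= ranges[-1][1] + 1:
            # Extend the current run (also absorbs duplicate ids).
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], tid))
        else:
            # Start a new run.
            ranges.append((tid, tid))
    return ranges
```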

fmmoret commented 6 months ago

Is there something tangential out there we could use instead of FSMs?

This guy is claiming 10,000-100,000 times faster JSON constraining. https://twitter.com/jrysana/status/1765687350363906278

rlouf commented 6 months ago

> Is there something tangential out there we could use instead of FSMs?

There are a few things to try before completely replacing the DFA logic. For instance, how many of those transitions correspond to "allow all tokens and go to the same state"? If there are many, we could treat this as a special case, which would decrease the memory requirements dramatically. We could also store ranges instead of individual token ids.
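
A hypothetical sketch of that special case (not the actual Outlines data structures): replace any per-state token set that covers the whole vocabulary with a cheap sentinel instead of materialising the full set.

```python
ALL_TOKENS = object()  # sentinel: "every token in the vocabulary is allowed"

def compact_allowed_tokens(state_to_tokens, vocab_size):
    """Store a sentinel for 'accept anything' states instead of the full token set."""
    compact = {}
    for state, tokens in state_to_tokens.items():
        compact[state] = ALL_TOKENS if len(tokens) == vocab_size else frozenset(tokens)
    return compact
```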

> This guy is claiming 10,000-100,000 times faster JSON constraining. https://twitter.com/jrysana/status/1765687350363906278

We have yet to see the code. In any case, that's compilation time, and a few things can be done to improve it in Outlines; we just haven't had the time to get around to them.