lezer-parser / lezer

Dev utils and issues for the Lezer core packages

Large grammar causes error "Tokenizer tables too big to represent with 16-bit offsets" #33

Closed: nedgrady closed this issue 1 year ago

nedgrady commented 1 year ago

I'm trying to write a grammar that looks like the following:

@top Program { Field ">=30" }
Field { "ALC_PCT" | "NUC_WEAP" | "FOS_FUEL" }

n.b. The big-picture goal is to allow users to create a well-formed list of predicates with the syntax { Field Operator Value }. The Operator/Value rules are working perfectly.

The error occurs because there are over 3,000 options for the fields. Here's a code sample that reproduces the error:

import {buildParser} from "@lezer/generator"

const options = []
for (let i = 0; i < 3000; i++) {
    // add a random quoted string as another Field alternative
    options.push('"' + Math.random().toString(36).slice(2, 7) + '"')
}

const parser = buildParser(`
@top Program { Field ">=30" }
Field { ${options.join("|")} }
`)

Is it possible to get around this error in any way, or does it require a change to the Lezer library itself?

Any help/advice very much appreciated. Thanks for the awesome library!

marijnh commented 1 year ago

> Is it possible to get around this error in any way, or does it require a change to the Lezer library itself?

Possibly you can set things up to not generate different tokens for each string if those strings play the same syntactic role (maybe using an external specializer).

Extending the size of term IDs beyond 16 bits would greatly blow up the size of grammars and the memory footprint of parses, and it's not something that's going to be changed in Lezer.
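
For illustration (not from the thread), a minimal sketch of the first idea: match every field name with one generic token instead of roughly 3,000 separate literals, so the tokenizer tables stay small. The character set below is an assumption about what the field names look like.

import {buildParser} from "@lezer/generator"

const parser = buildParser(`
@top Program { Field ">=30" }

@tokens {
  // One token for all field names; adjust the character set as needed.
  Field { $[A-Z_]+ }
}
`)

// Any identifier-shaped field now parses; whether it names a *known*
// field can be checked outside the grammar.
console.log(parser.parse("ALC_PCT>=30").toString())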

nedgrady commented 1 year ago

> Is it possible to get around this error in any way, or does it require a change to the Lezer library itself?
>
> Possibly you can set things up to not generate different tokens for each string if those strings play the same syntactic role (maybe using an external specializer).
>
> Extending the size of term IDs beyond 16 bits would greatly blow up the size of grammars and the memory footprint of parses, and it's not something that's going to be changed in Lezer.

Thanks for the quick reply - yeah all of those strings always have the same syntactic role. Is this the part of the docs I'm looking for? https://lezer.codemirror.net/docs/guide/#token-specialization

marijnh commented 1 year ago

Yes, that and the @external specialize syntax in the section after it (or @external token, if you want to match the token's extent in a script).
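
For future readers, a rough sketch of what the @external specialize route could look like. The file names, the identifier token, and the specializeField function are illustrative assumptions, not code from this thread, and the wiring differs slightly when building with buildParser instead of the lezer-generator CLI.

// field.grammar (illustrative)
@top Program { Field ">=30" }

@external specialize {identifier} specializeField from "./fields" { Field }

@tokens {
  identifier { $[A-Z_]+ }
}

// fields.js (illustrative): turn an identifier token that names a known
// field into the Field term; returning -1 means "no specialization", so
// unknown names fail to parse as a Field.
import {Field} from "./parser.terms.js"

const knownFields = new Set(["ALC_PCT", "NUC_WEAP", "FOS_FUEL"])

export function specializeField(value, stack) {
  return knownFields.has(value) ? Field : -1
}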

nedgrady commented 1 year ago

Let the record show to future readers that I did not manage to pull this off; time constraints and a lack of skill/understanding of the docs have taken their toll.

The workaround is to just check the field names when the predicates are evaluated later on.
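
For illustration, a rough sketch of that workaround (the names are assumptions): the grammar accepts any identifier-shaped Field, and the known-field check happens when a predicate is evaluated.

const knownFields = new Set(["ALC_PCT", "NUC_WEAP", "FOS_FUEL"])

function evaluatePredicate({field, operator, value}, record) {
  // Validate the field name here instead of in the grammar.
  if (!knownFields.has(field)) throw new Error(`Unknown field: ${field}`)
  // ...apply the operator and value to record[field]
}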