Cache repeated string instances in the lexer (.NET 9)

alexrp commented 1 year ago

When lexing a typical source file, there are going to be many repeated strings: identifiers, literals, whitespace, and so on. We can't intern these, but it would make sense to cache token strings up to a certain length and return the same instance instead of building them up repeatedly.

To implement this, instead of building up the token string in a StringBuilder, we would keep track of where the token starts and ends. When creating the token, if the length is below our caching threshold, we first look it up in the token cache. For larger tokens, we shouldn't bother as the lookup will take too long to be worth it.
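
A minimal sketch of the lookup described above, not the project's actual code. `TokenTextCache`, `GetOrAdd`, and the threshold of 32 are hypothetical names and values chosen for illustration. It assumes .NET 9 and C# 13, where `HashSet<string>.GetAlternateLookup<ReadOnlySpan<char>>()` allows probing the set with a span so that no string is allocated on a cache hit.

```csharp
using System;
using System.Collections.Generic;

// Usage: two slices with the same text resolve to one string instance.
var cache = new TokenTextCache();
ReadOnlySpan<char> source = "foo = foo + bar";

string first = cache.GetOrAdd(source.Slice(0, 3));
string second = cache.GetOrAdd(source.Slice(6, 3));

Console.WriteLine(ReferenceEquals(first, second)); // True

// Hypothetical helper; name and shape are assumptions for illustration.
public sealed class TokenTextCache
{
    // Assumed cutoff; the issue does not specify a value.
    private const int Threshold = 32;

    private readonly HashSet<string> _strings = new(StringComparer.Ordinal);

    public string GetOrAdd(ReadOnlySpan<char> text)
    {
        // Tokens at or above the threshold bypass the cache entirely.
        if (text.Length >= Threshold)
            return text.ToString();

        // Probe the set with the span itself; no string is created on a hit.
        var lookup = _strings.GetAlternateLookup<ReadOnlySpan<char>>();

        if (lookup.TryGetValue(text, out string? cached))
            return cached;

        // Miss: allocate the string once and remember it for later tokens.
        string created = text.ToString();
        _strings.Add(created);

        return created;
    }
}
```

The lexer would pass the slice between the recorded token start and end rather than appending characters to a StringBuilder. A plain `HashSet<string>` is not thread-safe, so this sketch assumes one cache per lexer instance.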

alexrp commented 1 year ago

Along with this work, we should also create lexed strings through SourceText.ToString(SourceTextSpan).

alexrp commented 4 months ago

https://github.com/dotnet/runtime/issues/27229 should make this quite a bit easier to implement in .NET 9.

vezel-dev/celerity#38