Ha! I was wondering about this myself the other day and had it on my list of things to figure out the mechanics of! :)
Off the top of my head, I'd guess the issue is that there's important information in each lexer that Rouge uses to identify which lexer to choose. You'd still want that to be available. Maybe what's easier is lazy loading of the rules?
Ha! I was wondering about this myself
Awesome! :smiley:
that there's important information in each lexer that Rouge uses to identify which lexer to choose.
I've not familiarized myself with the inner workings of Rouge, so I may be way off-track here.
I was thinking more along the lines of using Rouge::Lexer.find_fancy(language, content)
to determine what lexer to load based on the given language
..
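(For reference, find_fancy already resolves a lexer from a tag or alias, and can guess from the source itself; here is a minimal sketch of how it behaves today, based on the current public API:)

```ruby
require 'rouge'

source = "puts 'hello'"

# Resolve a lexer by tag or alias; find_fancy returns a lexer instance (or nil if unknown).
lexer = Rouge::Lexer.find_fancy('rb', source)

# Passing 'guess' asks Rouge to sniff the source instead; PlainText is a safe fallback.
lexer ||= Rouge::Lexer.find_fancy('guess', source) || Rouge::Lexers::PlainText.new

puts lexer.class # => Rouge::Lexers::Ruby
```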
@ashmaroli wrote:
I was thinking more along the lines of using
Rouge::Lexer.find_fancy(language, content)
to determine what lexer to load based on the given language
..
Right, but language
is 'registered' by each lexer (I think). So if you're going to find the lexer with the matching name, you necessarily need to have the objects loaded in memory to check. I mean you could load each in as you check it but:
1. you'll likely load a lot of lexers before you find the one you're looking for; and
2. the loading of each lexer will still cause all the rules to be created and I'd assume that's where a lot of the memory usage comes from.
See here:
all the rules to be created
If that's the case, we could have the most commonly occurring regexps stashed in named constants under the RegexLexer.. :thinking:
@ashmaroli wrote:
If that's the case, we could have the most commonly occurring regexps stashed in named constants under the RegexLexer.. 🤔
The Rouge::RegexLexer::Rule
objects consist of a regular expression but also a block. Common regular expressions would possibly be helpful but there'd still be a lot of object creation.
Here are some initial thoughts on possible strategies:
Strategy 3 is the simplest but as someone who typically expresses languages using their aliases (e.g. rb for Ruby), I'd probably prefer Strategy 1.
I fail to see the point of using a cache registry of lexer names and aliases to selectively load a lexer since they're all unique. What is being cached? If you want to prevent lexers from being loaded multiple times, it'd make sense to use require
instead of load
but IIRC, @jneen was not open to making that switch.
But if you were suggesting a cache for holding common patterns in multiple lexers, that makes sense.
This cache would hold references to the filename of the relevant lexer for each name and alias. If Rouge was invoked with the name of the language (or its alias), Rouge could look it up in this cache and then load only that lexer.
Basically what require
does with $LOADED_FEATURES
. If invoking a lexer with :ruby
caused lib/rouge/lexers/ruby.rb
to be loaded into memory, invoking a lexer with :rb
would simply return false
... correct?
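(A quick, self-contained illustration of the require/$LOADED_FEATURES behaviour being referred to here:)

```ruby
# The first require reads the file and records its path in $LOADED_FEATURES;
# subsequent requires of the same feature are no-ops.
puts require('date')                    # => true (unless something loaded it earlier)
puts require('date')                    # => false: already recorded in $LOADED_FEATURES
puts $LOADED_FEATURES.grep(/date/).any? # => true
```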
Strategy 3 would work like that. In your example, the language specified is :ruby
and there's a file called ruby.rb
in lib/rouge/lexers/
so that'd match and Rouge would load that and only that lexer.
The limitation would be if you used an alias. Since there's no file called rb.rb
if the language specified was :rb
, Rouge would have to fall back on loading all lexers so that a complete registry could be populated and searched instead.
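(A minimal sketch of what Strategy 3 might look like; the method name and layout here are hypothetical, not actual Rouge code:)

```ruby
# Hypothetical sketch: load a single lexer when the language name matches a file
# under lib/rouge/lexers/, otherwise fall back to loading every lexer.
module Rouge
  def self.load_lexer_for(language)
    lexer_dir = File.join(__dir__, 'rouge', 'lexers')
    candidate = File.join(lexer_dir, "#{language}.rb")

    if File.file?(candidate)
      load candidate                                         # e.g. :ruby -> ruby.rb
    else
      Dir.glob(File.join(lexer_dir, '*.rb')) { |f| load f }  # e.g. :rb -> load everything
    end
  end
end
```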
Rouge would have to fall back on loading all lexers so that a complete registry could be populated and searched instead.
Oh! I did not realize that. Now your strategy sounds like a good first-step in the right direction. :+1:
Hmmm, after further review, I think I may have taken us down a rabbit hole. I suspect @jneen has (unsurprisingly) done her best to avoid unnecessary object creation.
If I'm understanding the way it works properly (not certain), here's how I think Rouge starts up:
loading lib/rouge.rb
causes all lexers to be loaded:
https://github.com/rouge-ruby/rouge/blob/d8cedb39ce3229140f1f808aeedc18b384711e61/lib/rouge.rb#L55
when a lexer is loaded, the Rouge::Lexer.tag
and Rouge::Lexer.aliases
class methods associate the tag/alias with a reference to the lexer in a hash called registry:
in addition, when each lexer is loaded, the Rouge::RegexLexer.state
class method is invoked for each call to state
in the lexer and this in turn creates blocks (state
is always called with a block) but the objects within the blocks are not loaded:
If this current understanding is correct, my suggestion would be less useful than it initially seemed. I now don't think the objects in the block passed to state
are evaluated at load time.
So the question should probably be how big are the lexer objects themselves? If they're not that big, I'm not sure how much gain there'd be from loading them selectively.
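(To make the above concrete, here is roughly what a lexer definition looks like; this is an illustrative toy lexer, not one from the gem:)

```ruby
require 'rouge'

# The tag/aliases calls populate the registry at load time, while the state block
# is only stored; its rules are built lazily when the lexer first runs.
class MyLang < Rouge::RegexLexer
  tag 'mylang'   # registry entry 'mylang' => MyLang
  aliases 'ml'   # registry entry 'ml'     => MyLang

  state :root do          # the block is saved here...
    rule %r/\s+/, Text
    rule %r/\S+/, Name
  end                     # ...and only evaluated the first time the lexer is used
end

puts Rouge::Lexer.find('ml') # => MyLang
```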
I use a gem called memory_profiler
to gauge usage locally. Perhaps the following script will help you as well:
```ruby
# frozen_string_literal: true

require 'bundler/setup' # My Gemfile is pointing to the latest rouge:master
require 'memory_profiler'

# Profile the allocations made just by requiring the gem, then write the report to a file.
report = MemoryProfiler.report { require 'rouge' }
report.pretty_print(to_file: 'rouge-memory.tmp', scale_bytes: true)
```
That's helpful! My initial test showed about 10MB of memory usage in the Rouge gem just from loading. That seems like maybe a lot but maybe not. Were you seeing the same kind of profile, @ashmaroli?
Were you seeing the same kind of profile
Yes. Most of those allocations and retained objects are due to loading all the lexers into memory.
For example, the string reports will show that "root"
has been allocated and retained x
times, where x
equals the current number of lexers available.
That part is interesting and may be a bug - :root
should be a symbol, which should be shared memory...
(also iirc Ruby does some memory-sharing magic with strings under a certain length, about 20-something)
@jneen wrote:
That part is interesting and may be a bug -
:root
should be a symbol, which should be shared memory...
Do you think it's because of this?
I saw it as I was writing up the explanation yesterday and my initial thought was that this is going to cause every key to be turned into a string that's then stored in memory.
Welp, that might be the culprit there. I think I was originally trying to avoid symbol/string ambiguity, but it's definitely not worth the memory bloat for internal use.
I wondered if maybe to_sym
might be a better alternative in that case? Or do you think it's not worth worrying about at all, really?
Since some people may have written their own lexers that aren't part of rouge, I think to_sym
is a good choice for compatibility's sake. And it's free if it's already a symbol. The only danger with to_sym
is passing it user data, since symbols are never gc'd, but that's not a problem here.
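(A small, self-contained illustration of the to_s vs to_sym difference being discussed; this is my own example, not code from Rouge:)

```ruby
key = :root

# to_s builds a fresh String on every call, so string keys mean one copy per lexer state table.
a = key.to_s
b = key.to_s
puts a.equal?(b)                 # => false: two distinct String objects

# to_sym is a no-op on a Symbol and interns a String, so the same object is shared everywhere.
puts key.to_sym.equal?(:root)    # => true
puts 'root'.to_sym.equal?(:root) # => true
```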
Hmmm, I suppose if we did use symbols as keys, there is one area where user data could leak in. When a user defines a code block in Markdown, they can (depending on the flavour) specify the language and that would then need to be turned into a symbol to check against the registry
hash.
The immediate solution that comes to mind is to truncate this language string before turning it into a symbol. That could be a frustrating bug to track down if you have a long language name, so it should probably be documented if that is the approach taken.
That's true for lexer tags, but state keys should be fully invisible to the user.
B'oh—right you are! Was a bit too sleepy this morning :)
This issue has been automatically marked as stale because it has not had any activity for more than a year. It will be closed if no additional activity occurs within the next 14 days. If you would like this issue to remain open, please reply and let us know if the issue is still reproducible.
As of Rouge-3.3.0, all lexers are loaded into memory via require 'rouge'. Now that more and more lexers with numerous rules are being introduced, Rouge is going to bloat the memory unnecessarily. The chances of a consumer using a major percentage of the bundled lexers are very low. Therefore, I would like to propose that lexers be loaded into memory only as necessary, similar to how Ruby's Module#autoload functions. What say @pyrmont, @dblessing, @gfx, @jneen?