Ha! I was wondering about this myself the other day and had it on my list of things to figure out the mechanics of! :)
Off the top of my head, I'd guess the issue is that there's important information in each lexer that Rouge uses to identify which lexer to choose. You'd still want that to be available. Maybe what's easier is lazy loading of the rules?
Ha! I was wondering about this myself
Awesome! :smiley:
that there's important information in each lexer that Rouge uses to identify which lexer to choose.
I've not familiarized myself with the inner workings of Rouge, so I may be way off-track here.
I was thinking more along the lines of using Rouge::Lexer.find_fancy(language, content)
to determine what lexer to load based on the given language
..
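(For reference, find_fancy already resolves a lexer from a tag or alias, and can guess from the source itself; here is a minimal sketch of how it behaves today, based on the current public API:)

```ruby
require 'rouge'

source = "puts 'hello'"

# Resolve a lexer by tag or alias; find_fancy returns a lexer instance (or nil if unknown).
lexer = Rouge::Lexer.find_fancy('rb', source)

# Passing 'guess' asks Rouge to sniff the source instead; PlainText is a safe fallback.
lexer ||= Rouge::Lexer.find_fancy('guess', source) || Rouge::Lexers::PlainText.new

puts lexer.class # => Rouge::Lexers::Ruby
```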
@ashmaroli wrote:
I was thinking more along the lines of using
Rouge::Lexer.find_fancy(language, content)
to determine what lexer to load based on the given language
..
Right, but language
is 'registered' by each lexer (I think). So if you're going to find the lexer with the matching name, you necessarily need to have the objects loaded in memory to check. I mean you could load each in as you check it but:
1. you'll likely load a lot of lexers before you find the one you're looking for; and
2. the loading of each lexer will still cause all the rules to be created and I'd assume that's where a lot of the memory usage comes from.
See here:
all the rules to be created
If that's the case, we could have the most commonly occurring regexps stashed in named constants under the RegexLexer.. :thinking:
@ashmaroli wrote:
If that's the case, we could have the most commonly occurring regexps stashed in named constants under the RegexLexer.. 🤔
The Rouge::RegexLexer::Rule
objects consist of a regular expression but also a block. Common regular expressions would possibly be helpful but there'd still be a lot of object creation.
Here are some initial thoughts on possible strategies:
Strategy 3 is the simplest but as someone who typically expresses languages using their aliases (e.g. rb for Ruby), I'd probably prefer Strategy 1.
I fail to see the point of using a cache registry of lexer names and aliases to selectively load a lexer since they're all unique. What is being cached? If you want to prevent lexers from being loaded multiple times, it'd make sense to use require
instead of load
but IIRC, @jneen was not open to making that switch.
But if you were suggesting a cache for holding common patterns in multiple lexers, that makes sense.
This cache would hold references to the filename of the relevant lexer for each name and alias. If Rouge was invoked with the name of the language (or its alias), Rouge could look it up in this cache and then load only that lexer.
Basically what require
does with $LOADED_FEATURES
. If invoking a lexer with :ruby
caused lib/rouge/lexers/ruby.rb
to be loaded into memory, invoking a lexer with :rb
would simply return false
... correct?
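(A quick, self-contained illustration of the require/$LOADED_FEATURES behaviour being referred to here:)

```ruby
# The first require reads the file and records its path in $LOADED_FEATURES;
# subsequent requires of the same feature are no-ops.
puts require('date')                    # => true (unless something loaded it earlier)
puts require('date')                    # => false: already recorded in $LOADED_FEATURES
puts $LOADED_FEATURES.grep(/date/).any? # => true
```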
Strategy 3 would work like that. In your example, the language specified is :ruby
and there's a file called ruby.rb
in lib/rouge/lexers/
so that'd match and Rouge would load that and only that lexer.
The limitation would be if you used an alias. Since there's no file called rb.rb
if the language specified was :rb
, Rouge would have to fall back on loading all lexers so that a complete registry could be populated and searched instead.
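(A minimal sketch of what Strategy 3 might look like; the method name and layout here are hypothetical, not actual Rouge code:)

```ruby
# Hypothetical sketch: load a single lexer when the language name matches a file
# under lib/rouge/lexers/, otherwise fall back to loading every lexer.
module Rouge
  def self.load_lexer_for(language)
    lexer_dir = File.join(__dir__, 'rouge', 'lexers')
    candidate = File.join(lexer_dir, "#{language}.rb")

    if File.file?(candidate)
      load candidate                                         # e.g. :ruby -> ruby.rb
    else
      Dir.glob(File.join(lexer_dir, '*.rb')) { |f| load f }  # e.g. :rb -> load everything
    end
  end
end
```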
Rouge would have to fall back on loading all lexers so that a complete registry could be populated and searched instead.
Oh! I did not realize that. Now your strategy sounds like a good first-step in the right direction. :+1:
Hmmm, after further review, I think I may have taken us down a rabbit hole. I suspect @jneen has (unsurprisingly) done her best to avoid unnecessary object creation.
If I'm understanding the way it works properly (not certain), here's how I think Rouge starts up:
loading lib/rouge.rb
causes all lexers to be loaded:
https://github.com/rouge-ruby/rouge/blob/d8cedb39ce3229140f1f808aeedc18b384711e61/lib/rouge.rb#L55
when a lexer is loaded, the Rouge::Lexer.tag
and Rouge::Lexer.aliases
class methods associate the tag/alias with a reference to the lexer in a hash called registry:
in addition, when each lexer is loaded, the Rouge::RegexLexer.state
class method is invoked for each call to state
in the lexer and this in turn creates blocks (state
is always called with a block) but the objects within the blocks are not loaded:
If this current understanding is correct, my suggestion would be less useful than it initially seemed. I now don't think the objects in the block passed to state
are evaluated at load time.
So the question should probably be how big are the lexer objects themselves? If they're not that big, I'm not sure how much gain there'd be from loading them selectively.
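(To make the above concrete, here is roughly what a lexer definition looks like; this is an illustrative toy lexer, not one from the gem:)

```ruby
require 'rouge'

# The tag/aliases calls populate the registry at load time, while the state block
# is only stored; its rules are built lazily when the lexer first runs.
class MyLang < Rouge::RegexLexer
  tag 'mylang'   # registry entry 'mylang' => MyLang
  aliases 'ml'   # registry entry 'ml'     => MyLang

  state :root do          # the block is saved here...
    rule %r/\s+/, Text
    rule %r/\S+/, Name
  end                     # ...and only evaluated the first time the lexer is used
end

puts Rouge::Lexer.find('ml') # => MyLang
```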
I use a gem called memory_profiler
to gauge usage locally. Perhaps the following script will help you as well:
```ruby
# frozen_string_literal: true

require 'bundler/setup' # My Gemfile is pointing to the latest rouge:master
require 'memory_profiler'

# Profile the allocations made just by requiring the gem, then write the report to a file.
report = MemoryProfiler.report { require 'rouge' }
report.pretty_print(to_file: 'rouge-memory.tmp', scale_bytes: true)
```
That's helpful! My initial test showed about 10MB of memory usage in the Rouge gem just from loading. That seems like maybe a lot but maybe not. Were you seeing the same kind of profile, @ashmaroli?
Were you seeing the same kind of profile
Yes. Most of those allocations and retained objects are due to loading all the lexers into memory.
For example, the string reports will show that "root"
has been allocated and retained x
times, where x
equals the current number of lexers available.
That part is interesting and may be a bug - :root
should be a symbol, which should be shared memory...
(also iirc Ruby does some memory-sharing magic with strings under a certain length, about 20-something)
@jneen wrote:
That part is interesting and may be a bug -
:root
should be a symbol, which should be shared memory...
Do you think it's because of this?
I saw it as I was writing up the explanation yesterday and my initial thought was that this is going to cause every key to be turned into a string that's then stored in memory.
Welp, that might be the culprit there. I think I was originally trying to avoid symbol/string ambiguity, but it's definitely not worth the memory bloat for internal use.
I wondered if maybe to_sym
might be a better alternative in that case? Or do you think it's not worth worrying about at all, really?
Since some people may have written their own lexers that aren't part of rouge, I think to_sym
is a good choice for compatibility's sake. And it's free if it's already a symbol. The only danger with to_sym
is passing it user data, since symbols are never gc'd, but that's not a problem here.
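(A small, self-contained illustration of the to_s vs to_sym difference being discussed; this is my own example, not code from Rouge:)

```ruby
key = :root

# to_s builds a fresh String on every call, so string keys mean one copy per lexer state table.
a = key.to_s
b = key.to_s
puts a.equal?(b)                 # => false: two distinct String objects

# to_sym is a no-op on a Symbol and interns a String, so the same object is shared everywhere.
puts key.to_sym.equal?(:root)    # => true
puts 'root'.to_sym.equal?(:root) # => true
```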
Hmmm, I suppose if we did use symbols as keys, there is one area where user data could leak in. When a user defines a code block in Markdown, they can (depending on the flavour) specify the language and that would then need to be turned into a symbol to check against the registry
hash.
The immediate solution that comes to mind is to truncate this language string before turning it into a symbol. That could be a frustrating bug to track down if you have a long language name, so it should probably be documented if that is the approach taken.
That's true for lexer tags, but state keys should be fully invisible to the user.
B'oh—right you are! Was a bit too sleepy this morning :)
This issue has been automatically marked as stale because it has not had any activity for more than a year. It will be closed if no additional activity occurs within the next 14 days. If you would like this issue to remain open, please reply and let us know if the issue is still reproducible.
As of Rouge-3.3.0, all lexers are loaded into memory via require 'rouge'. Now that more and more lexers with numerous rules are being introduced, Rouge is going to bloat the memory unnecessarily. The chances of a consumer using a major percentage of the bundled lexers are very low. Therefore, I would like to propose that lexers be loaded into memory only as necessary, similar to how Ruby's Module#autoload functions. What say @pyrmont, @dblessing, @gfx, @jneen?