Closed Alexey-T closed 2 years ago
@Alexey-T I think GitLab has recently switched from rouge
to highlight.js
. See this GitLab issue for more information.
Take a close look at the issue, not even feature flagged yet (according to the checklist)... so I'd say this ball is still in Rouge's court at the present.
Thanks for looking into @joshgoebel. I have verified the status of the feature flag via our internal tooling and can confirm it is enabled globally on gitlab.com
. It is most likely that the checklist on the issue was not updated accordingly.
In my local development environment, I have experimented with the feature flag toggle on (Highlight) and off (Rouge) (see screenshot).
Rouge | Highlight |
---|---|
Looking at the source code rendered with Highlight JS, I can see that GitLab has picked up the wrong language, puppet
instead of pascal
🤔. The syntax highlight with Highlight JS looks fine outside the context of the GitLab editor.
I think this is an issue within GitLab, or at least on how it maps the file extension with the language syntax. I would raise an issue in GitLab to investigate further.
Updated: The highlight_js
is currently enabled globally on gitlab.com
. The checklist at the bottom of the issue is only executed once we are confident with the rollout and remove the old code and feature flag altogether.
Pascal extension is .pas, Puppet is .pp, so for file 'buttons.pp' detection is wrong.
Pascal extension is .pas, Puppet is .pp, so for file 'buttons.pp' detection is wrong.
For Free Pascal .pp
is a totally valid and sometimes used extension (as we see here). This is one problem with extensions as solo auto-detect heuristics, they are not high enough fidelity on their own as there is often overlap. No language truly owns any extension.
Looking at the source code rendered with Highlight JS, I can see that GitLab has picked up the wrong language, puppet instead of pascal
Thanks for digging deeper here. This would have been my next guess had I known for sure that Github HAD completed the transition as it's a common issue - people using the wrong grammar and then getting the wrong results.
This is probably a case where what GitLab should do is use extension as a limiter and then use our auto-detection abilities to make a final decision... IE:
pp
might mean : Puppet, Pascal or Pink pantherspuppet
, pascal
or pinkpanthers
...Our auto-detect often works very well in this specific type of situation with a tiny subset of "high probability" languages to decide between.
But as the issue isn't a rouge issue I think perhaps this could now be closed and the discussion moved to Highlight.js GitHub issue or a GitLab issue?
@joshgoebel Thank you for the collaboration on this issue 🙏🏼.
This is one problem with extensions as solo auto-detect heuristics, they are not high enough fidelity on their own as there is often overlap.
Totally 👍🏼 Rouge
uses a number of factors besides filename to apply the appropriate lexer. This implementation still fails in this case 🤔.
./bin/rougify guess buttons.pp
{ tag: "puppet", title: "Puppet", desc: "The Puppet configuration management language (puppetlabs.org)" }
Highlight.js on the other hand pick up the correct language.
node
> const hljs = require('highlight.js');
> const fs = require('fs');
> fs.readFile('buttons.pp', 'utf8', (err, data) => { console.log(hljs.highlightAuto(data).language) })
undefined
> delphi
But as the issue isn't a rouge issue I think perhaps this could now be closed and the discussion moved to Highlight.js GitHub issue or a GitLab issue?
Looking closer at the GitLab implementation, I think the problem is due to Rouge
and how it is wired in GitLab. It appears that GitLab uses Rouge
to detect the language and passes that information to Highlight.js
. In this particular case, Rouge
detects puppet
and the frontend does not perform further language detection. So it is back in Rouge
court I guess. I will raise an additional issue on GitLab to investigate a proper integration.
I will raise an additional issue on GitLab to investigate a proper integration.
Please link that here... I'm curious what you think a "proper integration" might look like... :-)
It seems the problem here is that Rouge isn't aware that pp
can ALSO mean Pascal...? Or is it and it just decided this was not Pascal?
Please link that here... I'm curious what you think a "proper integration" might look like... :-)
Yep, will do.
It seems the problem here is that Rouge isn't aware that
pp
can ALSO mean Pascal...? Or is it and it just decided this was not Pascal?
Of all the factors that Rouge
checks, they all return Puppet
lexer 😢 I am curious about how the auto-detection is done in Hightlight.js
🤔
Of all the factors that Rouge checks,
What are "all the factors"? It seems like your filename database is simply lacking the connection - or else you're limited in the fact that you're only allowed to return a single canonical answer instead of "it might be A or B"...?
I am curious about how the auto-detection is done in Hightlight.js
Very roughly: The parser matches "tokens", things that look like a language... comments, strings, keywords, naming conventions, operators, syntax... all these things have specific relevance values... for auto-detect we run EVERY grammar against a snippet... and the grammar that has the most relevance wins.
It's not perfect.
A better approach would be to use outside heuristics (stack overflow tags, gitlab file extensions, etc) to 'hone in' on a few possibilities, and then let us decide which of those limited choices is correct. We perform much better in that type of scenario, ie... "is this Pascal or Puppet"... vs "it could be 1 of 200, good luck"
What are "all the factors"?
Rouge
does mimetype, filename, modeline, shebang and finally some manual disambiguation steps.
Where is the code for shebangs?
Where is the code for shebangs?
In this file, https://github.com/rouge-ruby/rouge/blob/master/lib/rouge/text_analyzer.rb
@Alexey-T FYI, we have bumped the version of Rouge in GitLab to fix the language detection for Free Pascal. Could you kindly check if the highlight works correctly this time?
It's OK now, thanks!
Name of the lexer Pascal
Code sample Lazarus file: https://gitlab.com/freepascal.org/lazarus/lazarus/-/blob/main/lcl/buttons.pp I see how it's rendered on site, is it using Rouge lib? It's wrong.
// text
and{ text }
are not pale?Additional context none.