rouge-ruby / rouge

A pure Ruby code highlighter that is compatible with Pygments
https://rouge.jneen.net/
Other
3.33k stars 735 forks source link

Pascal lexer: strange highlight of buttons.pp #1842

Closed Alexey-T closed 2 years ago

Alexey-T commented 2 years ago

Name of the lexer Pascal

Code sample Lazarus file: https://gitlab.com/freepascal.org/lazarus/lazarus/-/blob/main/lcl/buttons.pp I see how it's rendered on site, is it using Rouge lib? It's wrong.

  1. why "types" id is hilited?
  2. why "type" keyword is not hilited?
  3. why "function"/"procedure"/"constructor"/"destructor" keywords are not hilited?
  4. why comments // text and { text } are not pale?

Additional context none.

tancnle commented 2 years ago

@Alexey-T I think GitLab has recently switched from rouge to highlight.js. See this GitLab issue for more information.

joshgoebel commented 2 years ago

Take a close look at the issue, not even feature flagged yet (according to the checklist)... so I'd say this ball is still in Rouge's court at the present.

tancnle commented 2 years ago

Thanks for looking into @joshgoebel. I have verified the status of the feature flag via our internal tooling and can confirm it is enabled globally on gitlab.com. It is most likely that the checklist on the issue was not updated accordingly.

In my local development environment, I have experimented with the feature flag toggle on (Highlight) and off (Rouge) (see screenshot).

Rouge Highlight
rouge highlight

Looking at the source code rendered with Highlight JS, I can see that GitLab has picked up the wrong language, puppet instead of pascal 🤔. The syntax highlight with Highlight JS looks fine outside the context of the GitLab editor.

puppet

I think this is an issue within GitLab, or at least on how it maps the file extension with the language syntax. I would raise an issue in GitLab to investigate further.

Updated: The highlight_js is currently enabled globally on gitlab.com. The checklist at the bottom of the issue is only executed once we are confident with the rollout and remove the old code and feature flag altogether.

Alexey-T commented 2 years ago

Pascal extension is .pas, Puppet is .pp, so for file 'buttons.pp' detection is wrong.

joshgoebel commented 2 years ago

Pascal extension is .pas, Puppet is .pp, so for file 'buttons.pp' detection is wrong.

For Free Pascal .pp is a totally valid and sometimes used extension (as we see here). This is one problem with extensions as solo auto-detect heuristics, they are not high enough fidelity on their own as there is often overlap. No language truly owns any extension.

Looking at the source code rendered with Highlight JS, I can see that GitLab has picked up the wrong language, puppet instead of pascal

Thanks for digging deeper here. This would have been my next guess had I known for sure that Github HAD completed the transition as it's a common issue - people using the wrong grammar and then getting the wrong results.


This is probably a case where what GitLab should do is use extension as a limiter and then use our auto-detection abilities to make a final decision... IE:

Our auto-detect often works very well in this specific type of situation with a tiny subset of "high probability" languages to decide between.

joshgoebel commented 2 years ago

But as the issue isn't a rouge issue I think perhaps this could now be closed and the discussion moved to Highlight.js GitHub issue or a GitLab issue?

tancnle commented 2 years ago

@joshgoebel Thank you for the collaboration on this issue 🙏🏼.

This is one problem with extensions as solo auto-detect heuristics, they are not high enough fidelity on their own as there is often overlap.

Totally 👍🏼 Rouge uses a number of factors besides filename to apply the appropriate lexer. This implementation still fails in this case 🤔.

./bin/rougify guess buttons.pp
{ tag: "puppet", title: "Puppet", desc: "The Puppet configuration management language (puppetlabs.org)" }

Highlight.js on the other hand pick up the correct language.

node
> const hljs = require('highlight.js');
> const fs = require('fs');
> fs.readFile('buttons.pp', 'utf8', (err, data) => { console.log(hljs.highlightAuto(data).language) })
undefined
> delphi

But as the issue isn't a rouge issue I think perhaps this could now be closed and the discussion moved to Highlight.js GitHub issue or a GitLab issue?

Looking closer at the GitLab implementation, I think the problem is due to Rouge and how it is wired in GitLab. It appears that GitLab uses Rouge to detect the language and passes that information to Highlight.js. In this particular case, Rouge detects puppet and the frontend does not perform further language detection. So it is back in Rouge court I guess. I will raise an additional issue on GitLab to investigate a proper integration.

joshgoebel commented 2 years ago

I will raise an additional issue on GitLab to investigate a proper integration.

Please link that here... I'm curious what you think a "proper integration" might look like... :-)

It seems the problem here is that Rouge isn't aware that pp can ALSO mean Pascal...? Or is it and it just decided this was not Pascal?

tancnle commented 2 years ago

Please link that here... I'm curious what you think a "proper integration" might look like... :-)

Yep, will do.

It seems the problem here is that Rouge isn't aware that pp can ALSO mean Pascal...? Or is it and it just decided this was not Pascal?

Of all the factors that Rouge checks, they all return Puppet lexer 😢 I am curious about how the auto-detection is done in Hightlight.js 🤔

joshgoebel commented 2 years ago

Of all the factors that Rouge checks,

What are "all the factors"? It seems like your filename database is simply lacking the connection - or else you're limited in the fact that you're only allowed to return a single canonical answer instead of "it might be A or B"...?

I am curious about how the auto-detection is done in Hightlight.js

Very roughly: The parser matches "tokens", things that look like a language... comments, strings, keywords, naming conventions, operators, syntax... all these things have specific relevance values... for auto-detect we run EVERY grammar against a snippet... and the grammar that has the most relevance wins.

It's not perfect.

joshgoebel commented 2 years ago

A better approach would be to use outside heuristics (stack overflow tags, gitlab file extensions, etc) to 'hone in' on a few possibilities, and then let us decide which of those limited choices is correct. We perform much better in that type of scenario, ie... "is this Pascal or Puppet"... vs "it could be 1 of 200, good luck"

tancnle commented 2 years ago

What are "all the factors"?

Rouge does mimetype, filename, modeline, shebang and finally some manual disambiguation steps.

joshgoebel commented 2 years ago

Where is the code for shebangs?

tancnle commented 2 years ago

Where is the code for shebangs?

In this file, https://github.com/rouge-ruby/rouge/blob/master/lib/rouge/text_analyzer.rb

tancnle commented 2 years ago

@Alexey-T FYI, we have bumped the version of Rouge in GitLab to fix the language detection for Free Pascal. Could you kindly check if the highlight works correctly this time?

Alexey-T commented 2 years ago

It's OK now, thanks!