highlightjs / highlight.js

JavaScript syntax highlighter with language auto-detection and zero dependencies.
https://highlightjs.org/
BSD 3-Clause "New" or "Revised" License
23.59k stars 3.58k forks source link

(cpp) Support __ keywords #2173

Open joshgoebel opened 4 years ago

joshgoebel commented 4 years ago

Stems from PR: https://github.com/highlightjs/highlight.js/pull/2020

See the related discussion there.

I'd prefer a separate mode, something like

So someone has to repeat that within every scope in the syntax file, yes? So if KEYWORDS is used 10 times, then they'd need to add that rule to all 50 place also, correct?

It'd be really nice to find a solution to this without a separate mode. This needs a little more though IMHO to see if that's possible before just breaking down and doing that.

joshgoebel commented 4 years ago
// _Complex, __always_inline, etc.
var UNDERSCORE_KEYWORDS_RE = ["_[A-Z][a-z\\d_]+", "__[a-z\\d_]+"]

var CPP_KEYWORDS = {
   keyword: 'int float while private char catch import module export virtual operator sizeof ' +
   ...
}

hljs.addKeywords(CPP_KEYWORDS["keyword"], UNDERSCORE_KEYWORDS_RE)
// or 
hljs.addKeyword(CPP_KEYWORDS["keyword"], UNDESCORE_KEYWORD_RE, relevance: 1)
hljs.addKeyword(CPP_KEYWORDS["keyword"], DOUBLE_UNDESCORE_KEYWORD_RE, relevance: 2)

I think I'd prefer something like the above (a mini API to let someone add keywords to the actual keywords). The would solve the problem of "not to depend on how keyword flattener works"... since we'd be free to change the keyword string format in the future (but how likely is that really?) just so long as we still supported addKeyword... right now all addKeyword would really do is just add the regex and a space... and it could blow up if (or do nothing) if someone included a space in their regex...

The problem with saying "just use another matcher" is (other than having to change the file perfectly in like 10 places, and then worry about TESTING that those 10 changes all work properly) that matchers actually have very different behavior from keywords... keywords are always the last to match, match the "gaps", etc...

I see lots of advantages to just piggybacking off the existing keyword system.

Thoughts?

egor-rogov commented 4 years ago

Something like that would be OK, but I think it's a bit complicated than it looks.

all addKeyword would really do is just add the regex and a space

So what about spaces in regex? And how about set some relevance to every keyword introduced by UNDERSCORE_KEYWORDS_RE? For me it seems that such API should somehow add keywords directly into compiled_keywords structure (bypassing flattener).

joshgoebel commented 4 years ago

So what about spaces in regex?

I addressed that:

and it could blow up if (or do nothing) if someone included a space in their regex...

It could even blow up, which would simply fail to pass tests... etc...

And how about set some relevance to every keyword introduced by UNDERSCORE_KEYWORDS_RE?

Pretty easy to do that with a for loop? These special cases should be one offs, this shouldn't be something needed everyday or very often. Though I'd be fine with the API taking an array as well...

hljs.addKeyword(CPP_KEYWORDS["keyword"], [A_RE, B_RE, C_RE], relevance: 2)

For me it seems that such API should somehow add keywords directly into compiled_keywords structure (bypassing flattener).

Except the compiling happen much much later in the process... this is just the raw data file...

joshgoebel commented 4 years ago

What if we just allow it to take a string or array?

var UNDERSCORE_KEYWORDS_RE = [/_[A-Z][a-z\d_]+/, /__[a-z\d_]+/]

var CPP_KEYWORDS = {
   keyword: ['int float while private char catch import module export virtual operator sizeof ' +
      '', ...UNDERSCORE_KEYWORDS_RE]
   ...
}

That's newer syntax which I'd have to walk back, but you get the idea... so:

joshgoebel commented 4 years ago

I'm worried we may be overthinking this. Let me think and dig into this a bit more. :) Think there may be a few things about how we do keywords I need to bone up on anyways.

joshgoebel commented 4 years ago

So seems keywords don't really seem to support regex anyways, so that one grammar is mistaken unless I'm reading something wrong. So I think right now the only way to do this is another constant and then repeat it over and over inside contains in each level, just as we repeat keywords. (as @egor-rogov originally suggested) :-)

Tagging this beginner friendly as this should just be a matter of getting it done and writing some tests.

joshgoebel commented 4 years ago

Now that we have some precedent what about:

keywords: {
  $pattern: ...
  $modes: [
     { begin: /__[a-z\d_]+/, className: "keyword" }
  ],
  builtin: ...,
  literal: ...,
var UNDERSCORE_KEYWORDS_RE = 

As a way of attaching/grouping modes that exist for no other purpose than to match keywords, just when the keyword is a dynamic match (regex) instead of a static list? $modes would simply be copied to the end of whatever mode's contains list...

Just popped into my head and I thought I'd jot it down.

joshgoebel commented 4 years ago

Or perhaps something like dynamic_pattern would be better, no word list, if the regex matches, it's a keyword, period...

keywords: {
  $dynamic_patterns: {
     [/_{1,2}[a-z\d_]+/, "keyword"]
  ]

Or a new namespace/suffix:

keywords: {
  keyword$all: /_{1,2}[a-z\d_]+/, // className = "keyword"
  builtin$all: /__(.*)__/, // className = "builtin"
  keyword: "if when else done",
egor-rogov commented 4 years ago

+1 for the last option. Probably even

keywords: {
  $keyword: /_{1,2}[a-z\d_]+/,
  keyword: "if when else done",
joshgoebel commented 4 years ago

Well, actually the intent when we added $pattern was to reserve the whole $ namespace for keyword configuration, not random values. So in the future you could imagine:

keywords: {
  $pattern: ..., 
  $pcre_regex_lib: true,
  $ignoreCase: true,

So $keyword MAYBE could work but not if you want to be able to name the resulting matches with className... I'm actually coming around to the idea of a _re suffix. What do you think:

keywords: {
  keyword_re: /_{1,2}[a-z\d_]+/, // className = "keyword"
  builtin_re: /__(.*)__/, // className = "builtin"
  keyword: "if when else done",

_re seems unlikely to hit any real name you might want to use, and even if it did these are just internal symbols, you could rename your name slightly to say keyword_regex...

My problem is that I don't think _re screams "wildcard" or 'match all'... we could use * but then we'd have to stringily the keys:

keywords: {
  '*keyword': /_{1,2}[a-z\d_]+/, // className = "keyword"
  '*builtin': /__(.*)__/, // className = "builtin"
  keyword: "if when else done",
joshgoebel commented 4 years ago

Maybe maybe just seeing the regex next to it makes _re clear enough in its intent? Actually the one worry I have here is that people could imagine they are configuring a pattern per type of keyword - rather than a match all... we could allow that but that isn't what this is. So I kind of like the * a little better I think.

I don't think we're there yet, but I like this avenue of exploration.

@allejo Any thoughts?

egor-rogov commented 4 years ago

Maybe just seeing the regex next to it makes _re clear enough in its intent?

Yes, it's clear enough for me. I'm fine with keyword_re etc. I don't think * makes the intent clearer.

joshgoebel commented 4 years ago

I'm going to go back and review some of the grammars that make HEAVY use of modes for keywords and see if there is anything else we should keep in mind while working on this. My one thought so far is we don't have a great way to specify relevance. We could fall back on some helper functions:

import { kw } from "keyword_helpers";

keywords: {
  keyword_re: kw.matchAll(/_{1,2}[a-z\d_]+/, { relevance: 2 })

Although then I become temped to just make the input processing smarter:

import { kw } from "keyword_helpers";

keywords: {
  // __boo_hoo will be default to 1 relevance, bob 10, suzy 20, other _ identifiers 2
  keyword: [
    "bob|10 suzy|20", 
    /__boo.*__/, 
    kw.matchAll(/_{1,2}[a-z\d_]+/, { relevance: 2 })]
klmr commented 4 years ago

For what it’s worth, double-underscore names are “reserved identifiers” in C++ but they are not keywords, and don’t generally act as such (though some implementations use such identifiers as non-standard language extensions).

When reading C++ code I certainly wouldn’t expect them to be highlighted in the same way as keywords, nor even necessarily builtins (though this is certainly debatable). For one thing, the reason they are reserved is so that standard library implementors can use these names and, consequently, such implementations are peppered with them. When discussing standard library implementation code it would be extremely counter-productive to have them all displayed as keywords. For all intents and purposes they’re regular identifiers, they’re just illegal to use for “normal” users.

Special-casing them is also arbitrary, because this is only part of the actual rule regarding reserved names, which was reproduced incorrectly in the related PR. The actual rule is:

A name is reserved if

  1. it contains a double underscore anywhere
  2. it starts with a single underscore, followed by a capital letter
  3. it starts with a single underscore, in the global namespace

Rule 3 can’t be implemented with lexical analysis alone. My recommendation is to leave all these names alone.

joshgoebel commented 4 years ago

Interesting.

void static delayShort(uint16_t ms) __attribute__ ((noinline));

What is __attribute__ here, something else entirely? My editor (VS CODE) highlights such things and it seems helpful. (though it seems github does not)

Scope: meta.preprocessor.macro.cpp

joshgoebel commented 4 years ago

Rule 3 can’t be implemented with lexical analysis alone.

I'm totally open to just closing this, but also our goal has never been perfection. If we decided this was worthy of highlighting as "meta" something it wouldn't bother me that we got a few edge cases wrong... that's the cost of being a highlighter that doesn't fully understand language context.

klmr commented 4 years ago

What is __attribute__ here, something else entirely?

As far as standard C++ is concerned, nothing.

Leaving the ivory tower for a bit, it’s a widely used compiler extension. I do think highlighting this is useful (and VS Code’s classification seems OK, if not technically correct: it’s not a macro).

The issue with these constructs is that they differ between compilers (GCC and clang know __attribute__ but not __stdcall and MSVC is the other way round, for example). Some of these identifiers are common enough that highlighting them makes sense, but I’d list them explicitly rather than pattern-match against __.*, since only a handful are common. (With the caveat that GCC and clang define a large list of __builtin_.* identifiers, so that should probably be a wildcard).

Unfortunately I’m not aware of a comprehensive list of such identifiers. The closest may be this for MSVC and this for GCC (the clang list probably has lots of overlap here), and it’s not exactly helpful. Maybe we can crib them from other highlighters.

joshgoebel commented 4 years ago

The original filed issue was trying to add:

__stdcall __try __finally

since only a handful are common.

Are those three above common? Would you mind taking a stab at what you might consider the "common" list? If it's a short list I'd be happy to just add those manually and mark this done for now. :-)

Do the common ones make sense to rise to "keyword" status, or would you still consider them built-ins?

The issue with these constructs is that they differ between compilers

I'm not sure how that's super relevant? I understand your original objection if they are not technically keywords... but say we decided to highlight them otherwise (say with built_in or some other tag) what would be the harm in just hitting them ALL with a wildcard? Then whatever C++ flavor someone is highlighting the __whatever_thingy would always be highlighted - without us needing to maintain an explicit list. Seems like a win?

klmr commented 4 years ago

I’m happy to collect these lists and submit a PR but I probably won’t have time before the weekend. As for the wildcard:

what would be the harm in just hitting them ALL with a wildcard?

The problem with that is that it would highlight standard library implementation code very confusingly. For example, consider (part of) the libc++ implementation of std::copy:

__copy(_Tp* __first, _Tp* __last, _Up* __result)
{
    const size_t __n = static_cast<size_t>(__last - __first);
    if (__n > 0)
        _VSTD::memmove(__result, __first, __n * sizeof(_Up));
    return __result + __n;
}

Highlighting all these regular identifiers as builtins or keywords is just confusing and counter-productive (and no other highlighter that I tried does it, including VS Code and Vim). Admittedly this only affects very specific projects since these identifiers are generally illegal, but discussions of standard library code is a non-negligible part of websites that could potentially use highlight.js, e.g. developer blogs and Stack Overflow.

By contrast, as a C++ developer I don’t really have any expectation to see non-standard extensions treated specially, so if a __thingy fails to be highlighted as a built-in or keyword I probably wouldn’t even notice. In other words: false negatives seem less problematic than false positives to me.

This is supported by the lack of highlighting of these names on GitHub. Even VS Code only highlights __attribute__ under very specific circumstances. In the following, only the first usage is highlighted.

int f()
__attribute__ ((noinline));

int f()
__attribute__((noinline));
joshgoebel commented 4 years ago

Makes sense, guess I've only ever used a few of them (attribute, etc) and they always seemed pretty important to me... :-) and didn't realize the std library using __ so much for variable naming...

I probably won’t have time before the weekend.

No rush. :-) Looking forward to seeing what ya got.

joshgoebel commented 3 years ago

Happy New Year. Still wanting to work on this one?

klmr commented 3 years ago

Definitely. Apologies, I’ve had some other personal stuff interfere. I hope I’ll get some time to work on this soon.

joshgoebel commented 3 years ago

There is also https://github.com/highlightjs/highlight.js/pull/2954 which means C and C++ can be dealt with as separate things.

joshgoebel commented 3 years ago

I’m happy to collect these lists and submit a PR but I probably won’t have time before the weekend.

Which weekend? ;-) I know, I know, life gets in the way... any chance perhaps you could at least handle the "collect a list" part and post it here or is that actually the hard part in your mind? :-) So far of everyone talking here you seem the most qualified to furnish a pseudo-official list of what "might be a good idea"... once we had a list someone else could come along and do the actual implementation.