Open joshgoebel opened 4 years ago
// _Complex, __always_inline, etc.
var UNDERSCORE_KEYWORDS_RE = ["_[A-Z][a-z\\d_]+", "__[a-z\\d_]+"]
var CPP_KEYWORDS = {
keyword: 'int float while private char catch import module export virtual operator sizeof ' +
...
}
hljs.addKeywords(CPP_KEYWORDS["keyword"], UNDERSCORE_KEYWORDS_RE)
// or
hljs.addKeyword(CPP_KEYWORDS["keyword"], UNDESCORE_KEYWORD_RE, relevance: 1)
hljs.addKeyword(CPP_KEYWORDS["keyword"], DOUBLE_UNDESCORE_KEYWORD_RE, relevance: 2)
I think I'd prefer something like the above (a mini API to let someone add keywords to the actual keywords). The would solve the problem of "not to depend on how keyword flattener works"... since we'd be free to change the keyword string format in the future (but how likely is that really?) just so long as we still supported addKeyword
... right now all addKeyword
would really do is just add the regex and a space... and it could blow up if (or do nothing) if someone included a space in their regex...
The problem with saying "just use another matcher" is (other than having to change the file perfectly in like 10 places, and then worry about TESTING that those 10 changes all work properly) that matchers actually have very different behavior from keywords... keywords are always the last to match, match the "gaps", etc...
I see lots of advantages to just piggybacking off the existing keyword system.
Thoughts?
Something like that would be OK, but I think it's a bit complicated than it looks.
all addKeyword would really do is just add the regex and a space
So what about spaces in regex? And how about set some relevance to every keyword introduced by UNDERSCORE_KEYWORDS_RE
?
For me it seems that such API should somehow add keywords directly into compiled_keywords
structure (bypassing flattener).
So what about spaces in regex?
I addressed that:
and it could blow up if (or do nothing) if someone included a space in their regex...
It could even blow up, which would simply fail to pass tests... etc...
And how about set some relevance to every keyword introduced by UNDERSCORE_KEYWORDS_RE?
Pretty easy to do that with a for loop? These special cases should be one offs, this shouldn't be something needed everyday or very often. Though I'd be fine with the API taking an array as well...
hljs.addKeyword(CPP_KEYWORDS["keyword"], [A_RE, B_RE, C_RE], relevance: 2)
For me it seems that such API should somehow add keywords directly into compiled_keywords structure (bypassing flattener).
Except the compiling happen much much later in the process... this is just the raw data file...
What if we just allow it to take a string or array?
var UNDERSCORE_KEYWORDS_RE = [/_[A-Z][a-z\d_]+/, /__[a-z\d_]+/]
var CPP_KEYWORDS = {
keyword: ['int float while private char catch import module export virtual operator sizeof ' +
'', ...UNDERSCORE_KEYWORDS_RE]
...
}
That's newer syntax which I'd have to walk back, but you get the idea... so:
I'm worried we may be overthinking this. Let me think and dig into this a bit more. :) Think there may be a few things about how we do keywords I need to bone up on anyways.
So seems keywords don't really seem to support regex anyways, so that one grammar is mistaken unless I'm reading something wrong. So I think right now the only way to do this is another constant and then repeat it over and over inside contains
in each level, just as we repeat keywords
. (as @egor-rogov originally suggested) :-)
Tagging this beginner friendly as this should just be a matter of getting it done and writing some tests.
Now that we have some precedent what about:
keywords: {
$pattern: ...
$modes: [
{ begin: /__[a-z\d_]+/, className: "keyword" }
],
builtin: ...,
literal: ...,
var UNDERSCORE_KEYWORDS_RE =
As a way of attaching/grouping modes that exist for no other purpose than to match keywords, just when the keyword is a dynamic match (regex) instead of a static list? $modes would simply be copied to the end of whatever mode's contains
list...
Just popped into my head and I thought I'd jot it down.
Or perhaps something like dynamic_pattern
would be better, no word list, if the regex matches, it's a keyword, period...
keywords: {
$dynamic_patterns: {
[/_{1,2}[a-z\d_]+/, "keyword"]
]
Or a new namespace/suffix:
keywords: {
keyword$all: /_{1,2}[a-z\d_]+/, // className = "keyword"
builtin$all: /__(.*)__/, // className = "builtin"
keyword: "if when else done",
+1 for the last option. Probably even
keywords: {
$keyword: /_{1,2}[a-z\d_]+/,
keyword: "if when else done",
Well, actually the intent when we added $pattern was to reserve the whole $ namespace for keyword configuration, not random values. So in the future you could imagine:
keywords: {
$pattern: ...,
$pcre_regex_lib: true,
$ignoreCase: true,
So $keyword MAYBE could work but not if you want to be able to name the resulting matches with className... I'm actually coming around to the idea of a _re
suffix. What do you think:
keywords: {
keyword_re: /_{1,2}[a-z\d_]+/, // className = "keyword"
builtin_re: /__(.*)__/, // className = "builtin"
keyword: "if when else done",
_re
seems unlikely to hit any real name you might want to use, and even if it did these are just internal symbols, you could rename your name slightly to say keyword_regex
...
My problem is that I don't think _re screams "wildcard" or 'match all'... we could use * but then we'd have to stringily the keys:
keywords: {
'*keyword': /_{1,2}[a-z\d_]+/, // className = "keyword"
'*builtin': /__(.*)__/, // className = "builtin"
keyword: "if when else done",
Maybe maybe just seeing the regex next to it makes _re
clear enough in its intent? Actually the one worry I have here is that people could imagine they are configuring a pattern per type of keyword - rather than a match all... we could allow that but that isn't what this is. So I kind of like the * a little better I think.
I don't think we're there yet, but I like this avenue of exploration.
@allejo Any thoughts?
Maybe just seeing the regex next to it makes _re clear enough in its intent?
Yes, it's clear enough for me. I'm fine with keyword_re
etc. I don't think *
makes the intent clearer.
I'm going to go back and review some of the grammars that make HEAVY use of modes for keywords and see if there is anything else we should keep in mind while working on this. My one thought so far is we don't have a great way to specify relevance. We could fall back on some helper functions:
import { kw } from "keyword_helpers";
keywords: {
keyword_re: kw.matchAll(/_{1,2}[a-z\d_]+/, { relevance: 2 })
Although then I become temped to just make the input processing smarter:
import { kw } from "keyword_helpers";
keywords: {
// __boo_hoo will be default to 1 relevance, bob 10, suzy 20, other _ identifiers 2
keyword: [
"bob|10 suzy|20",
/__boo.*__/,
kw.matchAll(/_{1,2}[a-z\d_]+/, { relevance: 2 })]
For what it’s worth, double-underscore names are “reserved identifiers” in C++ but they are not keywords, and don’t generally act as such (though some implementations use such identifiers as non-standard language extensions).
When reading C++ code I certainly wouldn’t expect them to be highlighted in the same way as keywords, nor even necessarily builtins (though this is certainly debatable). For one thing, the reason they are reserved is so that standard library implementors can use these names and, consequently, such implementations are peppered with them. When discussing standard library implementation code it would be extremely counter-productive to have them all displayed as keywords. For all intents and purposes they’re regular identifiers, they’re just illegal to use for “normal” users.
Special-casing them is also arbitrary, because this is only part of the actual rule regarding reserved names, which was reproduced incorrectly in the related PR. The actual rule is:
A name is reserved if
- it contains a double underscore anywhere
- it starts with a single underscore, followed by a capital letter
- it starts with a single underscore, in the global namespace
Rule 3 can’t be implemented with lexical analysis alone. My recommendation is to leave all these names alone.
Interesting.
void static delayShort(uint16_t ms) __attribute__ ((noinline));
What is __attribute__
here, something else entirely? My editor (VS CODE) highlights such things and it seems helpful. (though it seems github does not)
Scope: meta.preprocessor.macro.cpp
Rule 3 can’t be implemented with lexical analysis alone.
I'm totally open to just closing this, but also our goal has never been perfection. If we decided this was worthy of highlighting as "meta" something it wouldn't bother me that we got a few edge cases wrong... that's the cost of being a highlighter that doesn't fully understand language context.
What is
__attribute__
here, something else entirely?
As far as standard C++ is concerned, nothing.
Leaving the ivory tower for a bit, it’s a widely used compiler extension. I do think highlighting this is useful (and VS Code’s classification seems OK, if not technically correct: it’s not a macro).
The issue with these constructs is that they differ between compilers (GCC and clang know __attribute__
but not __stdcall
and MSVC is the other way round, for example). Some of these identifiers are common enough that highlighting them makes sense, but I’d list them explicitly rather than pattern-match against __.*
, since only a handful are common. (With the caveat that GCC and clang define a large list of __builtin_.*
identifiers, so that should probably be a wildcard).
Unfortunately I’m not aware of a comprehensive list of such identifiers. The closest may be this for MSVC and this for GCC (the clang list probably has lots of overlap here), and it’s not exactly helpful. Maybe we can crib them from other highlighters.
The original filed issue was trying to add:
__stdcall __try __finally
since only a handful are common.
Are those three above common? Would you mind taking a stab at what you might consider the "common" list? If it's a short list I'd be happy to just add those manually and mark this done for now. :-)
Do the common ones make sense to rise to "keyword" status, or would you still consider them built-ins?
The issue with these constructs is that they differ between compilers
I'm not sure how that's super relevant? I understand your original objection if they are not technically keywords... but say we decided to highlight them otherwise (say with built_in
or some other tag) what would be the harm in just hitting them ALL with a wildcard? Then whatever C++ flavor someone is highlighting the __whatever_thingy
would always be highlighted - without us needing to maintain an explicit list. Seems like a win?
I’m happy to collect these lists and submit a PR but I probably won’t have time before the weekend. As for the wildcard:
what would be the harm in just hitting them ALL with a wildcard?
The problem with that is that it would highlight standard library implementation code very confusingly. For example, consider (part of) the libc++ implementation of std::copy
:
__copy(_Tp* __first, _Tp* __last, _Up* __result)
{
const size_t __n = static_cast<size_t>(__last - __first);
if (__n > 0)
_VSTD::memmove(__result, __first, __n * sizeof(_Up));
return __result + __n;
}
Highlighting all these regular identifiers as builtins or keywords is just confusing and counter-productive (and no other highlighter that I tried does it, including VS Code and Vim). Admittedly this only affects very specific projects since these identifiers are generally illegal, but discussions of standard library code is a non-negligible part of websites that could potentially use highlight.js, e.g. developer blogs and Stack Overflow.
By contrast, as a C++ developer I don’t really have any expectation to see non-standard extensions treated specially, so if a __thingy
fails to be highlighted as a built-in or keyword I probably wouldn’t even notice. In other words: false negatives seem less problematic than false positives to me.
This is supported by the lack of highlighting of these names on GitHub. Even VS Code only highlights __attribute__
under very specific circumstances. In the following, only the first usage is highlighted.
int f()
__attribute__ ((noinline));
int f()
__attribute__((noinline));
Makes sense, guess I've only ever used a few of them (attribute, etc) and they always seemed pretty important to me... :-) and didn't realize the std library using __
so much for variable naming...
I probably won’t have time before the weekend.
No rush. :-) Looking forward to seeing what ya got.
Happy New Year. Still wanting to work on this one?
Definitely. Apologies, I’ve had some other personal stuff interfere. I hope I’ll get some time to work on this soon.
There is also https://github.com/highlightjs/highlight.js/pull/2954 which means C and C++ can be dealt with as separate things.
I’m happy to collect these lists and submit a PR but I probably won’t have time before the weekend.
Which weekend? ;-) I know, I know, life gets in the way... any chance perhaps you could at least handle the "collect a list" part and post it here or is that actually the hard part in your mind? :-) So far of everyone talking here you seem the most qualified to furnish a pseudo-official list of what "might be a good idea"... once we had a list someone else could come along and do the actual implementation.
Stems from PR: https://github.com/highlightjs/highlight.js/pull/2020
See the related discussion there.
It'd be really nice to find a solution to this without a separate mode. This needs a little more though IMHO to see if that's possible before just breaking down and doing that.