highlightjs / highlight.js

JavaScript syntax highlighter with language auto-detection and zero dependencies.
https://highlightjs.org/
BSD 3-Clause "New" or "Revised" License
23.52k stars 3.57k forks source link

Discuss: Higher fidelity language highlighting (in general) #2500

Open Turnerj opened 4 years ago

Turnerj commented 4 years ago

Taking over this issue to use it as the proposal for higher fidelity highlighting as a general idea and a plan to get there - if we should (I think we should).


[Editor] Original issue content:

The short version is I want to get this (the C# snippet from the highlight.js website): image

Closer to this (same snippet but in the latest version of Visual Studio 2019): image

Actual snippet of code:

using System.IO.Compression;

#pragma warning disable 414, 3021

namespace MyApplication
{
    [Obsolete("...")]
    class Program : IInterface
    {
        public static List<int> JustDoIt(int count)
        {
            Console.WriteLine($"Hello {Name}!");
            return new List<int>(new int[] { 1, 2, 3 })
        }
    }
}

My question isn't about writing the CSS to do that for a theme, its about having classes to target those styles in the first place. Where does the use of a specific highlight.js class start and end (eg. "hljs-function" for declarations but not calls?). Like, is there a spec somewhere for when something should be used?

I've read through some of the related issues like #1731 and #1378 so I do understand reluctance for adding classes that can target these types of cases as core functionality especially when not all languages do this. I guess I'm querying, more as a guide, if I were to implement (some of) this for C#, I don't want to step on toes about class name consistency.

joshgoebel commented 4 years ago

The snaps are a helpful "big picture" but I'd suggest you need to start with breaking out a specific well defined list of which things you want to highlight... and perhaps also which classes we might use to do so. That could be a basis for a discussion. There may be one or two things we're open to, but I think you'll find most of it is off the table [for the core library] for a few different reasons (that are probably also started elsewhere):

If you provide a specific list we could talk about perhaps what we COULD improve in the core library. I for one wouldn't mind seeing function dispatch highlighted. IE coloring writeLine in the following.

Console.writeLine("...");

But even then the problem becomes which classes to use. function has always meant function declaration, not invocation.


Now, another angle is if you want to branch off on your own. Fork the C# grammar, provide a custom theme (with all new CSS classes), and publish a 3rd party grammar module. Then all possibilities are open to you (within the technical constraints of what the engine itself can support)... This likely will require time and ongoing maintenance of course...

Honestly I think the best outcome for many languages would be if the language community got super interested and maintained their own 3rd party grammars rather than all the grammars being first-party.

joshgoebel commented 4 years ago

That said the VS Code is all OSS. If you want syntax highlighting "that good" you might want to look at just using the same code Visual Studio uses to do the highlighting. But their syntax file for some languages is like 1mb while ours is like 10kb.... so that level of detail comes at a price.

joshgoebel commented 4 years ago

Also if you attached plain-text versions for anyone who wanted to play with this that'd be greatly appreciated.

Turnerj commented 4 years ago

That said the VS Code is all OSS. If you want syntax highlighting "that good" you might want to look at just using the same code Visual Studio uses to do the highlighting. But their syntax file for some languages is like 1mb while ours is like 10kb.... so that level of detail comes at a price.

Yeah, there are definitely trade-offs if I wanted to match my example images 100%. If I can bring them a little closer though that would be great. It was just funny seeing a code snippet of mine, with highlighting enabled, seem like nothing really was highlighted except for like 1 if-statement.

Of the different constructs, I think function calls are probably the easiest with a Regex similar to (?<functionCall>[a-zA-Z0-9_]+)\( (though it would capture this [Obsolete("...")] erroneously as a function call). I'm thinking maybe avoiding the re-use of hljs-function to keep better backwards compatibility and instead could have hljs-invocation as it is exactly what we are doing. It does mean other uses of that function wouldn't work (eg. highlighting of nameof(myFunctionName) would leave myFunctionName unhighlighted but this goes back to your point above).

For doing things like how return is a different colour - it might be simple enough to have just different categories of keywords. Could then use two classes of hljs-keyword and additionally hljs-keyword-N where N is the category/group index. That way this isn't coded too specifically to a language construct, just an extension of how the existing code base works. While saying this though, return might be the only case in C# like that which I can recall off the top of my head so this entirely might be overkill.

The other cases start to get a bit more difficult as they rely on greater context and are more language specific. Like its easy enough to have a Regex pick up "IInterface" as C# follows the convention of an "I" prefix on interfaces however many languages don't have interfaces. Then are we splitting declaration of an interface (public interface IInterface) from its use like a cast (IInterface)myObj? I could live without this being specifically detected.

One thought I had was also potentially separating out and/or allowing extension of "built_in". That is, while the built-in types might work well from something like JS, if we counted the entire System namespace in C# as built-in (which it effectively is), there would be hundreds of items which would be a huge bloat to the language definition. Being able to define or extend it to be more like "known types" as a separate package could allow flexibility both ways.

So after writing all this, I think really two specific pieces would be helpful and easy enough to approach:

Turnerj commented 4 years ago

Also if you attached plain-text versions for anyone who wanted to play with this that'd be greatly appreciated.

Not entirely sure what you mean here. Like I know what "plain text" is but are you referring to the actual code snippet from the earlier screenshots and just wanting a non-image copy of it? Happy to do that, just not sure if that's what you meant.

joshgoebel commented 4 years ago

but are you referring to the actual code snippet from the earlier screenshots

Yes.

joshgoebel commented 4 years ago

I wonder what theme you were using there since many of the "titles" are indeed highlighed.

Screen Shot 2020-04-21 at 10 56 51 PM
joshgoebel commented 4 years ago

Extending built-in/known types (initially I could just add super common ones like List or Math though ideally would still want some more exotic - but still built-in - types to be included)

Definitely open to this. One of the easiest things, for sure.

Like its easy enough to have a Regex pick up "IInterface" as C# follows the convention of an "I" prefix on interfaces

Seems reasonable, except we already seem to highlight interfaces as titles from where I'm sitting.

and instead could have hljs-invocation as it is exactly what we are doing.

Naming sounds reasonable but I can't see adding it to core unless we had some automated way of updating all the existing themes... like perhaps copying another style or some sort of automated "adjustment" of styles... ie invocation is just like xyz, but 20% brighter, dimmer, etc...

There is no point in a new class (in core) if only 1% of the themes support it.

One thought I had was also potentially separating out and/or allowing extension of "built_in". That is, while the built-in types might work well from something like JS, if we counted the entire System namespace in C# as built-in (which it effectively is), there would be hundreds of items which would be a huge bloat to the language definition. Being able to define or extend it to be more like "known types" as a separate package could allow flexibility both ways.

I think this is best answered by someone forking C# and making a much better version (and maintaining it) - and then ours is removed or becomes the "basic" version. Though technically it's already possible to do this with requireLanguage. Look at Arduino for a simple example of how this can be done.

If the changes were significant though it'd be easy to start from scratch or fork though.

joshgoebel commented 4 years ago
#pragma warning disable 414, 3021

This might be simple too, but I'm not sure what the rules are... would numbers in a compiler directive always be highlighted or only for certain directives?

Turnerj commented 4 years ago

I wonder what theme you were using there since many of the "titles" are indeed highlighed.

That was the VS2015 theme.

May I ask what you used to generate that image with the class names displayed inline like that? Looks like it will totally help me in my own debugging.

I guess some of the titles are different to others - eg "namespace MyNamespace" vs "class MyClass" vs "interface IMyInterface". If a language parser could configure multiple class names where all of those are hljs-title's but the namespace might be hljs-level-1 with the class and interface definitions as hljs-level-2 or something. This way a theme can still style appropriately while not necessarily being aware of any specific language construct.

That said, while it is generic and would allow me to target what I want, it still feels a little bit of a kludge and probably not a good idea in the core.

Naming sounds reasonable but I can't see adding it to core unless we had some automated way of updating all the existing themes... like perhaps copying another style or some sort of automated "adjustment" of styles... ie invocation is just like xyz, but 20% brighter, dimmer, etc...

There is no point in a new class (in core) if only 1% of the themes support it.

I kinda see it more like - some of these themes were never designed with additional constructs/language pieces in mind. Actually applying a styling (even just brighter or dimmer) might be a bad thing whereas having a class available for a theme to use (if it was appropriate for the theme) might be a good thing.


I might have a mess around with the C# parser, at least adding some built-in types for C# if nothing else is easily approachable.

joshgoebel commented 4 years ago

May I ask what you used to generate that image with the class names displayed inline like that? Looks like it will totally help me in my own debugging.

tools/developer.html

If a language parser could configure multiple class names where all of those are hljs-title's but the namespace might be hljs-level-1 with the class and interface definitions as hljs-level-2 or something.

If we have an existing class that makes sense we might have room for some variety - there might be room to detect the titles differently, but I worry we may not.

I kinda see it more like - some of these themes were never designed with additional constructs/language pieces in mind.

That's why you'd use one of the existing styles (the closest match) and either copy it verbatim or have an algorithmic way of changing it subtly.

class available for a theme to use (if it was appropriate for the theme) might be a good thing.

3rd party language modules are free to innovate here, but I don't see a lot of room in core because I don't like the idea of "blessed" themes vs "old" themes.

To be clear: I'm not saying "never ever". Just if we're to add CSS classes to core it would require a reasonable plan to deal with old themes in some way other than saying "oh most of the themes just don't support those new classes"...

joshgoebel commented 4 years ago

One idea would be for a few new classes that add "nuance" to existing ones... or for grammars to add nuance that themes COULD support. title.interface, ala how TextMate grammars work. Then all themes would still highlight it as a title, but some might highlight it with more nuance.

That might be worth discussing.

It might be interesting to take a complex piece of C#/JS/TS code and spit out all the Textmate scopes involved and then see what that looks like.

Turnerj commented 4 years ago

Thanks for that debugging link!

3rd party language modules are free to innovate here, but I don't see a lot of room in core because I don't like the idea of "blessed" themes vs "old" themes.

That's totally fair! There are a ton of first-party supported themes which is a blessing and a curse in cases like this.

One idea would be for a few new classes that add "nuance" to existing ones... or for grammars to add nuance that themes COULD support. title.interface, ala how TextMate grammars work. Then all themes would still highlight it as a title, but some might highlight it with more nuance.

I'm relatively unfamiliar with TextMate however that sounds like what I'm thinking. I wouldn't want a grammar to not be supplying the common/existing class - it should 100% provide that (for backwards compatibility + simplicity for more "basic"/light themes) and an optional something else that can provide it a little more context.

joshgoebel commented 4 years ago

Related: https://github.com/highlightjs/highlight.js/issues/2521

Turnerj commented 4 years ago

I love the proposal you've written up into that issue - would it be best to close this one and have that as the canonical reference for this type of functionality going forward?

joshgoebel commented 4 years ago

If you think it covers everything you intended to and don't see a need for this issue anymore then yes, please close. :)

joshgoebel commented 4 years ago

@Turnerj Are you actually interested in putting in some real effort on this for the csharp grammar? Could that effort extend to working with several themes to see about adding support for classes we already support but aren't broadly supported in themes? Like operator?

Many of our themes are copies or derivatives of other themes and it'd be fairly easy (but time consuming) to track down the colors they used for more nuanced highlighting and then port them back into our themes.

I think to make any real movement on this we need:

IE, a larger effort to make "high fidelity" work.. not just a single language and a single theme doing something "crazy"... but all of this takes time.

Turnerj commented 4 years ago

I'm definitely willing to help out in getting movement on this though will likely need guidance etc on it as I really don't do much JS development anymore (eg. easiest path for testing/debugging grammars and themes). If there was a framework/example that I can base my changes on (eg. if you have done it for JS, I can base my changes on C# on that with any tweaks I need to for the Regex etc) and similarly with a handful of themes.

My personal priority for themes would be starting on the VS theme (I know, a little selfish...) but happy to do 5 or 6 other themes after that so more themes support the fidelity. Do you know which ones are the most popular?

Happy to be part of the discussions as well for working out the different classes/class extensions to aim for too.

Additionally if there are any docs we need to write or update afterwards, happy to help out in that aspect too.

joshgoebel commented 4 years ago

@egor-rogov @isagalaev

Are either of you diametrically opposed to this direction? I know philosophically we've avoiding things like highlighting operators... ...except we do it for ReasonML - and PowerShell, it's even documented in CSS classes! And we've philosophically avoided highlighting variables (the whole PHP debacle)... ...except we actually highlight variables in MANY first party language grammars (48 results for "variable" in 37 files). What is that 20%?

I'd like to see us change direction on both these issues and see what it would take to increase the fidelity of our highlighting to get closer to what the major editors can do. I'm not talking about 1mb language definition files or writing real parsers that understand the grammar or anything crazy. I'm talking about small little changes that go a long way:

This may require creating a few additional (or additive) CSS classes. See #2521.

While doing this I'd like to ensure:

This should be a good start to prove what we're doing has real benefit and is 'possible' without making a mess of everything. But I don't want to jump into this if everyone isn't behind it.

joshgoebel commented 4 years ago

CC @allejo

joshgoebel commented 4 years ago

IE that we get a lot closer to this:

Screen Shot 2020-05-20 at 8 46 24 AM

Than this:

Screen Shot 2020-05-20 at 8 46 29 AM

Sorry for different themes.

joshgoebel commented 4 years ago

I think it's important to keep in mind if someone wants a 'lower fidelity' experience for some reason it's trivial to edit a CSS theme to turn off highlighting for any items they disliked.

Usually all our requests are for higher fidelity highlighting. Highlight better, not less.

egor-rogov commented 4 years ago

I'm not against it at all. The plan you proposed looks sane and realistic. I'll try to join this effort from sql and pgsql angle.

joshgoebel commented 4 years ago

I'll jump in then after I get the JS/TS foundational stuff resolved - so then I'm only doing the work for JS/TS once. :-)

Stanzilla commented 4 years ago

Highly suggest reading https://raphlinus.github.io/xi/2020/06/27/xi-retrospective.html on this topic.

Maybe there is a way to collaborate with https://marijnhaverbeke.nl/blog/lezer.html Lezer.

joshgoebel commented 4 years ago

All this is super interesting (and I played with the xi editor long ago) - thanks for the links, but I'm not sure what any real-life take-aways might be for Highlight.js. I can't see where anyone has the time right now to entirely replace our built-in parser with "something new" and then update our legacy of 184 core languages + however many 3rd party ones makes this a very large ship to turn like that.

And if someone merely wanted to "add a choice" of say using Lezer with Highlight.js... ie, let Lezer build a parse tree and then HLJS turns it into HLJS compatible HTML at the very end... I think we may have the tools right now for plugins to build out this kind of functionality on their own - if anyone wanted to play with the idea. Or at least the codebase is modular enough to allow someone who wanted to try this kind of thing to make a real go of it.

joshgoebel commented 1 year ago

Related: #3623 #3619

If we potentially have the ability to use other parsing engines, should we consider allowing them in core? Right now the price is high for Lezer (24kb-40kb compressed, depending on how much of the Lezer engine we ship vs require to be external), but much lower for Prism (which is already a tiny engine)...

The 24kb cost is JUST the Lezer JSX/TSX parser. 40kb is if we add the whole Lezer stack (at least the pieces we need). This is tersed + gziped.

I'm just thinking if our own JSX support is broken/inadequate... and other engines already have "much better" support, should we really be spending time/effort to reinvent the wheel (which our parsing engine might not even yet have the chops to do) ourselves vs just taking that existing support and tapping into it?

For example Prism core + HTML, JS, JSX, TSX is only 20kb... our own JS and TS support weighs in at around 16kb... (both uncompressed) and Prism currently seems to have much better JSX support since they dynamically rewrite the nodes after parsing to deal with balancing HTML tags, etc...

I'm super interested in making such things possible (outside of core)... but if then we have very superior JSX solutions outside of core - what does that say about the JSX support we build into core? Does it just live forever in it's suboptimal state, or would we just drop JSX/TSX support entirely and point to the external options that do them much better?

joshgoebel commented 1 year ago

Also worth noting using something like Lezer we can have incredible highlighting fidelity - since a proper Lezer grammar knows what even single token is... which is why I mention it here. So lets try to keep this discussion focused on the relationship with fidelity.

For comments on just using other parsers in general the two other related threads would be better places to discuss.