highlightjs / highlight.js

JavaScript syntax highlighter with language auto-detection and zero dependencies.
https://highlightjs.org/
BSD 3-Clause "New" or "Revised" License

Idea: Make it easy to plug-in Prism or VS Code/TextMate grammars. #2212

Closed joshgoebel closed 8 months ago

joshgoebel commented 5 years ago

Brilliant idea or terrible idea? Would this enhance the ecosystem or destroy it?

foo123 commented 5 years ago

Check out this addon: https://github.com/foo123/highlightjs-grammar. It works for all syntax highlighters and enables you to define a grammar specification for a language and use it as-is in every highlighter of your choice (including highlight.js and Prism).

joshgoebel commented 5 years ago

and enables you to define a grammar specification for a language

I think that is the hard part, unless you plan to publish and maintain a huge repository of grammars yourself. This is why (most) people come to a library, I think: for the supported languages. :-) For most people, saying "you just have to write your own grammar" is almost as bad as "you just have to write your own parser". :-)

What we'd gain (if, say, we supported Prism grammars) is instant support for all the languages that Prism supports and we do not. The question is whether that might hurt us by causing fewer people to want to develop Highlight.js grammars in the first place...

foo123 commented 5 years ago

@yyyc514 Of course you have a point in that grammars need to be written by someone, so it would be good if a repository of grammars existed. I don't mind having such a repository (maybe in the future), and the addon repos already include grammars for some popular languages (e.g. PHP, Python, CSS, JavaScript, HTML) as examples of how to create fully functional grammars.

But again, the grammar addons started from my own use cases, where I had to highlight custom languages or DSLs that were not supported by any known language definition. It would have been a pain in the neck to create parsers all the time, especially if each parser was tied to a specific highlighter and I had to write a new one for every other highlighter. So highlighting custom languages was the starting point of the grammar addons.

Of course creating a grammar requires some minimal knowledge (e.g. of BNF specifications), but it is far less than writing a full parser and delving into the internals of a syntax-highlighting framework. So I think it is a win-win situation.

Again, all of this is optional, only for those who benefit from such an approach; everyone else can keep using what already exists. This is only an addon, not necessarily a replacement for existing frameworks or languages.

joshgoebel commented 5 years ago

BNF spec... how deep does that get you? Are you really just lexing mostly or is it contextual enough to know you're inside a class, that this is a function call, not a variable definition (they can look the same in C, etc), etc?

I'd be curious to see a BNF for JSX or something complicated and see how well it worked compared to, say, what we currently have. There are probably things we could learn.

foo123 commented 5 years ago

The parsers created by the grammar addons are similar to PEGs which are mildly stronger than context-free parsers. But there are other features (like grammar actions) which enable strong context-sensitive parsing (not simply context-free).

Take a look at the xml grammar which not only highlights xml properly but detects errors, duplicate attributes, duplicate ids and so on..

The grammar reference is here

It is highly versatile and greatly eases the burden of creating highly functional, fully featured parsers with only minimal knowledge (and in fact knowledge about the language that is needed anyway, e.g. what a keyword, an identifier, or a comment looks like in that language, and so on).

Of course someone may need a feature not supported by the grammar specification, or not even possible at all with a grammar specification (and in that case they would need to write their own highly detailed parser), but I doubt such a feature will be found.

joshgoebel commented 5 years ago

Take a look at the xml grammar which not only highlights xml properly but detects errors, duplicate attributes, duplicate ids and so on..

Yeah, but that's the easy stuff (in my book). We could support the same if we just had callbacks inside a mode that let you do things like save a little state and dynamically apply the rule or skip it, etc...

I'm talking about harder stuff like:

https://github.com/highlightjs/highlight.js/issues/1987
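
For context: mode-level callbacks along these lines did eventually land in Highlight.js (the on:begin / on:end hooks with response.ignoreMatch()). A rough sketch of the "save a little state and skip a rule" idea; the class-tracking logic is purely illustrative and resetting between parses is omitted:

const classDepth = { value: 0 };

const CLASS_BODY = {
  begin: /\bclass\b[^{]*\{/,
  end: /\}/,
  // remember that we entered / left a class body
  'on:begin': (match, response) => { classDepth.value++; },
  'on:end': (match, response) => { classDepth.value--; },
  contains: [ /* ... */ ]
};

const METHOD_TITLE = {
  scope: 'title.function',
  begin: /\b[a-z_]\w*(?=\s*\()/,
  // dynamically skip this rule when we are not inside a class body
  'on:begin': (match, response) => {
    if (classDepth.value === 0) response.ignoreMatch();
  }
};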

foo123 commented 5 years ago

If it can be modelled in a PEG grammar it can be done by the addon.

BNF spec... how deep does that get you? Are you really just lexing mostly or is it contextual enough to know you're inside a class, that this is a function call, not a variable definition (they can look the same in C, etc), etc?

Yes, you can know whether you are inside a class or function definition and so on, through grammar actions which create contexts. See, for example, the scoped grammar (further down the page) at this repo.

joshgoebel commented 4 years ago

@foo123 Have you written a tokenizer/parser yourself for your grammar files? Or is all you have a bunch of conversion scripts to convert them into various grammar files recognized by other engines?

foo123 commented 4 years ago

@yyyc514 My grammar addons have a full-blown PEG-like parser of their own. In fact they all share the same generic parser; only the interface code changes to adapt to a specific editor/highlighter.

joshgoebel commented 4 years ago

My grammar addons have a full-blown PEG-like parser of their own.

If that's true then you don't really need 99% of HLJS. You could use merely the new TokenTree plus HTMLRender... or perhaps only a slightly modified HTMLRender... that's only 42 lines of code.

First, turning your PEG grammars into valid HLJS modes seems like a lot of hard work with very little upside... why not just use a plugin instead that replaces our parser with your own? You abort the actual parsing with a before callback, and then in your after callback you'd have access to the emitter and could use its simple API paired with your parser's output stream.

I'm not 100% sure all the pieces are in place to support that, but I'd love to see it tried. I'm very curious to see what a 3rd party plugin/grammar would look like that wasn't actually native HLJS based.
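
For illustration, a rough sketch of that plugin shape, using the before:highlight hook that later shipped (myPegParser and renderTokensToHtml are hypothetical, and the result fields are approximate):

const pegParserPlugin = {
  'before:highlight': (context) => {
    // context carries { code, language }; setting context.result
    // short-circuits the built-in HLJS parser entirely
    const tokens = myPegParser.parse(context.code, context.language);
    context.result = {
      language: context.language,
      value: renderTokensToHtml(tokens), // or feed tokens to an HLJS emitter instead
      relevance: 0,
      illegal: false
    };
  },
  'after:highlight': (result) => {
    // post-process the final result here if needed
  }
};

hljs.addPlugin(pegParserPlugin);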

foo123 commented 4 years ago

@yyyc514 This is exactly what I was doing in the highlightjs-grammar addon: I tried to bypass the default hljs tokenizing and replace it with the addon's parser. But (continuing from the other issue) that was a bit hacky, which is why I asked whether a better, more modular way exists.

Since the grammar addons sport a full-blown parser and support defining a grammar through BNF specifications, they are really very transferable from one editor/highlighter to another. The grammar remains the same; only the editor's custom styling tags change. They are define-the-grammar-once-and-use-it-everywhere kinds of addons.

joshgoebel commented 4 years ago

that was a bit hacky, which is why I asked whether a better, more modular way exists

And the new way is using plugins. Shouldn’t need to touch the library source at all.

joshgoebel commented 4 years ago

Closing this issue - it never really received a lot of discussion. The new API callbacks for highlight and highlightBlock should make attempting this sort of thing much simpler in the future - not to mention the beta emitter support (if the existing parser did what you wanted regarding parsing itself but just didn't generate the right output). If anyone has any questions, feel free to ask.

gusbemacbe commented 2 years ago

Hello @foo123, @joshgoebel, @HKalbasi, and @RunDevelopment!

I was reading this conversation and also https://github.com/PrismJS/prism/issues/2848.

I was considering building new third-party languages that are missing from Highlight.js, @jgm's Pandoc, and Prism.js (see the table below), because most of my enterprise's colleagues use them on their documentation sites.

I was also considering using @foo123's highlightjs-grammar and prism-grammar to convert VS Code's third-party JSON syntaxes (converted from other formats by @pedro-w's TextMate Languages).

This is because I was planning to include some of these languages (the ones most of my colleagues use) in the VS Code Markdown theme project I am writing (using Highlight.js and Pandoc) and in the developers' Angular and React projects (using Prism.js). I was also working on changing the Highlight.js and Pandoc CSS colours to match the Prism.js colours, based on the developers' Angular project, for the Markdown project. In the process I noticed the languages missing from each syntax highlighter: @jgm's Pandoc is missing more languages than both Highlight.js and Prism.js, and Highlight.js is missing more than Prism.js. You can analyse the table:

Comparison of language support between syntax highlighters

🚫 - Not available
⚠️ - Third-party
❓ - Unknown or untested

Language Highlight.js Pandoc Prism.js
.ignore 🚫 gitignore, hgignore, ignore, npmignore
.properties properties 🚫 properties
Ada ada
Apache Configuration apache, apacheconf apache apacheconf
AppleScript applescript 🚫 applescript
Arduino arduino 🚫 arduino
AsciiDoc asciidoc 🚫 asciidoc
ASP.NET aspnet
Awk awk
Bash bash
Batch bat, cmd, dos 🚫 batch
BBCode ⚠️ bbcode 🚫 bbcode
BibTeX 🚫 bibtex 🚫
Changelog 🚫 changelog 🚫
Clojure clojure
CMake cmake
COBOL ⚠️ cobol 🚫 cobol
CoffeeScript coffeescript, coffee coffee coffeescript, coffee
Coq coq 🚫 coq
C c
C++ cpp, c++ cpp cpp
C# csharp, cs cs csharp, cs , dotnet
CSS css
CSS Extras css-extras
CSV 🚫 🚫 csv
cURL ⚠️ curl 🚫
Dart dart 🚫 dart
Delphi delphi 🚫 🚫
diff diff
Django django djangotemplate django
Docker docker, dockerfile dockerfile docker, dockerfile
DNS Zone dns 🚫 dns-zone
EditorConfig 🚫 🚫 editorconfig
Eiffel 🚫 eiffel eiffel
EJS ejs
Elixir elixir
Elm elm
ERB erb 🚫 erb
Erlang erlang
Excel Formula excel, xls xlsx 🚫 excel-formula, xls xlsx
Fortran fortran fortranfixed, fortranfree fortran
F# fsharp
gettext 🚫 🚫 gettext, po
Gherkin gherkin 🚫 gherkin
Git git
Go go go go
Go module go-module
Gradle gradle 🚫 gradle
GraphQL graphql
Groovy groovy
HAML haml 🚫 haml
Haskell haskell
HTML html
HTTP and HTTPS http, https 🚫 http, hsts
Ini ini
Java java
JavaDoc javadoc
Java JSP jsp jsp
JavaScript js, javascript javascript js, javascript
JavaScript Doc jsdoc
JavaScript Extras js-extras
JavaScript Templates js-templates
JSON json
JSON5 json5
JSONP jsonp
Julia julia
Kotlin kotlin, kt kotlin kotlin, kt, kts
LaTeX latex, tex latex latex, tex
Less less 🚫 less
Linden Scripting Language lsl 🚫 🚫
Liquid 🚫 🚫 liquid
Lisp lisp commonlisp lisp
Log 🚫 🚫 log
Lua lua
Makefile makefile
Markdown markdown, md markdown markdown, md
Mathematica mathematica, mma, wl mathematica mathematica, wolfram, wl
MatLab matlab
Maxima maxima 🚫
Mercury mercury 🚫 🚫
MongoDB 🚫 🚫 mongodb
neo4j ⚠️ cypher 🚫 cypher
nginx nginx, nginxconf 🚫 nginx
Nim nim
Nix nix 🚫 nix
OCaml ocaml
Objective C objectivec, objc, obj-c objectivec objectivec
Objective C++ obj-c++, objective-c++ objectivecpp
Octave ⚠️ octave octave 🚫
Patch patch
Pascal 🚫 pascal pascal
Perl perl
PHP php php php
PHP Blade ⚠️ blade
PHP Doc php-doc
PHP Extras php-extras
PHP Template php-template
PlantUML 🚫 🚫 plant-uml, plantuml
PostgreSQL and PL/pgSQL pgsql, postgres, postgresql 🚫 plsql
PostScript 🚫 postscript 🚫
PowerShell powershell, ps, ps1 powershell powershell
Prolog prolog
Pug 🚫 🚫 pug
Python py, python python py, python
Python profile profile
Python REPL python-repl
R r
Razor C# ⚠️ cshtml, razor, razor-cshtml cshtml, razor
React JS javascriptreact
React JSX jsx jsx
React TSX tsx tsx
ReasonML reasonml 🚫 reason
Regex regex
RPM spec file ⚠️ rpm-specfile 🚫 🚫
Ruby ruby
Rust rust
SASS sass sass
Scala scala
Scheme scheme
Scilab scilab 🚫 🚫
SCSS scss
sed sed
Shell Terminal shell 🚫 shell-session
SmallTalk smalltalk 🚫 smalltalk
Splunk ⚠️ spl 🚫 splunk-spl
SQL sql
SQL MySQL sqlmysql
SQL PostgreSQL sqlpostgresql
Stylus stylus, styl 🚫 stylus
Systemd configuration file 🚫 🚫 systemd
Svelte ⚠️ svelte 🚫 🚫
SVG svg
Swift swift
tcsh 🚫 tcsh 🚫
Terraform ⚠️ terraform, tf 🚫 🚫
texinfo 🚫 texinfo 🚫
Textile 🚫 🚫 textile
TOML toml
Twig twig 🚫 twig
TypeScript typescript, ts typescript typescript, ts
Typoscript typoscript, tsconfig
URI uri, url
Vala vala 🚫 vala
VBScript vbscript 🚫
VBSCript with HTML vbscript-html 🚫 🚫
Verilog verilog
Vim Script vim 🚫 vim
Visual Basic vbnet 🚫 visual-basic
WebAssembly wasm 🚫 wasm
Wiki Markup 🚫 ❓ mediawiki wiki
XML xml
Xorg 🚫 xorg 🚫
XSLT 🚫 xslt 🚫
YAML yaml
ZSH zsh

Languages missing from all syntax highlighters

joshgoebel commented 2 years ago

I was also considering using @foo123's highlightjs-grammar

Interesting tool... though personally I'm far more interested in integration at runtime, like if perhaps we could all agree on some type of loose intermediate representation or API... such that you could just use "all the highlighters" - and pick your output type/theme (prism/highlightjs/etc) but then not have to care about which engine is doing the actual parsing.

I'd be curious to know if @foo123 just invented this JSON spec from scratch or if it's some type of standard?

It would seem far easier to agree on a simple output API here (token, type, etc) than "one true" unified parser spec...

joshgoebel commented 2 years ago

@RunDevelopment

Would you have any thoughts on this? Such as if Prism.js [for example] implemented our own Emitter API as an intermediate layer... then one could ask Prism.js to do the parsing but pass it a HLJS Emitter instance and so the final HTML output would be HLJS style - and all our themes work, etc... any interest in this type of cross-compatibility? And of course the opposite would work as well, using Highlight.js grammars easily inside of Prism.js - but getting Prism style output/themes, etc...

The rough API:

https://github.com/highlightjs/highlight.js/blob/main/src/lib/token_tree.js#L104
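
For reference, that interface boils down to roughly the following (method names are approximate; token_tree.js is the authoritative source):

// Approximate shape of the HLJS Emitter interface: an engine that implements
// these methods can drive HLJS-style output without using the HLJS parser.
class SketchEmitter {
  constructor(options) { this.buf = ''; this.options = options; }
  addText(text) { /* plain, un-scoped text */ }
  openNode(kind) { /* open a scope, e.g. "keyword" */ }
  closeNode() { /* close the most recently opened scope */ }
  addKeyword(text, kind) { this.openNode(kind); this.addText(text); this.closeNode(); }
  finalize() { /* called once parsing is complete */ }
  toHTML() { return this.buf; /* produce the final markup */ }
}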

foo123 commented 2 years ago

I'd be curious to know if foo123 just invented this JSON spec from scratch or if it's some type of standard?

@joshgoebel The answer is both. As explained here, the format used, BNF+JSON, combines two standards: the BNF format for the grammar specification and the JSON format (which contains the BNF parts) as a versatile and ubiquitous data format. The details of the JSON format (i.e. the names and meanings of the fields) were made up by me based on a conceptual separation of functionality (i.e. lex model, syntax model, style model, extra settings), but they can be adjusted easily.
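
Purely as an illustration of that lex/syntax/style split (the field names and conventions below are approximations, not the addon's exact schema; the grammar reference has the real one):

const exampleGrammar = {
  "Lex": {
    "comment": { "type": "comment", "tokens": [["//", null]] }, // line comment to end-of-line
    "identifier": "[a-zA-Z_]\\w*",                              // regex source
    "keyword": { "tokens": ["if", "else", "while", "return"] }
  },
  "Syntax": {
    // BNF-style composition of the lexical tokens
    "program": "comment | keyword | identifier"
  },
  "Style": {
    // map grammar tokens to the highlighter's style classes
    "comment": "comment",
    "keyword": "keyword",
    "identifier": "variable"
  }
};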

RunDevelopment commented 2 years ago

any interest in this type of cross-compatibility

Honestly, yes. I've already seen multiple projects having to make multiple grammars for different syntax highlighters. I think it would be a win for the wider ecosystem if the two largest JS syntax highlighters supported the three most common grammar formats (TextMate grammars, HighlightJS grammars, PrismJS grammars).

We are currently working on Prism v2 and that includes a new custom tokenization API. This API is very simple. It's essentially just a function (code: string) => TokenStream (simplified), so as long as you can make a wrapper around a parser to convert its foreign AST/tokens into a Prism token stream, it will be supported. It should be possible to make an adapter for TextMate grammars and HighlightJS grammars. So from the token side of things, Prism will be able to support pretty much any parser/grammar format with the right adapter.
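
As a sketch of what such a wrapper could look like (someForeignParser and its token shape are hypothetical; Prism.Token follows the existing v1 constructor):

// Wrap a foreign parser so it can act as a Prism tokenizer: (code) => TokenStream.
// someForeignParser is assumed to return [{ kind, text }, ...] for the input code.
function foreignTokenizer(code) {
  return someForeignParser(code).map(tok =>
    tok.kind
      ? new Prism.Token(tok.kind, tok.text) // typed token, ends up with a CSS class
      : tok.text                            // untyped text stays a plain string
  );
}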

Themes are a different beast, though. Both HighlightJS and TextMate use scopes to assign colors. Prism does not. So while it should be possible to approximately map HighlightJS/TextMate scopes to Prism class names, the opposite direction will be harder.

However, it's not impossible. Prism has standard tokens with clearly defined semantics. So just mapping these standard tokens to scopes will do 80% of the job. Now for the remaining 20%. While tokens can have any number of class names, I would map these class names to scopes as follows:

  1. If the class names contain one standard token, use the mapped scope of that standard token.
    1. If the class names contain multiple standard tokens (very rare, but might happen), pick the first one.
    2. If the class names contain any non-standard tokens, pick the first one and add it as a subscope.
  2. If the class names do not contain any standard tokens, output the scope for plain text.

Not a perfect approach, but it should be a good start.
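
A small sketch of that mapping; only the rule order comes from the list above, while the standard-token-to-scope table and the dotted subscope format are assumptions:

// STANDARD_SCOPES is an assumed lookup table from Prism standard tokens to scopes,
// e.g. { keyword: 'keyword', string: 'string', comment: 'comment', ... }.
function classNamesToScope(classNames, STANDARD_SCOPES) {
  const standard = classNames.filter(name => name in STANDARD_SCOPES);
  const nonStandard = classNames.filter(name => !(name in STANDARD_SCOPES));

  if (standard.length === 0) return 'text';      // no standard token: plain text
  let scope = STANDARD_SCOPES[standard[0]];      // one (or the first) standard token
  if (nonStandard.length > 0) {
    scope += '.' + nonStandard[0];               // first non-standard token as a subscope
  }
  return scope;
}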

joshgoebel commented 2 years ago

This API is very simple. It's essentially just a function (code: string) => TokenStream (simplified)

Is this API still in development, is feedback welcome? I already shared the API we use for this (internally); I'd love to see what your idea looks like (interface-wise).

On our side, implementing this means tying these two interfaces together. I.e., to produce "prism" brand highlighting we'd have a custom Emitter that built a TokenStream... then its to_html method would need a way to ask Prism to turn that TokenStream into HTML.

RunDevelopment commented 2 years ago

Is this API still in development, is feedback welcome?

Still under development (current API) and yes :) The basic idea is that Prism's tokens, token streams, Prism.tokenize are already part of the public API. Custom tokenizers are essentially just a way to make your own Prism.tokenize. This allows grammars to choose their own tokenizer.

However, the user-facing API remains the same. Users still just call Prism.tokenize (or any of the higher-level functions) and this function invokes the custom tokenizer (if any). So custom tokenizers are a complete implementation detail and do not affect how grammars are used.

joshgoebel commented 2 years ago

@RunDevelopment Is your full list of scopes/class names documented anywhere now? I think we'd just need to work on a mapping table of our scopes to your class names.

Prism.tokenize are already part of the public API.

And the TokenStream API stable then? If so (from our side) I'm not sure we'd even need the per-grammar support... I imagine we could provide a wrapper on our end:

let javaPrism = hljs.wrapPrism(prism.languages.java)
hljs.registerLanguage("java", javaPrism)

And to our own grammars we'd add some sort of parseYourself (your tokenize) that took an emitter object as input... so any grammar could entirely replace the HLJS parser with a custom parser - only using our emitting pipeline.

And internally we'd do something like:

// call Prism.tokenize
// create an empty HLJS TokenTree
// loop over the Prism TokenStream, feeding it into our own TokenTree
// call `to_html` on our own token tree

All the magic happens inside the loop, where we'd have to deal with conversion of your class names into our scopes. Does that sound about right?
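
Roughly, assuming Emitter methods along the lines of token_tree.js (addText/openNode/closeNode/toHTML) and some class-name-to-scope mapping in the middle:

// Sketch: feed a Prism TokenStream into an HLJS emitter / TokenTree.
function prismStreamToEmitter(tokens, emitter) {
  for (const tok of tokens) {
    if (typeof tok === 'string') {    // plain, un-scoped text
      emitter.addText(tok);
      continue;
    }
    // a real implementation would map Prism class names to HLJS scopes here;
    // the raw Prism type is used for simplicity
    emitter.openNode(tok.type);
    const content = Array.isArray(tok.content) ? tok.content : [tok.content];
    prismStreamToEmitter(content, emitter);  // content may be a nested stream
    emitter.closeNode();
  }
}

// usage sketch:
//   const emitter = new TokenTreeEmitter(options);   // or any compatible emitter
//   const tokens = Prism.tokenize(code, Prism.languages.java);
//   prismStreamToEmitter(tokens, emitter);
//   const html = emitter.toHTML();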

joshgoebel commented 2 years ago

Not a perfect approach, but it should be a good start.

Could you possibly provide some actual examples for your 20% edge cases? When would you have multiple or non-standard tokens? Is this just because you'd allowed your grammars to get a little wild with what they are tagging things?

joshgoebel commented 2 years ago

@RunDevelopment Any chance we could talk you into adding some actual metadata to your own grammars in v2, like language names, common extensions, etc? :-)

Otherwise it's probably going to look more like this, so we can add back critical metadata:

let p = hljs.wrapPrism(prism.languages.csharp, {
  name: ".NET C#",
  aliases: ["cs"]
})
hljs.registerLanguage("csharp", p)

It'd be nice if you just pre-packaged it. :)

RunDevelopment commented 2 years ago

Is your full list of scopes/class names documented anywhere now?

Yes, and no. We have documented our standard tokens that are widely supported by themes and used in our languages, but there are also non-standard tokens.

Could you possibly provide some actual examples for your 20% edge cases? When would you have multiple or non-standard tokens? Is this just because you'd allowed your grammars to get a little wild with what they are tagging things?

Prism themes work using CSS specificity. So by adding more non-standard tokens, you gain finer control over your highlighting.

One example would be keywords. Many modern IDEs highlight control-flow keywords differently. In Prism, you can just give control flow keywords an additional control-flow class name, and themes that support the .keyword.control-flow selector will then highlight your control flow keyword differently. Example.

Even more fine-grained control can be obtained by adding more non-standard tokens, but most languages don't do that because even one non-standard token is already very specific.
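
For illustration, a hypothetical Prism language definition that does this, plus the theme selector that opts in:

// 'my-lang' is hypothetical; the point is the alias adding a second class name.
Prism.languages['my-lang'] = {
  'keyword': [
    { pattern: /\b(?:if|else|for|while|return|break|continue)\b/, alias: 'control-flow' },
    /\b(?:const|let|var|function|class)\b/
  ],
  'number': /\b\d+\b/,
  'punctuation': /[{}();,]/
};

// a theme that supports it (CSS):
//   .token.keyword.control-flow { font-style: italic; }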

And the TokenStream API stable then?

Yes. I believe the API has been stable since v1.0 and there are no plans to change it in v2.

I imagine we could provide a wrapper on our end

That's also what I was thinking for Prism. We would just need a function that wraps HLJS grammars into a custom tokenizer and that's it. Ideally, that function would even be another package completely independent of Prism. Instead of just adding support for a few grammar formats as part of the core library, I'd rather have independent packages that use Prism's API to support different parsers. This would enable people to write third party adapters for their own parsers using the same APIs.
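
A very rough sketch of that adapter-package shape; parseWithEmitter below is a placeholder for whatever hook HLJS exposes to capture its parse events, not a real API:

// Turn an HLJS grammar into a Prism-style tokenizer: (code) => TokenStream.
function hljsGrammarToPrismTokenizer(hljs, languageName) {
  return function tokenize(code) {
    const stream = [];
    // hypothetical: run the HLJS parser and receive (scope, text) events
    parseWithEmitter(hljs, languageName, code, (scope, text) => {
      stream.push(scope ? new Prism.Token(scope, text) : text);
    });
    return stream;
  };
}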

All the magic happens inside the loop, where we'd have to deal with conversion of your class names into our scopes. Does that sound about right?

Yup, that sounds like it'll work :)

Any chance we could talk you into adding some actual metadata to your own grammars in v2, like language names, common extensions, etc? :-)

We will ship aliases in v2 (example), but no titles. The thing with titles is that they aren't that useful. We only have one plugin that needs titles, and it just has a generated mapping with all titles.

What does HLJS use titles for anyway?

joshgoebel commented 2 years ago

What does HLJS use titles for anyway?

User convenience and completeness? LOL. I think they got added because I wanted to use them in the UI of a project.

The thing with titles is that they aren't that useful.

You're not wrong. 🤔

joshgoebel commented 2 years ago

@RunDevelopment And it works. :-) About 30 lines of code, without a proper style name mapper. A 2-3 line patch to HLJS itself to allow grammars to provide their own tokenizers.

Is there some good way to enumerate all the languages? Just Object.keys against Prism.languages? I'm guessing perhaps this all changes with v2 though? (sorry haven't fully been following that discussion)

joshgoebel commented 2 years ago

See also #3620. Laying out all the pieces for anyone who wants to play with this.

RunDevelopment commented 2 years ago

Is there some good way to enumerate all the languages?

Not yet. Prism v2 will have a components property, which is a registry. Adding some function to enumerate all languages/plugins wouldn't be a bad thing to add.

However, that only gets you all currently loaded components (=languages + plugins) of that Prism instance. So there might be other instances with more. We could expose our auto-generated list of all languages, but I'd rather not because that doesn't include third-party components. It's also highly version specific.

joshgoebel commented 2 years ago

I guess I was imagining perhaps a "simplest thing possible" API like:

importToHLJS(Prism, hljs)

Where that just asked Prism for the "supported/loaded/installed" languages and then looped over them, wrapped them, and injected them into HLJS.
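
Something like this, reusing the hypothetical hljs.wrapPrism from earlier in the thread:

function importToHLJS(Prism, hljs) {
  for (const name of Object.keys(Prism.languages)) {
    const grammar = Prism.languages[name];
    if (typeof grammar !== 'object') continue;   // skip helpers like extend/insertBefore
    if (hljs.getLanguage(name)) continue;        // don't clobber native HLJS grammars
    hljs.registerLanguage(name, hljs.wrapPrism(grammar));
  }
}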

RunDevelopment commented 2 years ago

Ah, in that case, enumerating all languages should be enough.

joshgoebel commented 8 months ago

Closing via #3620. If those engines can be run in a browser's JavaScript runtime then someone could indeed hook them up to us via #3620. So it's easy to do now (on our side); it just remains for someone else to put in the elbow grease to make it happen.