Closed — joshgoebel closed this issue 8 months ago
Check out this addon https://github.com/foo123/highlightjs-grammar which works for all syntax highlighters and enables you to define a grammar specification for a language and use it as is in every highlighter of your choice (including highlightjs and prism)
and enables you to define a grammar specification for a language
I think that is the hard part, unless you plan to publish and maintain a huge repository of grammars yourself. This is why (most) people come to a library, I think: for the supported languages. :-) For most people, saying "you just have to write your own grammar" is almost as bad as "you just have to write your own parser". :-)
What we'd gain (if say we supported Prism grammars) is instant support for all the languages that Prism supports that we do not. The question is whether that might hurt us by causing fewer people to want to develop Highlight.js grammars in the first place...
@yyyc514 Of course you have a point in that the grammars still need to be written; otherwise it would be good if a repository of grammars existed. I don't mind having such a repository (maybe in the future), and the addon repos already include grammars for some popular languages (e.g. PHP, Python, CSS, JavaScript, HTML) as examples of how to create fully functional grammars.
But again, the grammar addons started from my own use cases, where I had to highlight custom languages or DSLs that no known language definition supported. It would have been a pain in the neck to create parsers all the time, especially since a parser tied to a specific highlighter would have to be rewritten for another highlighter. So highlighting custom languages was the starting point of the grammar addons.
Of course, creating a grammar requires some minimal knowledge (e.g. of BNF specifications), but that is far less than what is needed to write a full parser and delve into the internals of a syntax highlighting framework. So I think it is a win-win situation.
Again, all of this is optional and only for those who benefit from such an approach; everyone else can keep using what already exists. This is only an addon; it is not necessarily a replacement for existing frameworks or languages.
BNF spec... how deep does that get you? Are you really just lexing mostly or is it contextual enough to know you're inside a class, that this is a function call, not a variable definition (they can look the same in C, etc), etc?
I'd be curious to see a BNF for JSX or something complicated and see how well it worked compared to say what we currently have. Probably things we could learn.
The parsers created by the grammar addons are similar to PEGs, which are mildly stronger than context-free parsers. But there are other features (like grammar actions) which enable strong context-sensitive parsing (not simply context-free).
Take a look at the xml grammar which not only highlights xml properly but detects errors, duplicate attributes, duplicate ids and so on..
The grammar reference is here
It is highly versatile and greatly eases the burden of creating highly functional, feature-rich parsers with only minimal knowledge (knowledge about the language that is needed anyway, e.g. what counts as a keyword, an identifier, a comment for that language, and so on).
Of course, someone may need a feature that the grammar specification does not support, or that cannot be expressed in a grammar specification at all (in which case they would need to write their own highly detailed parser), but I doubt such a case will come up.
Take a look at the xml grammar which not only highlights xml properly but detects errors, duplicate attributes, duplicate ids and so on..
Yeah, but that's the easy stuff (in my book). We could support the same if we just had callbacks inside a mode that let you do things like save a little state and dynamically apply the rule or skip it, etc...
I'm talking about harder stuff like:
If it can be modelled in a PEG grammar it can be done by the addon.
BNF spec... how deep does that get you? Are you really just lexing mostly or is it contextual enough to know you're inside a class, that this is a function call, not a variable definition (they can look the same in C, etc), etc?
Yes, you can know whether you are inside a class or function definition and so on, through grammar actions, which create contexts. See for example the scoped grammar (further down the page) in this repo.
@foo123 Have you written a tokenizer/parser yourself for your grammar files? Or do you just have a bunch of conversion scripts to convert them into various grammar files recognized by other engines?
@yyyc514 My grammar addons have a full-blown PEG-like parser of their own. In fact they all have the same generic parser and only the interface code changes to adapt to a specific editor/highlighter
My grammar addons have a full-blown PEG-like parser of their own.
If that's true then you don't really need 99% of HLJS. You could use merely the new TokenTree plus HTMLRender... or perhaps only a slightly modified HTMLRender... that's only 42 lines of code.
First, turning your PEG grammars into valid HLJS modes seems like a lot of hard work with very little upside... why not just use a plugin that replaces our parser with your own? You'd abort the actual parsing with a before callback, and then in your after callback you'd have access to the emitter and could use its simple API paired with your parser's output stream.
I'm not 100% sure all the pieces are in place to support that, but I'd love to see it tried. I'm very curious to see what a 3rd party plugin/grammar would look like that wasn't actually native HLJS based.
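For illustration, a plugin along these lines might look roughly like the untested sketch below. The `"mydsl"` language name and `myExternalParse()` are placeholders for whatever external grammar/parser is being plugged in, and the shape of `context.result` is an assumption about what `highlight()` results generally contain.

```js
// Untested sketch of the "before callback" approach described above.
// Assumptions: hljs is already loaded (script tag or require('highlight.js')),
// "mydsl" is a hypothetical language name, and myExternalParse() is a
// placeholder for an external parser that returns ready-made HTML markup.
hljs.addPlugin({
  "before:highlight": (context) => {
    if (context.language !== "mydsl") return;
    // Supplying context.result short-circuits the normal HLJS parse,
    // so the built-in tokenizer never runs for this language.
    context.result = {
      language: context.language,
      relevance: 0,
      illegal: false,
      value: myExternalParse(context.code) // pre-rendered HTML from the foreign parser
    };
  }
});
```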
@yyyc514 This is exactly what I was doing in the highlightjs-grammar addon: I tried to bypass the default HLJS tokenizing and replace it with the addon's parser. But (continuing from the other issue) that was a bit hacky, which is why I asked if a better, more modular way exists.
Since the grammar addons sport a full-blown parser and support defining a grammar through BNF specifications, they are very transferable from one editor/highlighter to another. The grammar remains the same; only the editor's custom styling tags change. They are define-once, use-everywhere kinds of addons.
that was a bit hacky, which is why I asked if a better, more modular way exists
And the new way is using plugins. Shouldn’t need to touch the library source at all.
Closing this issue - it never really received a lot of discussion. The new API callbacks for `highlight` and `highlightBlock` should make attempting this sort of thing much simpler in the future - not to mention the beta emitter support (if the existing parser did what you wanted regarding parsing itself but just didn't generate the right output). If anyone has any questions, feel free to ask.
Hello @foo123, @joshgoebel, @HKalbasi, and @RunDevelopment!
I was reading this conversation and also https://github.com/PrismJS/prism/issues/2848.
I was considering building new third-party languages that are missing from Highlight.js, @jgm's Pandoc, and Prism.js (see the table below), because most of my colleagues at my company use them on their documentation sites.
I was also considering using @foo123's highlightjs-grammar and prism-grammar to convert VSCode's third-party JSON syntaxes (converted from other formats by @pedro-w's TextMate Languages), such as:
- `.env` by @mikestead
- `conf`, `cfg`, `log`, `out` and `tmp` by @xshrim

I was planning to include some of these languages, the ones my colleagues use most, in my VSCode Markdown theme project (which uses Highlight.js and Pandoc) and in the developers' Angular and React projects (which use Prism.js). For the Markdown project I was also working on changing the Highlight.js and Pandoc CSS colours to match the Prism.js colours used in the developers' Angular project.
While comparing the syntax highlighters I noticed the missing languages: @jgm's Pandoc is missing more languages than both Highlight.js and Prism.js, and Highlight.js is missing more than Prism.js. You can analyse the table:
🚫 - Not available
⚠️ - Third-party
❓ - Unknown or untested
Language | Highlight.js | Pandoc | Prism.js |
---|---|---|---|
.ignore | ❓ | 🚫 | `gitignore`, `hgignore`, `ignore`, `npmignore` |
.properties | `properties` | 🚫 | `properties` |
Ada | `ada` | | |
Apache Configuration | `apache`, `apacheconf` | `apache` | `apacheconf` |
AppleScript | `applescript` | 🚫 | `applescript` |
Arduino | `arduino` | 🚫 | `arduino` |
AsciiDoc | `asciidoc` | 🚫 | `asciidoc` |
ASP.NET | ❓ | ❓ | `aspnet` |
Awk | `awk` | | |
Bash | `bash` | | |
Batch | `bat`, `cmd`, `dos` | 🚫 | `batch` |
BBCode | ⚠️ `bbcode` | 🚫 | `bbcode` |
BibTeX | 🚫 | `bibtex` | 🚫 |
Changelog | 🚫 | `changelog` | 🚫 |
Clojure | `clojure` | | |
CMake | `cmake` | | |
COBOL | ⚠️ `cobol` | 🚫 | `cobol` |
CoffeeScript | `coffeescript`, `coffee` | `coffee` | `coffeescript`, `coffee` |
Coq | `coq` | 🚫 | `coq` |
C | `c` | | |
C++ | `cpp`, `c++` | `cpp` | `cpp` |
C# | `csharp`, `cs` | `cs` | `csharp`, `cs`, `dotnet` |
CSS | `css` | | |
CSS Extras | ❓ | ❓ | `css-extras` |
CSV | 🚫 | 🚫 | `csv` |
cURL | ⚠️ `curl` | 🚫 | ❓ |
Dart | `dart` | 🚫 | `dart` |
Delphi | `delphi` | 🚫 | 🚫 |
diff | `diff` | | |
Django | `django` | `djangotemplate` | `django` |
Docker | `docker`, `dockerfile` | `dockerfile` | `docker`, `dockerfile` |
DNS Zone | `dns` | 🚫 | `dns-zone` |
EditorConfig | 🚫 | 🚫 | `editorconfig` |
Eiffel | 🚫 | `eiffel` | `eiffel` |
EJS | ❓ | ❓ | `ejs` |
Elixir | `elixir` | | |
Elm | `elm` | | |
ERB | `erb` | 🚫 | `erb` |
Erlang | `erlang` | | |
Excel Formula | `excel`, `xls`, `xlsx` | 🚫 | `excel-formula`, `xls`, `xlsx` |
Fortran | `fortran` | `fortranfixed`, `fortranfree` | `fortran` |
F# | `fsharp` | | |
gettext | 🚫 | 🚫 | `gettext`, `po` |
Gherkin | `gherkin` | 🚫 | `gherkin` |
Git | ❓ | ❓ | `git` |
Go | `go` | `go` | `go` |
Go module | ❓ | ❓ | `go-module` |
Gradle | `gradle` | 🚫 | `gradle` |
GraphQL | `graphql` | | |
Groovy | `groovy` | | |
HAML | `haml` | 🚫 | `haml` |
Haskell | `haskell` | | |
HTML | `html` | | |
HTTP and HTTPS | `http`, `https` | 🚫 | `http`, `hsts` |
Ini | `ini` | | |
Java | `java` | | |
JavaDoc | ❓ | `javadoc` | ❓ |
Java JSP | `jsp` | `jsp` | ❓ |
JavaScript | `js`, `javascript` | `javascript` | `js`, `javascript` |
JavaScript Doc | ❓ | ❓ | `jsdoc` |
JavaScript Extras | ❓ | ❓ | `js-extras` |
JavaScript Templates | ❓ | ❓ | `js-templates` |
JSON | `json` | | |
JSON5 | ❓ | ❓ | `json5` |
JSONP | ❓ | ❓ | `jsonp` |
Julia | `julia` | | |
Kotlin | `kotlin`, `kt` | `kotlin` | `kotlin`, `kt`, `kts` |
LaTeX | `latex`, `tex` | `latex` | `latex`, `tex` |
Less | `less` | 🚫 | `less` |
Linden Scripting Language | `lsl` | 🚫 | 🚫 |
Liquid | 🚫 | 🚫 | `liquid` |
Lisp | `lisp` | `commonlisp` | `lisp` |
Log | 🚫 | 🚫 | `log` |
Lua | `lua` | | |
Makefile | `makefile` | | |
Markdown | `markdown`, `md` | `markdown` | `markdown`, `md` |
Mathematica | `mathematica`, `mma`, `wl` | `mathematica` | `mathematica`, `wolfram`, `wl` |
MATLAB | `matlab` | | |
Maxima | `maxima` | 🚫 | |
Mercury | `mercury` | 🚫 | 🚫 |
MongoDB | 🚫 | 🚫 | `mongodb` |
neo4j | ⚠️ `cypher` | 🚫 | `cypher` |
nginx | `nginx`, `nginxconf` | 🚫 | `nginx` |
Nim | `nim` | | |
Nix | `nix` | 🚫 | `nix` |
OCaml | `ocaml` | | |
Objective-C | `objectivec`, `objc`, `obj-c` | `objectivec` | `objectivec` |
Objective-C++ | `obj-c++`, `objective-c++` | `objectivecpp` | ❓ |
Octave | ⚠️ `octave` | `octave` | 🚫 |
Patch | `patch` | ❓ | ❓ |
Pascal | 🚫 | `pascal` | `pascal` |
Perl | `perl` | | |
PHP | `php` | `php` | `php` |
PHP Blade | ⚠️ `blade` | ❓ | ❓ |
PHP Doc | ❓ | ❓ | `php-doc` |
PHP Extras | ❓ | ❓ | `php-extras` |
PHP Template | `php-template` | ❓ | ❓ |
PlantUML | 🚫 | 🚫 | `plant-uml`, `plantuml` |
PostgreSQL and PL/pgSQL | `pgsql`, `postgres`, `postgresql` | 🚫 | `plsql` |
PostScript | 🚫 | `postscript` | 🚫 |
PowerShell | `powershell`, `ps`, `ps1` | `powershell` | `powershell` |
Prolog | `prolog` | | |
Pug | 🚫 | 🚫 | `pug` |
Python | `py`, `python` | `python` | `py`, `python` |
Python profile | `profile` | ❓ | ❓ |
Python REPL | `python-repl` | ❓ | ❓ |
R | `r` | | |
Razor C# | ⚠️ `cshtml`, `razor`, `razor-cshtml` | ❓ | `cshtml`, `razor` |
React JS | ❓ | `javascriptreact` | ❓ |
React JSX | `jsx` | ❓ | `jsx` |
React TSX | `tsx` | ❓ | `tsx` |
ReasonML | `reasonml` | 🚫 | `reason` |
Regex | ❓ | ❓ | `regex` |
RPM Spec file | ⚠️ `rpm-specfile` | 🚫 | 🚫 |
Ruby | `ruby` | | |
Rust | `rust` | | |
SASS | ❓ | `sass` | `sass` |
Scala | `scala` | | |
Scheme | `scheme` | | |
Scilab | `scilab` | 🚫 | 🚫 |
SCSS | `scss` | | |
sed | ❓ | `sed` | ❓ |
Shell Terminal | `shell` | 🚫 | `shell-session` |
Smalltalk | `smalltalk` | 🚫 | `smalltalk` |
Splunk | ⚠️ `spl` | 🚫 | `splunk-spl` |
SQL | `sql` | | |
SQL MySQL | ❓ | `sqlmysql` | ❓ |
SQL PostgreSQL | ❓ | `sqlpostgresql` | ❓ |
Stylus | `stylus`, `styl` | 🚫 | `stylus` |
Systemd configuration file | 🚫 | 🚫 | `systemd` |
Svelte | ⚠️ `svelte` | 🚫 | 🚫 |
SVG | `svg` | | |
Swift | `swift` | | |
tcsh | 🚫 | `tcsh` | 🚫 |
Terraform | ⚠️ `terraform`, `tf` | 🚫 | 🚫 |
texinfo | 🚫 | `texinfo` | 🚫 |
Textile | 🚫 | 🚫 | `textile` |
TOML | `toml` | | |
Twig | `twig` | 🚫 | `twig` |
TypeScript | `typescript`, `ts` | `typescript` | `typescript`, `ts` |
TypoScript | ❓ | ❓ | `typoscript`, `tsconfig` |
URI | ❓ | ❓ | `uri`, `url` |
Vala | `vala` | 🚫 | `vala` |
VBScript | `vbscript` | 🚫 | ❓ |
VBScript with HTML | `vbscript-html` | 🚫 | 🚫 |
Verilog | `verilog` | | |
Vim Script | `vim` | 🚫 | `vim` |
Visual Basic | `vbnet` | 🚫 | `visual-basic` |
WebAssembly | `wasm` | 🚫 | `wasm` |
Wiki Markup | 🚫 | ❓ `mediawiki` | `wiki` |
XML | `xml` | | |
Xorg | 🚫 | `xorg` | 🚫 |
XSLT | 🚫 | `xslt` | 🚫 |
YAML | `yaml` | | |
ZSH | `zsh` | | |
I was also considering using @foo123's highlightjs-grammar
Interesting tool... though personally I'm far more interested in integration at runtime, like if perhaps we could all agree on some type of loose intermediate representation or API... such that you could just use "all the highlighters" - and pick your output type/theme (prism/highlightjs/etc) but then not have to care about which engine is doing the actual parsing.
I'd be curious to know if @foo123 just invented this JSON spec from scratch or if it's some type of standard?
It would seem far easier to agree on a simple output API here (token, type, etc) than "one true" unified parser spec...
@RunDevelopment Would you have any thoughts on this? Such as if Prism.js [for example] implemented our own `Emitter` API as an intermediate layer... then one could ask Prism.js to do the parsing but pass it an HLJS Emitter instance, so the final HTML output would be HLJS-style - and all our themes would work, etc... Any interest in this type of cross-compatibility? And of course the opposite would work as well: using Highlight.js grammars easily inside of Prism.js but getting Prism-style output/themes, etc...
The rough API:
https://github.com/highlightjs/highlight.js/blob/main/src/lib/token_tree.js#L104
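To make the shape of that API a bit more concrete, driving it by hand could look roughly like the sketch below. The method names (`openNode`/`addText`/`closeNode`/`finalize`/`toHTML`) are recalled from the linked `token_tree.js` and may differ between versions; it also assumes you can get a reference to the `TokenTreeEmitter` class, which is not part of the documented public surface.

```js
// Rough, untested illustration of the emitter linked above.
// Assumes TokenTreeEmitter is reachable from the linked module; method names
// are approximate and may vary between highlight.js versions.
const emitter = new TokenTreeEmitter({ classPrefix: "hljs-" });
emitter.openNode("keyword"); // open a scope
emitter.addText("return");   // text inside that scope
emitter.closeNode();         // close the scope
emitter.addText(" 42;");     // plain, unscoped text
emitter.finalize();
emitter.toHTML();            // roughly: <span class="hljs-keyword">return</span> 42;
```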
I'd be curious to know if foo123 just invented this JSON spec from scratch or if it's some type of standard?
@joshgoebel The answer is both. As explained here, the format used, BNF+JSON, combines two standards: the BNF format for the grammar specification and the JSON format (which contains the BNF parts) as a versatile and ubiquitous data format. The details of the JSON format (i.e. the names and meanings of the fields) were made up by me based on a conceptual separation of functionalities (i.e. lex model, syntax model, style model, extra settings), but they can be adjusted easily.
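As a purely illustrative example of that conceptual separation, a toy grammar could be organized along these lines. The field names and value shapes here are invented for the sketch and do not reproduce the actual keys of the BNF+JSON spec.

```js
// Toy sketch only: illustrates the lex/syntax/style separation described above.
// Field names and value shapes are invented for this example and are NOT the
// real BNF+JSON specification.
const toyGrammar = {
  lex: {
    comment: { tokens: [["//", null]] },         // line comment delimiter
    number:  "/\\d+(\\.\\d+)?/",                 // a regex-defined token
    keyword: { tokens: ["if", "else", "while"] } // literal keyword tokens
  },
  syntax: {
    // BNF-like composition of the lexical tokens above
    expression: "keyword | number | comment"
  },
  style: {
    // map grammar tokens to the highlighter's styling classes
    comment: "comment",
    number:  "number",
    keyword: "keyword"
  }
};
```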
any interest in this type of cross-compatibility
Honestly, yes. I've already seen multiple projects having to make multiple grammars for different syntax highlighters. I think it would be a win for the wider ecosystem if the two largest JS syntax highlighters supported the 3 most common grammar formats (TextMate grammars, HighlightJS grammars, PrismJS grammars).
We are currently working on Prism v2 and that includes a new custom tokenization API. This API is very simple. It's essentially just a function `(code: string) => TokenStream` (simplified), so as long as you can make a wrapper around a parser to convert its foreign AST/tokens into a Prism token stream, it will be supported. It should be possible to make an adapter for TextMate grammars and HighlightJS grammars. So from the token side of things, Prism will be able to support pretty much any parser/grammar format with the right adapter.
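For example, such a wrapper might look roughly like the sketch below, assuming the simplified `(code) => TokenStream` signature above, a token stream made of plain strings and `{ type, content }` objects, and a hypothetical `foreignParser` with its own node shape.

```js
// Hedged sketch: adapt a foreign parser to the simplified (code) => TokenStream
// signature described above. `foreignParser` and its node shape are hypothetical;
// the { type, content } token shape is assumed from the description.
function wrapForeignParser(foreignParser) {
  return (code) => {
    const stream = [];
    for (const node of foreignParser.parse(code)) {
      if (node.scope == null) {
        stream.push(node.text);                               // unhighlighted text stays a string
      } else {
        stream.push({ type: node.scope, content: node.text }); // highlighted token
      }
    }
    return stream;
  };
}
```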
Themes are a different beast, though. Both HighlightJS and TextMate use scopes to assign colors. Prism does not. So while it should be possible to approximately map HighlightJS/TextMate scopes to Prism class names, the opposite direction will be harder.
However, it's not impossible. Prism has standard tokens with clearly defined semantics. So just mapping these standard tokens to scopes will do 80% of the job. Now for the remaining 20%. While tokens can have any number of class names, I would map these class names to scopes as follows:
Not a perfect approach, but it should be a good start.
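Just to illustrate the general shape of such a mapping (this is not the mapping referred to above, only a toy example with a few obvious entries; the `title.function` scope name is an assumption about HLJS v11-style naming):

```js
// Toy illustration only -- not the mapping referred to in the comment above.
// Maps a Prism token's class names to a single scope, falling back to the
// first class name for non-standard tokens.
const CLASS_TO_SCOPE = {
  keyword: "keyword",
  string: "string",
  comment: "comment",
  number: "number",
  function: "title.function" // assumption: HLJS v11-style scope name
};

function classNamesToScope(classNames) {
  for (const name of classNames) {
    if (CLASS_TO_SCOPE[name]) return CLASS_TO_SCOPE[name];
  }
  return classNames[0]; // unknown/non-standard token: pass it through unchanged
}
```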
This API is very simple. It's essentially just a function (code: string) => TokenStream (simplified)
Is this API still in development, and is feedback welcome? I already shared the API we use for this (internally); I'd love to see what your idea looks like (interface-wise).
On our side, implementing this means tying these two interfaces together. I.e., to produce "Prism" brand highlighting we'd have a custom Emitter that built a TokenStream... then its `to_html` method would need a way to ask Prism to turn that TokenStream into HTML.
Is this API still in development, is feedback welcome?
Still under development (current API), and yes :) The basic idea is that Prism's tokens, token streams, and `Prism.tokenize` are already part of the public API. Custom tokenizers are essentially just a way to make your own `Prism.tokenize`. This allows grammars to choose their own tokenizer.
However, the user-facing API remains the same. Users still just call `Prism.tokenize` (or any of the higher-level functions) and this function invokes the custom tokenizer (if any). So custom tokenizers are a complete implementation detail and do not affect how grammars are used.
@RunDevelopment Is your full list of scopes/class names documented anywhere now? I think we'd just need to work on a mapping table of our scopes to your class names.
Prism.tokenize are already part of the public API.
And the TokenStream API stable then? If so (from our side) I'm not sure we'd even need the per-grammar support... I imagine we could provide a wrapper on our end:
```js
let javaPrism = hljs.wrapPrism(prism.languages.java)
hljs.registerLanguage("java", javaPrism)
```
And to our own grammars we'd add some sort of `parseYourself` (your tokenize) that took an emitter object as input... so any grammar could entirely replace the HLJS parser with a custom parser - only using our emitting pipeline.
And internally we'd do something like:
```js
// call Prism.tokenize
// create an empty HLJS TokenTree
// loop over the Prism TokenStream, feeding it into our own TokenTree
// call `to_html` on our own token tree
```
All the magic happens inside the loop, where we'd have to deal with conversion of your class names into our scopes. Does that sound about right?
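Sketched out (untested, reusing the toy `classNamesToScope` mapping idea from above; the emitter method names approximate the `TokenTreeEmitter` linked earlier and may differ by version), that loop might look like:

```js
// Untested sketch of the loop above: walk a Prism TokenStream (strings and
// { type, content } tokens, possibly nested) and feed it into an HLJS
// emitter-like object. Emitter method names are approximate.
function prismStreamToHTML(tokenStream, emitter) {
  const walk = (nodes) => {
    for (const node of nodes) {
      if (typeof node === "string") {
        emitter.addText(node);                            // plain text
        continue;
      }
      emitter.openNode(classNamesToScope([node.type]));   // Prism type -> HLJS scope
      if (typeof node.content === "string") {
        emitter.addText(node.content);
      } else {
        walk(node.content);                               // nested token stream
      }
      emitter.closeNode();
    }
  };
  walk(tokenStream);
  emitter.finalize();
  return emitter.toHTML();
}
```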
Not a perfect approach, but it should be a good start.
Could you possibly provide some actual examples for your 20% edge cases? When would you have multiple or non-standard tokens? Is this just because you'd allowed your grammars to get a little wild with what they are tagging things?
@RunDevelopment Any chance we could talk you into adding some actual metadata to your own grammars in v2, like language names, common extensions, etc.? :-)
Otherwise it's probably going to look more like this, so we can add back critical metadata:
```js
let p = hljs.wrapPrism(prism.languages.csharp, {
  name: ".NET C#",
  aliases: ["cs"]
})
hljs.registerLanguage("csharp", p)
```
It'd be nice if you just pre-packaged it. :)
Is your full list of scopes/class names documented anywhere now?
Yes, and no. We have documented our standard tokens that are widely supported by themes and used in our languages, but there are also non-standard tokens.
Could you possibly provide some actual examples for your 20% edge cases? When would you have multiple or non-standard tokens? Is this just because you'd allowed your grammars to get a little wild with what they are tagging things?
Prism themes work using CSS specificity. So by adding more non-standard tokens, you gain more fine control over your highlighting.
One example would be keywords. Many modern IDEs highlight control-flow keywords differently. In Prism, you can just give control-flow keywords an additional `control-flow` class name, and themes that support the `.keyword.control-flow` selector will then highlight your control-flow keywords differently. Example.
Even more fine-grained control can be obtained by adding more non-standard tokens, but most languages don't do that because even one non-standard token is already very specific.
And the TokenStream API stable then?
Yes. I believe the API has been stable since v1.0 and there are no plans to change it in v2.
I imagine we could provide a wrapper on our end
That's also what I was thinking for Prism. We would just need a function that wraps HLJS grammars into a custom tokenizer and that's it. Ideally, that function would even be another package completely independent of Prism. Instead of just adding support for a few grammar formats as part of the core library, I'd rather have independent packages that use Prism's API to support different parsers. This would enable people to write third party adapters for their own parsers using the same APIs.
All the magic happens inside the loop, where we'd have to deal with conversion of your class names into our scopes. Does that sound about right?
Yup, that sounds like it'll work :)
Any chance we could talk you into adding some actual metadata to your own grammars in v2, like language names, common extensions, etc.? :-)
We will ship aliases in v2 (example), but no titles. The thing with titles is that they aren't that useful. We only have one plugin that needs titles, and it just has a generated mapping with all titles.
What does HLJS use titles for anyway?
What does HLJS use titles for anyway?
User convenience and completeness? LOL. I think they got added because I wanted to use them in the UI of a project.
The thing with titles is that they aren't that useful.
You're not wrong. 🤔
@RunDevelopment And it works. :-) About 30 lines of code without a proper style name mapper, plus a 2-3 line patch to HLJS itself to allow grammars to provide their own tokenizers.
Is there some good way to enumerate all the languages? Just `Object.keys` against `Prism.languages`? I'm guessing perhaps this all changes with v2 though? (Sorry, I haven't fully been following that discussion.)
See also #3620. Laying out all the pieces for anyone who wants to play with this.
Is there some good way to enumerate all the languages?
Not yet. Prism v2 will have a `components` property, which is a registry. Adding some function to enumerate all languages/plugins wouldn't be a bad thing.
However, that only gets you all currently loaded components (=languages + plugins) of that Prism instance. So there might be other instances with more. We could expose our auto-generated list of all languages, but I'd rather not because that doesn't include third-party components. It's also highly version specific.
I guess I was imagining perhaps a "simplest thing possible" API like:
```js
importToHLJS(Prism, hljs)
```
Where that just asks Prism for the "supported/loaded/installed" languages and then loops over them, wraps them, and injects them into HLJS.
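A minimal sketch of that idea, assuming the hypothetical `hljs.wrapPrism()` helper discussed earlier and a Prism v1-style `Prism.languages` object:

```js
// Minimal sketch, assuming the hypothetical wrapPrism() helper from earlier
// and a v1-style Prism.languages object (which also holds utility functions
// such as extend/insertBefore, hence the typeof check).
function importToHLJS(Prism, hljs) {
  for (const [name, grammar] of Object.entries(Prism.languages)) {
    if (typeof grammar !== "object") continue;            // skip non-grammar helpers
    hljs.registerLanguage(name, hljs.wrapPrism(grammar, { name }));
  }
}
```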
Ah, in that case, enumerating all languages should be enough.
Closing via #3620. If those engines can be run in a browser's JavaScript runtime, then someone could indeed hook them up to us via #3620. So it's easy to do now (on our side); it just remains for someone else to put in the elbow grease to make it happen.
Brilliant idea or terrible idea? Would this enhance the ecosystem or destroy it?