highlightjs / highlight.js

JavaScript syntax highlighter with language auto-detection and zero dependencies.
https://highlightjs.org/
BSD 3-Clause "New" or "Revised" License

Provide access to a parse tree as an alternative to HTML output #1086

Closed derhuerst closed 4 years ago

derhuerst commented 8 years ago

I've just had a quick look at highlight.js and didn't find any discussion on this. So forgive me if this is not the right place or if this has already been discussed.

I'd like to propose to make highlight.js more generally usable, or at least pull out the core highlighting logic to make it independent from the output format. Just like the famous Pygments has formatters, the HTML generation would be done by a separate module/package.

highlight.js seems to be the de facto standard for syntax highlighting, even among projects not related to HTML/browsers. See the slap editor and hicat for examples: both use it and then manually walk/manipulate the HTML generated by highlight.js.

In the world of lightweight do-one-thing NPM packages, it would make sense to split highlight.js into two parts, one generating a treeish object representation (possibly streaming), and the second generating HTML from it.

isagalaev commented 8 years ago

This idea has been hanging in the air for a while. It's never been a priority as it lacked real use cases, but now that you've mentioned those two it makes very real sense. Another problem is that the one or two attempts at doing this were part of bigger (and controversial) refactorings of the whole core library, so they never got merged. I will probably get to it myself eventually, but if you'd fancy giving it a try, I could give some guidance.

derhuerst commented 8 years ago

I will probably get to it myself eventually, but if you'd fancy giving it a try, I could give some guidance.

I'm having difficulties in understanding the inner workings of highlight.js, both because there is not a lot of documentation/comments and because the code is written in a really imperative jump-around way.

Also, it seems like HTML generation is really baked into the core of highlight.js.

isagalaev commented 8 years ago

Yes, it was deliberately done so in the past to squeeze the code size as much as possible. Currently this "optimization target" is much less desirable and that's why I'm all for refactoring this.

I'll do a short writeup about the architecture of the core to make it more comprehensible.

derhuerst commented 8 years ago

I'll do a short writeup about the architecture of the core to make it more comprehensible.

This sounds great, especially since there seems to be no render-agnostic syntax highlighter in JavaScript land right now.

isagalaev commented 8 years ago

I'll do a short writeup about the architecture of the core to make it more comprehensible.

A status update on this… Before writing anything, I decided to first do a pretty significant upgrade to the parser. I had been planning it for a long time anyway, but a few bugs that came up lately bumped its importance, and incidentally I've finally come up with an actual plan for how to implement it. There'd be no point in documenting the current core, as it will change immediately after that.

(If interested, the change in question is one I described in this blog post, under the "Complex modes" section.)

dbkaplun commented 8 years ago

Author of slap text editor here. I was going to open a new ticket for this but looks like @derhuerst phrased it very well.

highlight.js has many other uses than highlighting in the browser. It would be great if parsing was decoupled from HTML generation. See also slap-editor/editor-widget#134. We are considering porting our current highlighting implementation using highlight.js, in combination with HTML manipulation with cheerio, to another option that would not require any HTML manipulation such as lowlight (which itself depends on highlight.js but does not seem to perform any HTML manipulation).

Glad to help out in any way if this feature request moves forward!

okonet commented 7 years ago

I'm using https://github.com/wooorm/lowlight and writing my RTF formatter ATM, so I'd be interested in support for an AST out of the box. I think this would allow adding different formatters as Pygments does.

joshgoebel commented 5 years ago

Just like the famous Pygments has formatters, the HTML generation would be done by a separate module/package.

I think our HTML stuff has some unique benefits, but if there were two pieces people could always choose to use the one we bundled or replace it with a different one.

I've been digging into the parser a lot, so I'll keep this in mind.

@derhuerst Any idea what that parsed but not HTML format might look like?

derhuerst commented 5 years ago

@derhuerst Any idea what that parsed but not HTML format might look like?

Usually you annotate Abstract Syntax Trees (ASTs) generated by a parser with additional fields, e.g. highlight.js: {color: '#123123', bold: true}.

The specific mechanism to add this info of course depends on the specific AST format being used. The most well-known ones are ESTree, used by Mozilla, and the Babel AST format (an extension of ESTree). There's also the unist AST format, which tries to support generic trees of markup, e.g. for HTML and for Markdown.

All of these AST formats have dozens or even hundreds of libraries built around them, supporting all kinds of use cases, from parsing to transformations to formatting.

joshgoebel commented 5 years ago

If we add this support to highlight.js natively, do we kill the whole lowlight project?

wooorm commented 5 years ago

Yes, probably! Although, highlight.js's main focus is to create an HTML string, whereas lowlight's main focus is the syntax tree. So I imagine highlight’s API to still be mostly the same, with new functions added on top, whereas lowlight's internals will change but its API will stay the same?

joshgoebel commented 5 years ago

Well sure, we wouldn't change our API. We'd likely add a new method or two to return some sort of node tree instead of HTML... and maybe some method to turn a node tree into HTML... then we'd just wrap ourselves by glueing the two together with the old api.
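The layering described here can be sketched roughly as follows. This is only an illustration: `parseToTree`, `treeToHtml`, and `highlight` are hypothetical names, the "parser" is a toy that tags runs of digits, and none of it is highlight.js's actual API.

```javascript
// Hypothetical two-stage layout; names and node shape are assumptions.

// Stage 1: toy stand-in for the parser. It only tags runs of digits.
function parseToTree(code) {
  const children = [];
  const re = /\d+/g;
  let last = 0;
  let m;
  while ((m = re.exec(code)) !== null) {
    if (m.index > last) children.push(code.slice(last, m.index));
    children.push({ kind: "number", children: [m[0]] });
    last = m.index + m[0].length;
  }
  if (last < code.length) children.push(code.slice(last));
  return { root: true, children };
}

function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Stage 2: turn a node tree into an HTML string.
function treeToHtml(node) {
  if (typeof node === "string") return escapeHtml(node);
  const inner = node.children.map(treeToHtml).join("");
  return node.kind ? `<span class="hljs-${node.kind}">${inner}</span>` : inner;
}

// The "old" API is then just the two stages glued together.
function highlight(code) {
  return treeToHtml(parseToTree(code));
}
```

Callers who only want HTML keep calling `highlight`, while anyone who needs the tree can stop after the first stage.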

joshgoebel commented 5 years ago

@wooorm Before building your own parser (which I assume is derived from ours), did you first try to add the functionality to our source? If so, was there a reason you gave up on that approach and went a different way?

wooorm commented 5 years ago

I needed hast, and hast was new (my invention). If I recall correctly, people weren't as interested in ASTs back then, so I rolled my own instead of raising an issue requesting that my own format be supported.

joshgoebel commented 5 years ago

Well I meant the core functionality. Once you had a list of parsed tokens you could turn it into any format pretty easily I’d think.

wooorm commented 5 years ago

The core functionality is the same for highlight.js and lowlight, except in code style, and in one creating HTML and the other a syntax tree. I think my above comment clarifies that no, I did not raise an issue, and gives reasoning for why not? Here is a brief history for lowlight.

joshgoebel commented 5 years ago

The core functionality is the same for highlight.js and lowlight, except in code style,

Except it's not:

https://github.com/highlightjs/highlight.js/pull/2209 https://github.com/highlightjs/highlight.js/pull/2179 https://github.com/highlightjs/highlight.js/pull/2135

All 3 of these will likely bite you eventually unless you are watching closely and porting the same changes to your rewrite... if I were you I wouldn't want to keep up with things like that... that's one of the disadvantages of rewriting vs reusing the core engine.

wooorm commented 5 years ago

Yes, there are differences, and for the last almost-four-years I’ve worked on porting them over.

You assume that I want to maintain lowlight like this for years to come. I’d prefer for highlight to support syntax trees. You already asked this and I responded:

Yes, probably! [...] So I imagine highlight’s API to still be mostly the same, with new functions added on top, whereas lowlight's internals will change but its API will stay the same?

joshgoebel commented 5 years ago

Just making sure you were aware. :-) I might play around with this idea just to see what the hook-in points are... I don't think this would be a difficult thing to do, just need to figure out how to do it nicely. Our existing system isn't very modular. :)

joshgoebel commented 5 years ago

Something like this? Thoughts?

Content:

<p class="normal">
<p class="x{{className}}x">

Output:

<span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">p</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"normal"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">p</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"x</span></span></span><span class="hljs-template-variable">{{className}}</span><span class="xml"><span class="hljs-tag"><span class="hljs-string">x"</span>&gt;</span>
</span>

Parsetree:

{
    "root": true,
    "children": [
        {
            "children": [
                {
                    "tagged": "tag",
                    "children": [
                        "<",
                        {
                            "tagged": "name",
                            "text": "p"
                        },
                        " ",
                        {
                            "tagged": "attr",
                            "text": "class"
                        },
                        "=",
                        {
                            "tagged": "string",
                            "text": "\"normal\""
                        },
                        ">"
                    ]
                },
                "\n",
                {
                    "tagged": "tag",
                    "children": [
                        "<",
                        {
                            "tagged": "name",
                            "text": "p"
                        },
                        " ",
                        {
                            "tagged": "attr",
                            "text": "class"
                        },
                        "=",
                        {
                            "tagged": "string",
                            "text": "\"x"
                        }
                    ]
                }
            ],
            "tagged": "xml",
            "sublanguage": true
        },
        {
            "tagged": "template-variable",
            "text": "{{className}}"
        },
        {
            "children": [
                {
                    "tagged": "tag",
                    "children": [
                        {
                            "tagged": "string",
                            "text": "x\""
                        },
                        ">"
                    ]
                },
                "\n"
            ],
            "tagged": "xml",
            "sublanguage": true
        }
    ]
}

joshgoebel commented 5 years ago

The strange way it splits the tokens across the sublanguage is just how the parser currently works... Seems maybe wrong to me, but the idea here is that XML is a language "in-between Handlebars snippets", not that XML is the primary language and handlebars exists inside that.

The continuations feature is what allows the sublanguage to keep its context while jumping in and out of handlebars blocks.

This might be worth some thought. Since Highlight.js has never before separated the concept of "parse tree" from "output", all it's been concerned about is the styling of the raw output. And this type of weird terminate-then-pick-up-later behavior works perfectly from a styling perspective, though it doesn't describe the "structure" all that well.

wooorm commented 5 years ago

on hljs@~9.15, lowlight gives:

{
  relevance: 4,
  language: 'xml',
  value: [
    {
      type: 'element',
      tagName: 'span',
      properties: { className: [ 'hljs-tag' ] },
      children: [
        { type: 'text', value: '<' },
        {
          type: 'element',
          tagName: 'span',
          properties: { className: [ 'hljs-name' ] },
          children: [ { type: 'text', value: 'p' } ]
        },
        { type: 'text', value: ' ' },
        {
          type: 'element',
          tagName: 'span',
          properties: { className: [ 'hljs-attr' ] },
          children: [ { type: 'text', value: 'class' } ]
        },
        { type: 'text', value: '=' },
        {
          type: 'element',
          tagName: 'span',
          properties: { className: [ 'hljs-string' ] },
          children: [ { type: 'text', value: '"normal"' } ]
        },
        { type: 'text', value: '>' }
      ]
    },
    { type: 'text', value: '\n' },
    {
      type: 'element',
      tagName: 'span',
      properties: { className: [ 'hljs-tag' ] },
      children: [
        { type: 'text', value: '<' },
        {
          type: 'element',
          tagName: 'span',
          properties: { className: [ 'hljs-name' ] },
          children: [ { type: 'text', value: 'p' } ]
        },
        { type: 'text', value: ' ' },
        {
          type: 'element',
          tagName: 'span',
          properties: { className: [ 'hljs-attr' ] },
          children: [ { type: 'text', value: 'class' } ]
        },
        { type: 'text', value: '=' },
        {
          type: 'element',
          tagName: 'span',
          properties: { className: [ 'hljs-string' ] },
          children: [ { type: 'text', value: '"x{{className}}x"' } ]
        },
        { type: 'text', value: '>' }
      ]
    }
  ]
}

It’s a bit verbose, so you can go one of two routes:

  1. Benefit from the hast ecosystem by exposing hast. It has a ton of utilities.
  2. Go for a simpler AST, such as using strings instead of type: text nodes and hljs token names instead of elements (kind: 'attr').

With 2. you’d get (basically what you have):

{
  relevance: 4,
  language: 'xml',
  value: [
    {
      kind: 'tag',
      children: [
        '<',
        {kind: 'name', children: [ 'p' ]},
        ' ',
        {kind: 'attr', children: [ 'class' ]},
        '=',
        {kind: 'string', children: [ '"normal"' ]},
        '>'
      ]
    },
    '\n',
    {
      kind: 'tag',
      children: [
        '<',
        {kind: 'name', children: [ 'p' ]},
        ' ',
        {kind: 'attr', children: [ 'class' ]},
        '=',
        {kind: 'string', children: [ '"x{{className}}x"' ]},
        '>'
      ]
    }
  ]
}

joshgoebel commented 5 years ago

The {{}} is from handlebars... yes it's a lot cleaner if you run it as a single language, but I purposely took the more complex route since a lot of complexity is hidden in our sublanguage stuff. I wonder what your parser kicks out for handlebars (on our latest master branch).

I see no reason to prefer a complex format. Someone can write a plug-in rather quickly that re-codes the tree into more complex formats without much difficulty. For example if you preferred your format you could just wrap the new tree and then generate your preferred AST from it.

Plugins that work "on-top" are trivial because you can just use the global hljs and assign your plugin to it as a function... ie hljs.parseToHast
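As a rough illustration of how small such a re-coding step could be, here is a sketch that converts the simple tree shown earlier in this thread (`tagged` / `text` / `children`, with bare strings for plain text) into hast-style nodes like the ones lowlight prints. The function name `toHast` is hypothetical, and the input shape is assumed from the example above rather than from any released API.

```javascript
// Sketch only: converts the thread's experimental tree into hast-like nodes.
// Returns an array, because the untagged root contributes no wrapper element.
function toHast(node) {
  if (typeof node === "string") {
    return [{ type: "text", value: node }];
  }
  // Leaf nodes carry `text`; container nodes carry `children`.
  const kids = node.text !== undefined ? [node.text] : node.children || [];
  const children = kids.flatMap(toHast);
  if (!node.tagged) return children;
  // Sublanguage wrappers use the bare language name, tokens get the hljs- prefix,
  // mirroring the lowlight output quoted in this thread.
  const className = node.sublanguage ? node.tagged : `hljs-${node.tagged}`;
  return [
    {
      type: "element",
      tagName: "span",
      properties: { className: [className] },
      children,
    },
  ];
}
```

It is written as a standalone function, so it works the same whether it is attached to the global hljs or shipped as a separate package.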

I'm talking about doing this just to clean up the internal structure of the parser... it's just serendipitous that it could easily help with these other use cases.

joshgoebel commented 5 years ago

Why do you prefer children/array vs text for text nodes? I guess it's slightly simpler conceptually.

wooorm commented 5 years ago

The {{}} is from handlebars... yes it's a lot cleaner if you run it as a single language, but I purposely took the more complex route since a lot of complexity is hidden in our sublanguage stuff. I wonder what your parser kicks out for handlebars (on our latest master branch).

I didn’t know that! You didn’t mention that your example used handlebars, so I assumed XML. For handlebars the lowlight result looks the same structurally, but has different nodes. I’m not sure what lowlight spits out on master because it probably has some changes that I need to backport. Feel free to try.

Plugins that work "on-top" are trivial because you can just use the global hljs and assign your plugin to it as a function... ie hljs.parseToHast

People in the Node ecosystem seem to like things to be modular and contained, instead of patched on top of other things and globals. Over the years I’ve found that useful as well. So for lowlight I’ll stick with a separate contained package instead of adding a method.

I'm talking about doing this just to clean up the internal structure of the parser... it's just serendipitous that it could easily help with these other use cases.

That’s okay! I think (as do others in this thread above) that syntax trees are really interesting and allow for cool things. So I’d argue that both are important.


hast is for HTML, and built on top of unist, which is for content and other syntax trees. The goal is different than that of highlight.js. You can do whatever you want, I’m giving two examples.

The reason for having complete nodes instead of just strings is that nodes often have more information, such as positional info. You can read more about hast in its readme, and you may find unifiedjs.com interesting as well!

joshgoebel commented 5 years ago

I didn’t know that! You didn’t mention that your example used handlebars, so I assumed XML. For handlebars the lowlight result looks the same structurally, but has different nodes.

Does it handle sublanguage the same though? Actually you shouldn't need the new changes since all I used was a boring old handlebars tag...

People in the Node ecosystem seem to like things to be modular and contained, instead of patched on top of other things and globals. Over the years I’ve found that useful as well. So for lowlight I’ll stick with a separate contained package instead of adding a method.

Well I was suggesting how to do it more as a "plugin" rather than "its own thing"... like if you wanted to just throw a JS file after Highlight.js loads and 'enhance HLJS'... of course there are a zillion ways to go about it if we exposed some type of low-level API.

hast is for HTML, and built on top of unist, which is for content and other syntax trees. The goal is different than that of highlight.js. You can do whatever you want, I’m giving two examples.

I was merely saying the best way to go (at first) is likely the simplest... and then let people take that and run with it wherever they want - transforming trees is a pretty simple thing to do. And actually it feels more modular to do it that way also rather than forcing more complexity on us than we need... we build a small simple tree that works best for us internally, and then anyone can take it and transform it however they wish.

The reason for having complete nodes instead of just strings is that nodes often have more information, such as positional info.

True, but I'm not sure we're actually equipped to easily provide that type of info anyway at this point. Food for thought, for sure. You may have persuaded me to leave strings as a single child node (when they have a kind) though, vs making them a whole separate type of key.

wooorm commented 5 years ago

Does it handle sublanguage the same though? Actually you shouldn't need the new changes since all I used was a boring old handlebars tag...

If you want to check out how lowlight does it, I suggest using runkit, for example, with this:

var util = require("util")
var lowlight = require("lowlight")

var doc = `<p class="normal">
<p class="x{{className}}x">`

var tree = lowlight.highlight('hbs', doc)

var res = util.inspect(tree, {depth: null})
console.log(res)

I get:

{ relevance: 5,
  language: 'hbs',
  value:
   [ { type: 'element',
       tagName: 'span',
       properties: { className: [ 'xml' ] },
       children:
        [ { type: 'element',
            tagName: 'span',
            properties: { className: [ 'hljs-tag' ] },
            children:
             [ { type: 'text', value: '<' },
               { type: 'element',
                 tagName: 'span',
                 properties: { className: [ 'hljs-name' ] },
                 children: [ { type: 'text', value: 'p' } ] },
               { type: 'text', value: ' ' },
               { type: 'element',
                 tagName: 'span',
                 properties: { className: [ 'hljs-attr' ] },
                 children: [ { type: 'text', value: 'class' } ] },
               { type: 'text', value: '=' },
               { type: 'element',
                 tagName: 'span',
                 properties: { className: [ 'hljs-string' ] },
                 children: [ { type: 'text', value: '"normal"' } ] },
               { type: 'text', value: '>' } ] },
          { type: 'text', value: '\n' },
          { type: 'element',
            tagName: 'span',
            properties: { className: [ 'hljs-tag' ] },
            children:
             [ { type: 'text', value: '<' },
               { type: 'element',
                 tagName: 'span',
                 properties: { className: [ 'hljs-name' ] },
                 children: [ { type: 'text', value: 'p' } ] },
               { type: 'text', value: ' ' },
               { type: 'element',
                 tagName: 'span',
                 properties: { className: [ 'hljs-attr' ] },
                 children: [ { type: 'text', value: 'class' } ] },
               { type: 'text', value: '=' },
               { type: 'element',
                 tagName: 'span',
                 properties: { className: [ 'hljs-string' ] },
                 children: [ { type: 'text', value: '"x' } ] } ] } ] },
     { type: 'element',
       tagName: 'span',
       properties: { className: [ 'hljs-template-variable' ] },
       children: [ { type: 'text', value: '{{className}}' } ] },
     { type: 'element',
       tagName: 'span',
       properties: { className: [ 'xml' ] },
       children: [ { type: 'text', value: 'x">' } ] } ] }

I was merely saying the best way to go (at first) is likely the simplest... and then let people take that and run with it wherever they want - transforming trees is a pretty simple thing to do. And actually it feels more modular to do it that way also rather than forcing more complexity on us than we need... we build a small simple tree that works best for us internally, and then anyone can take it and transform it however they wish.

You should do what makes sense for hljs! I do argue that internally a couple of objects instead of strings won't make much of a performance impact, and may even help with readability of code? Whatever floats your boat! Integrating with a semi-standard like hast does make sense though. One other example is that KaTeX has a similar issue open. So if people start using a standard, it’ll benefit a bunch of people. (But again, whatever works for hljs!)

joshgoebel commented 5 years ago

I do argue that internally a couple of objects instead of strings won't make much of a performance impact, and may even help with readability of code? Whatever floats your boat!

Well I'm probably going to join strings, but I might leave them as children. Right now I have a "cleanup" step that wraps all the "only text" children into "text: " nodes, but I'm actually not sure we need that. I just wrote a quick walk and build API:

  class HTMLBuilder {
    constructor() {
      this.out = ""
    }
    addText(text) {
      this.out += escape(text)
    }
    startNode(node) {
      var className = node.kind || node.sublanguage
      if (node.sublanguage || node.kind) {
        this.out += buildSpan(className, "", true, node.sublanguage)
        node.requiresClose = true
      }
    }
    endNode(node) {
      if (node.requiresClose) {
        this.out += spanEndTag
      }
    }
    value() {
      return this.out;
    }
  }

// ---

    static _walk(builder, node) {
      if (typeof node === "string") {
        builder.addText(node)
      } else if (node.children) {
        builder.startNode(node)
        node.children.forEach((child) => TokenTree._walk(builder, child))
        builder.endNode(node)
      }
      return builder;
    }

If it's really that simple I think I'd prefer to keep it simple and let people add whatever they want on top just by walking the tree. It's possible this is too coupled since I only have the single use case, but seems pretty easy to expand upon later if need be. The addText might be a little specific, really you could do this with only start and endNode... and let the builder figure out what kind of node it was.
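For comparison, the start/end-only variant mentioned at the end might look something like this. It is a sketch under the assumption that nodes are either bare strings or `{kind, children}` objects as in the snippet above; `walk` and `TokenListBuilder` are illustrative names, and the builder here produces a flat token list instead of HTML.

```javascript
// Sketch: a walk that only knows startNode/endNode and leaves it to the
// builder to decide whether a node is text or a container.
function walk(builder, node) {
  builder.startNode(node);
  if (typeof node !== "string" && node.children) {
    node.children.forEach((child) => walk(builder, child));
  }
  builder.endNode(node);
  return builder;
}

// Example builder: flattens the tree into [{kind, text}] pairs, where kind
// is the innermost enclosing kind (or null at the top level).
class TokenListBuilder {
  constructor() {
    this.kinds = [];
    this.tokens = [];
  }
  startNode(node) {
    const current = this.kinds.length ? this.kinds[this.kinds.length - 1] : null;
    if (typeof node === "string") {
      this.tokens.push({ kind: current, text: node });
    } else {
      // Kindless containers inherit the enclosing kind so push/pop stay balanced.
      this.kinds.push(node.kind || current);
    }
  }
  endNode(node) {
    if (typeof node !== "string") this.kinds.pop();
  }
}
```

The trade-off is that every builder now needs its own `typeof node === "string"` check, which is the work `addText` was doing for it.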

wooorm commented 5 years ago

Maybe you can investigate how to integrate with document.createElement, React.createElement, Vue..., etc, to see how well it works in other cases?

joshgoebel commented 5 years ago

That's kind of out of scope, though if you want to play around with it and have questions I'm happy to try and answer them... This just generates the raw HTML (which should be pretty easy to hook into most anything). I was just showing how easy it would be to go from the raw tree to "fully baked" HTML (like we already provide)... that's literally ALL the code someone would have to write to start with the RAW tree and build WHATEVER they wanted. Less than 50 lines. And if we expose the walk API then even less.

When I have it a little cleaned up I'll push a branch or PR for someone who'd like to play around with it a little.

In my experience in the past it's been faster to let the browser parse HTML (it's very good at that)... if someone was building an editor or doing selective updates though that's a whole other ballpark and a very custom use case. I don't know why you'd want to build the HTML one node at a time, but that'd probably be just another 30-40 LOC builder that you could just pass into walk, and boom done... return a HTML Node or whatever you wanted.
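A node-at-a-time builder along those lines could be sketched as below. It assumes the `addText` / `startNode` / `endNode` / `value` builder shape and the `{kind, children}` node shape from the earlier snippet; `DomBuilder` is a hypothetical name, and `doc` stands for anything that provides the standard `createDocumentFragment`, `createElement`, and `createTextNode` methods (in a browser, `document`).

```javascript
// Same walk as in the snippet above, repeated so the sketch is self-contained.
function walk(builder, node) {
  if (typeof node === "string") {
    builder.addText(node);
  } else if (node.children) {
    builder.startNode(node);
    node.children.forEach((child) => walk(builder, child));
    builder.endNode(node);
  }
  return builder;
}

// Sketch: builds DOM nodes directly instead of concatenating an HTML string.
class DomBuilder {
  constructor(doc) {
    this.doc = doc;
    this.root = doc.createDocumentFragment();
    this.stack = [this.root];
  }
  get parent() {
    return this.stack[this.stack.length - 1];
  }
  addText(text) {
    this.parent.appendChild(this.doc.createTextNode(text));
  }
  startNode(node) {
    if (!node.kind) {
      // Untagged containers (such as the root) add no element of their own.
      this.stack.push(this.parent);
      return;
    }
    const span = this.doc.createElement("span");
    span.className = `hljs-${node.kind}`;
    this.parent.appendChild(span);
    this.stack.push(span);
  }
  endNode() {
    this.stack.pop();
  }
  value() {
    return this.root;
  }
}
```

In a browser, `walk(new DomBuilder(document), tree).value()` would return a fragment that can be appended to a code element; no escaping step is needed because text goes through `createTextNode`.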

At this layer of the process though we aren't even necessarily running in a web browser... The new process would look like:

If someone really wanted to do:

... then they could.

joshgoebel commented 5 years ago

I'm not really sure how to expose it though. I don't want to lock us into an API super early... for a version one I think we might just expose a parse function... that would return a native parseTree...

It might or might not include walk... (or we mark it experimental).... in any case in or out it shows that walking the tree is like 10 lines of code.

wooorm commented 5 years ago

That's kind of out of scope, though if you want to play around with it and have questions I'm happy to try and answer them...

It is the scope of this issue, the title is to decouple from HTML, and allow alternatives. That’s why I think it’s good to think about other formats.

Pygments has different formatters as well, as noted earlier. Maybe RTF is a good one to try out if you don’t want to go into (virtual) DOMs?

if someone was building an editor or doing selective updates though that's a whole other ballpark and a very custom use case

I think it’s not really a weird case, as it was mentioned in this thread a couple of times over the years.

joshgoebel commented 5 years ago

Well sure. But there's only so much time. In this case I think the problems here are pretty well understood conceptually and can be solved abstractly. Any type of alternative renderer needs a simple parse tree to walk, regardless of whether the target is Elm or a printed PDF. The goal (in my mind) is to build something people could reasonably build on top of. I can throw something together and push a beta. Someone else in each of those domains needs to come out of the woodwork to test and see how well it works for them. For example, you could test it with your thing to see how easy it is to work with. Etc.

And even if it’s not helpful at all I think it’ll still make the code easier to understand because it separates two huge concerns that are currently very intermingled but shouldn’t be.

Pygments has different formatters as well, as noted earlier. Maybe RTF is a good one to try out if you don’t want to go into (virtual) DOMs?

Not really a matter of want. And again, a lot of this hinges on the fact that at most all you’d need is a translation layer between the raw parse tree and a more cooked parse tree that RTF or whatever desired. It shouldn’t be much more code than the HTML builder unless I’m really missing something.

I think it’s not really a weird case, as it was mentioned in this thread a couple of times over the years.

Other maintainers have said pretty clearly we can’t support that though. We’re designed to render static code blocks, not serve as the basis for a live code editor. Not saying it’s impossible or anything just that you’re really pushing the limits and we can’t be responsible if you blow yourself up or it doesn’t work as well as you would hope.

joshgoebel commented 5 years ago

Some type of auto-diffing shadow DOM thing might be cool though for the editor people. But I think you could actually do that already with just the HTML output too, so I’m not sure this would be a huge improvement in that category.

okonet commented 5 years ago

Maybe RTF is a good one to try out if you don’t want to go into (virtual) DOMs?

I actually implemented an RTF formatter using highlight.js and ace.js and could share that (it's OSS). Check out https://github.com/okonet/codestage/tree/master/lib. It's decoupled from the implementation but relies on HTML for now.

The worst and hardest problem when using HTML output as the input is to match CSS selectors. I had to pull in JSDOM just to do that job since the actual implementation is quite complex.

If I'd had an AST of the highlighted source, it would be super straight-forward to implement. The actual RTF renderer is just this: https://github.com/okonet/codestage/blob/5cf13e2ba80188b0bfbead8f6d0626e4839c0afa/lib/src/index.js#L76-L110

joshgoebel commented 5 years ago

The worst and hardest problem when using HTML output as the input is to match CSS selectors.

Aren't they just simple strings? I mean you'd have to parse the HTML again to get nodes to walk, but once you had those it's just "hljs-variable" or such, no? In the AST they'd be "variable"... same thing without the prefix.

okonet commented 5 years ago

Yes, but I don’t need nodes, I need computed styles of the node to transform it to RTF instructions. Matching nodes and getting computed styles are quite hard if you’re running outside of the browser. That’s why the most reliable way was to stick with JSDOM. The simpler approach worked for simple themes (one selector per node), but not for more complex ones. Also some themes are relying on CSS cascade so simple matching using 2 ASTs (html and css) didn’t work 🤷🏻‍♂️

joshgoebel commented 5 years ago

Oh. I get it. You have to write a render engine too with RTF to figure out style and color. That’s probably impossible to do fully without parsing CSS fully and all its insanity. Maybe JSDOM does that accurately though? Are there no simple HTML to RTF thingies?

joshgoebel commented 5 years ago

I don’t see how the AST we’d build here would help you then. It would only include classes. You’d still have to figure out how CSS would apply those classes dynamically in the real world, no?

okonet commented 5 years ago

Maybe you’re right. I’d hope the AST would not include classes but rather semantics of the tokens. The renderer would then either apply class names to DOM nodes or use a different technique like RTF instructions. Having class names in the AST won’t help, yes.

okonet commented 5 years ago

JSDOM is the best you can get but I’m wondering if relying on it is the right approach. To me, since this project does parsing and “knows” what nodes are representing, why should it rely on DOM at all. This is a rendering target to me. That said, I realize this would probably impose some limitations on the current theming approach and will definitely be a breaking change. But I’d be glad to assist in any way with the RTF renderer or anything else if that’s interesting.

wooorm commented 5 years ago

@okonet You seem to be interested in the link between CSS and the HTML nodes, right? Which I don’t think is solved by this issue/PR indeed: even when tokens here wouldn’t represent the DOM, they still aren’t “linked” to their styles.

That’s a hard problem! 🤷‍♂️

joshgoebel commented 5 years ago

I’d hope the AST would not include classes but rather semantics of the tokens.

All we have are class names, that's all the semantics we have - the styles from the styling. If you were targeting RTF or something similar you'd have to define your own "style" and then decide what each of the classes represented - we do have a fixed and documented list.

To me, since this project does parsing and “knows” what nodes are representing, why should it rely on DOM at all.

We don't, we only use it for some of the tests to sort of make sure the "browser" payload seems to work in a browser like environment. I don't think I was suggesting it in any other usage. We're pretty happy with having 0 requirements for the client-side I think.

You wouldn't need it if you just consumed the parse-tree directly (and define your own styles).

But I’d be glad to assist in any way with the RTF renderer or anything else if that’s interesting

Sounds like a great plug-in idea to me.

Which I don’t think is solved by this issue/PR indeed: even when tokens here wouldn’t represent the DOM, they still aren’t “linked” to their styles

Yeah, you'd just have the class, not the actual styling... so you'd have one or two "paper" themes (not sure what other targets you'd use RTF for, but that could be my lack of knowledge here)... I think most of our themes wouldn't look very good on paper in any case.
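
To make the "define your own styles" idea concrete, here is a minimal sketch that renders a token tree to ANSI terminal colours instead of HTML. It assumes the tree uses plain strings for leaves and `{ kind, children }` objects for inner nodes; the `ansiTheme` table, the `tree` data and `renderAnsi` are hypothetical and not part of highlight.js.

```javascript
// Hypothetical style table: class name without the "hljs-" prefix -> ANSI colour code.
const ansiTheme = { keyword: "35", string: "32", number: "33", comment: "90" };

// Hand-built stand-in for a parse tree (assumed shape, not real library output).
const tree = {
  children: [
    { kind: "keyword", children: ["return"] },
    " ",
    { kind: "string", children: ['"ok"'] },
    ";",
  ],
};

// Recursively render a node: strings pass through, styled nodes get wrapped
// in an ANSI colour sequence followed by a reset.
// Note: the reset also clears any outer colour, so nested styled nodes would
// need a smarter scheme than this sketch provides.
function renderAnsi(node, theme) {
  if (typeof node === "string") return node;
  const inner = node.children.map((child) => renderAnsi(child, theme)).join("");
  const code = theme[node.kind];
  return code ? `\x1b[${code}m${inner}\x1b[0m` : inner;
}

const ansiOutput = renderAnsi(tree, ansiTheme);
console.log(ansiOutput);
```

An RTF renderer would follow the same pattern with a different style table and different wrapping sequences; no CSS or DOM is involved at any point.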

joshgoebel commented 4 years ago

Latest update:

https://github.com/highlightjs/highlight.js/pull/2404

joshgoebel commented 4 years ago

This is merged into master now. I wouldn't call this a 100% public API just yet but you can get the parse tree object now with:

result.emitter.root

Or consider the walk API, Ex: result.emitter.walk(builder). Check out token_tree.js. You can also replace the whole emitter with:

configure({__emitter: CustomEmitter });

Though that's definitely flagged as beta. It seems a reasonable extension point to me though, hence trying it out. Again, see token_tree.js for rough notes on the emitter API.
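
A minimal sketch of the walk approach, assuming the tree uses plain strings for leaves and `{ kind, children }` objects for inner nodes, and that a builder is driven through `addText`, `openNode` and `closeNode`. The `sampleRoot` data, the standalone `walk` function and `FlatTokenBuilder` are hypothetical stand-ins so the snippet runs without highlight.js; check token_tree.js for the real shape before relying on any of it.

```javascript
// Hand-built stand-in for result.emitter.root (assumed shape, not real output).
const sampleRoot = {
  children: [
    { kind: "keyword", children: ["const"] },
    " answer = ",
    { kind: "number", children: ["42"] },
    ";",
  ],
};

// Standalone depth-first walk, mirroring what a walk(builder) call is assumed to do.
function walk(builder, node) {
  if (typeof node === "string") {
    builder.addText(node);
  } else if (node.children) {
    builder.openNode(node);
    node.children.forEach((child) => walk(builder, child));
    builder.closeNode(node);
  }
  return builder;
}

// Hypothetical builder that flattens the tree into { kind, text } tokens.
class FlatTokenBuilder {
  constructor() {
    this.kinds = []; // stack of the kinds of the currently open nodes
    this.tokens = [];
  }
  openNode(node) {
    this.kinds.push(node.kind);
  }
  closeNode() {
    this.kinds.pop();
  }
  addText(text) {
    // Use the kind of the innermost open node; the root has none, so fall back to null.
    const kind = this.kinds[this.kinds.length - 1] ?? null;
    this.tokens.push({ kind, text });
  }
}

const tokens = walk(new FlatTokenBuilder(), sampleRoot).tokens;
console.log(tokens);
```

If the real tree matches these assumptions, the same builder object could be handed to `result.emitter.walk(...)` instead of the local `walk` function.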

Very curious to know if that helps any of you. CC @wooorm

wooorm commented 4 years ago

Awesome, I'll check it out! While it's beta, can we somehow have a contract that it would only change in minor / major versions, and not in patches?

wooorm commented 4 years ago

It works well, it’s a bit hard to figure out without docs, and https://github.com/highlightjs/highlight.js/issues/2522 and https://github.com/highlightjs/highlight.js/issues/2523 are blockers for upgrading lowlight. Thanks @yyyc514.

What I’m missing is a way to extend highlight.js inside lowlight with options, while not affecting highlight.js for other consumers (e.g., someone expecting a string). Right now I must get the current configuration, then configure, and then finally restore the previous configuration again.

joshgoebel commented 4 years ago

it’s a bit hard to figure out without docs

Well docs are always better, sure, but I did try to write the code and APIs clearly... if you have any specific thoughts on that I'd love to hear them.

I think all you'd need to understand is the emitter API (it's a few methods) and then the tree structure itself... see its toJSON method. You either want to be an emitter yourself or you want to walk the tree afterwards.
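
For the "be an emitter yourself" route, a rough sketch of what such a class might look like. The method names (`addText`, `addKeyword`, `openNode`, `closeNode`, `closeAllNodes`, `addSublanguage`, `finalize`, `toHTML`) are an assumption based on the emitter notes in token_tree.js and should be verified against the version in use; `RecordingEmitter` itself is hypothetical and is driven by hand here rather than by highlight.js.

```javascript
// Hypothetical emitter that records the calls it receives instead of building HTML.
class RecordingEmitter {
  constructor(options = {}) {
    this.options = options;
    this.events = [];
  }
  addText(text) {
    this.events.push(["text", text]);
  }
  // A keyword is treated as a node that wraps a single piece of text.
  addKeyword(text, kind) {
    this.openNode(kind);
    this.addText(text);
    this.closeNode();
  }
  openNode(kind) {
    this.events.push(["open", kind]);
  }
  closeNode() {
    this.events.push(["close"]);
  }
  closeAllNodes() {}
  addSublanguage(emitter, name) {
    this.events.push(["sublanguage", name]);
  }
  finalize() {}
  // A non-HTML consumer may have nothing useful to return here.
  toHTML() {
    return "";
  }
}

// Drive the emitter by hand, the way a highlighter might while scanning `if (x)`.
const emitter = new RecordingEmitter();
emitter.addKeyword("if", "keyword");
emitter.addText(" (x)");
emitter.finalize();
console.log(emitter.events);
```

An emitter like this sees tokens as they are produced, so it suits streaming-style consumers; walking the finished tree is simpler when the whole structure is needed at once.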

joshgoebel commented 4 years ago

const instance = HLJS({});
instance.newInstance = HLJS.bind(null, {});

// export an "instance" of the highlighter
export default instance;

Would this get the job done? Just call hljs.newInstance() for a completely isolated instance of the run-time.
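
To illustrate why a factory gives isolation, here is a toy sketch. `createHighlighter` is a hypothetical stand-in for the real `HLJS` factory, and `getOptions` exists only for this demonstration; the point is that each call closes over its own options object, so configuring one instance cannot leak into another.

```javascript
// Toy stand-in for the HLJS factory: every call creates fresh, private state.
function createHighlighter(initialOptions) {
  const options = { classPrefix: "hljs-", ...initialOptions };
  return {
    configure(userOptions) {
      Object.assign(options, userOptions);
    },
    // Hypothetical accessor, here only so the isolation can be observed.
    getOptions() {
      return { ...options };
    },
  };
}

// The shared default instance, plus a way to mint isolated ones.
const sharedInstance = createHighlighter({});
sharedInstance.newInstance = () => createHighlighter({});

// A consumer such as lowlight could configure its own copy freely.
const privateInstance = sharedInstance.newInstance();
privateInstance.configure({ classPrefix: "" });

console.log(sharedInstance.getOptions().classPrefix);  // unchanged default
console.log(privateInstance.getOptions().classPrefix); // customised value
```

This would remove the need to read the current configuration, override it, and restore it afterwards.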

joshgoebel commented 4 years ago

OR if you're building from source just pull in the RAW ES6 modules and then call HLJS({}) yourself to get your own instance... I'm not sure if using the built version of the libraries is a MUST for what you're doing or not.