s9e / TextFormatter

Text formatting library that supports BBCode, HTML and other markup via plugins. Handles emoticons, censors words, automatically embeds media and more.
MIT License
232 stars 36 forks source link

Include highlight.js in markdown only configurations #206

Closed n-peugnet closed 1 year ago

n-peugnet commented 1 year ago

Currently on Flarum, when the BBCode extension is disabled and only the Markdown one is enabled, highlight.js is not loaded. This seems strange as the code blocks generated by Markdown have the right highlight-* classes. It seems the only missing piece is loading and executing the highlight JS library.

JoshyPHP commented 1 year ago

The Markdown flavour (Litedown) used in the library is meant to produce simple markup, consistent with other Markdown alternatives. The language- class is in line with the specs' suggestion that it be used to indicate which kind of code it is:

There is no formal way to indicate the language of computer code being marked up. Authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, can use the class attribute, e.g. by adding a class prefixed with "language-" to the element.

I prefer to leave it to each application to decide what kind of highlighter they want to use, if any at all. In Flarum's case, I think it'd be a good idea to copy the default BBCode's template and actually remove the script element from it and manually output a copy of it once per page. As it is, it loads/executes hljs-loader as many times as there are code blocks, which is suboptimal.

n-peugnet commented 1 year ago

I prefer to leave it to each application to decide what kind of highlighter they want to use, if any at all. In Flarum's case, I think it'd be a good idea to copy the default BBCode's template and actually remove the script element from it and manually output a copy of it once per page.

So, should this issue be moved to https://github.com/flarum/framework?

JoshyPHP commented 1 year ago

Yes. Flarum has a repository for the Markdown extension but it's read-only so as I recall, all the bug reports and feature requests go to the main repository. Add a link back to this page so they automatically link each other and I'll subscribe to the Flarum issue.

n-peugnet commented 1 year ago

To solve this issue (among other things), I made a server-side code highlighting Flarum extension.

I initially wanted to make the full highlight at the parsing stage, but I stumbled on a problematic case: when the code block is inside a markdown blockquote, the innerText named parameter contains the > prefixes, so I can't use it as the source to produce the highlighted code blocks. Instead I fell back to producing the highlighted code at rendering stage, but using a cache.

As I would rather do the highlight at parsing stage, do you have any idea if it would be possible to get in a tag filter, at parsing time, a named parameter that would only contain the inner text, stripped of any formatting or ignored content (I mean the content that gets wrapped in <i></i>, <s></s> or <e></e> tags later on)?

Also I would be interested in a (very) quick review of my extension's code if you have some time, as I may have missed some useful feature of your library (at first I reimplemented the innerText named parameter myself :facepalm:).

JoshyPHP commented 1 year ago

There's a lot to unpack and I'm not sure where to put it so I'll just dump it here.

Wrt doing things during parsing, you wouldn't be able to do it via a tag filter. At the time the filter processes the CODE tag, you don't know what's inside because the XML is being constructed and there's no telling what will be inside. All you can get is the unparsed innerText as you saw. You'd have to do that kind of thing after parsing. I would generally advise against storing HTML in the parsed text, but if you do you should store it in an attribute, assuming the size limit on DOM attributes allows it.

I think the safest, most reliable, backwards- and forward- compatible way to handle this use case would be to perform the highlighting at rendering time. Not by modifying the XML, but rather by post-processing the HTML. You can usually find code blocks using XPath with something like //pre/code but you may more efficiently perform the replacement using string functions and a regexp such as (<pre[^>]*+><code[^>]*+>\K[^<]++(?=</code></pre>)).

With that said, I think it's easier to simply leave syntax highlighting to JavaScript.

n-peugnet commented 1 year ago

Thank you for your detailed answer!

Wrt doing things during parsing, you wouldn't be able to do it via a tag filter. At the time the filter processes the CODE tag, you don't know what's inside because the XML is being constructed and there's no telling what will be inside. All you can get is the unparsed innerText as you saw.

OK, this is what I thought.

I would generally advise against storing HTML in the parsed text, but if you do you should store it in an attribute

This is indeed what I planned to do. I had the idea to use a Flarum event to process the content of the post after the parsing is fully done, which allowed me to really save the highlighted blocks in the database and leave the job to insert it based on template to TextFormatter. But I quickly realized that it inflated the comments a lot, due to the code block content being doubled and formatted using escaped HTML tags :sweat:.

I think that I will keep my current approach, which is calculating a hash for the block ID at parsing time, then highlight the block at rendering time but using a cache and passing it as an attribute for TextFormatter to insert it non-escaped.

With that said, I think it's easier to simply leave syntax highlighting to JavaScript.

It is definitely easier, but I must say I really think that the current trend of discharging a lot of processing to the client is quite sad. In this case, it means that all the readers will have to highlight all the blocks each time they read it, when it would be possible to highlight them only once on the server side.