Strip HTML tags (but keep any text content) when rendering text

mntn-xyz commented 3 years ago

Fixes #6

mntn-xyz commented 3 years ago

I changed it a bit to ensure that text content inside HTML blocks is rendered, even though blocks are currently rendered with tags and all. Apparently gomarkdown does not strip the tags from the content of HTML blocks; it only does this for span elements. (It actually sets the value of Literal to the full text content, including tags, and then nulls out Content.)

I think we could just strip out the tags from the block content using https://github.com/grokify/html-strip-tags-go; the issue of untrusted data is not important since clients should not be rendering HTML tags or JavaScript for gemtext! bluemonday could be used if additional sanitization is desired, but this is a heavier solution.
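To make the idea concrete, here is a minimal sketch of what tag stripping on a block's literal text amounts to. It uses only the standard library; `stripTags` is a hypothetical, naive stand-in and not the html-strip-tags-go implementation, which handles more edge cases.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTags is a hypothetical, naive stand-in for a real tag stripper.
// It drops everything from '<' up to the next '>' and keeps the rest.
// It does not handle '>' inside attribute values, comments, or a stray '<'.
func stripTags(s string) string {
	var b strings.Builder
	inTag := false
	for _, r := range s {
		switch {
		case r == '<':
			inTag = true // start of a tag: stop copying
		case r == '>' && inTag:
			inTag = false // end of a tag: resume copying
		case !inTag:
			b.WriteRune(r) // ordinary text content
		}
	}
	return b.String()
}

func main() {
	// Literal text as it might appear in an HTML block, tags included.
	fmt.Println(stripTags("<p>Paragraph 2.</p>"))
	fmt.Println(stripTags("<code>Entire paragraph.</code>"))
}
```

Since gemtext clients are not expected to interpret markup, a simple pass like this only has to produce readable text; it is not meant as a sanitizer.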

tdemin commented 3 years ago

I am not sure if this is how it should behave..?

Source file for reproduction:

# Blockquote test

> Testing text with an <b>HTML</b> tag.
> Another line of <pre>testing text</pre>.

<code>Entire paragraph.</code>

<p>Paragraph 2.</p>

Test of <b>inline spans</b>.

> <code>Line of text.</code>

Test with fix to #6

mntn-xyz commented 3 years ago

I went ahead and implemented tag stripping for HTML blocks using html-strip-tags-go. I also made methods for HTMLBlock and HTMLSpan for consistency, and because I've got some ideas for them later (namely detecting tags like sup/sub and converting them to the proper ast type).

mntn-xyz commented 3 years ago

OK, after some deeper investigation, I've discovered the following:

HTMLBlock does not correspond exactly with HTML "block" elements. Although only HTML "block" type elements (<p>, <blockquote>, etc) will be rendered as HTMLBlocks, this alone is not sufficient. The block must also begin at the very start of a line of text (no leading spaces!) and end at the very end of a line of text. It's really just a top-level ast element like a paragraph or list; if you add something outside the block but on the same line, it becomes a paragraph containing HTMLSpans.

HTMLSpan does not correspond with HTML inline elements; it just indicates a single tag within an ast container element. Hello <b>world</b> represents a paragraph containing TWO spans (one for the opening <b> and one for the closing </b>).

This is a single HTMLBlock:

<p>HTML block</p>

This is also a single HTMLBlock:

<p>HTML
block</p>

This is also a single HTMLBlock!

<p>HTML block</p><p>Same HTML block!</p>

This becomes a paragraph containing two HTMLSpans:

<p>HTML span</p>Extra text

This is somewhat disappointing, as I was hoping to be able to easily get the contents of span tags and modify them as needed based on the tag, and also to skip rendering the content of some tags (<script> etc). I will implement some of this at the block level, but I'll have to just naively strip out HTMLSpans for now.
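A rough sketch of what the block-level handling could look like, assuming the block's literal text (tags included) is available as a string: drop elements whose content should never be rendered, then strip the remaining tags. The names `blockText` and `skippedTags` are hypothetical and the regular expressions are deliberately naive; this is not the code in this PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// skippedTags lists elements whose content should not be rendered at all.
// The exact list is an assumption made for this sketch.
var skippedTags = []string{"script", "style"}

// anyTag matches a single opening or closing tag.
var anyTag = regexp.MustCompile(`<[^>]*>`)

// blockText removes skipped elements together with their content,
// then strips every remaining tag and trims surrounding whitespace.
func blockText(literal string) string {
	s := literal
	for _, name := range skippedTags {
		// (?is): case-insensitive, and '.' also matches newlines.
		re := regexp.MustCompile(`(?is)<` + name + `\b[^>]*>.*?</` + name + `\s*>`)
		s = re.ReplaceAllString(s, "")
	}
	s = anyTag.ReplaceAllString(s, "")
	return strings.TrimSpace(s)
}

func main() {
	fmt.Println(blockText("<p>HTML block</p><script>alert(1)</script>"))
	fmt.Println(blockText("<p>HTML block</p><p>Same HTML block!</p>"))
}
```

This only works because a block carries its whole literal text in one piece. With one HTMLSpan per tag, the text between an opening and closing tag lives in sibling nodes, which is why spans can only be stripped naively.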

mntn-xyz commented 3 years ago

I added some tests, and it looks like there is a problem with HTML blocks inside blockquotes.

~I also need to handle tag escaping properly inside HTML blocks (\
, <br>).~ Edit: in other markdown implementations this doesn't seem to be used

mntn-xyz commented 3 years ago

Even more tests and some fixes for hard breaks. One thing I have found from testing is that we probably need to be unescaping HTML escapes like &lt; (unescaped: <). I did this in the HTML block but it looks like it needs to be done for regular text as well. It's going to be tricky because of inline code spans and backslash escapes.

Still have to fix the issue with HTML blocks inside blockquotes. The issue is that somehow the block is being duplicated, once with tags stripped and once as a blockquote without tags stripped.

mntn-xyz commented 3 years ago

Should be done now.

tdemin / gmnhg

Strip HTML tags (but keep any text content) when rendering text #33