Shortcodes/XML processing instructions escaped when code is on the page in docs recipe

daveaglick commented 5 years ago

Essentially makes shortcodes impossible to use on a page with either code fences or inline backticks

daveaglick commented 5 years ago

Turns out this wasn't Markdig at all. Looks like the problem is isolated to the docs recipe and is caused by AutoLink (so possibly AngleSharp related).

daveaglick commented 5 years ago

Using AngleSharp, this code:

// Parse a minimal document whose body contains a shortcode written as a processing instruction.
HtmlParser parser = new HtmlParser();
using (Stream stream = new MemoryStream(Encoding.UTF8.GetBytes(@"<html><head></head><body><?# foo /?></body></html>")))
{
    IHtmlDocument htmlDocument = parser.Parse(stream);
    using (StringWriter writer = new StringWriter())
    {
        // Serialize the parsed DOM back to markup.
        htmlDocument.ToHtml(writer, HtmlMarkupFormatter.Instance);
        writer.Flush();
        // Dump() is LINQPad's output helper.
        writer.ToString().Dump();
    }
}

produces this output:

<html><head></head><body><!--?# foo /?--></body></html>

daveaglick commented 5 years ago

Looks like this behavior is related to https://github.com/AngleSharp/AngleSharp/issues/609

Specifically, @FlorianRappl's comment, which is about IE conditional comments but probably also applies to XML processing instructions in HTML:

Conditional comments are IE only constructs and not specified. As such fully HTML5 compliant parsers will parse them like that.

Which makes sense standards-wise but doesn't help get the shortcodes through template processing. Going to need to figure out a way to preserve them when doing HTML manipulation with AngleSharp.

FlorianRappl commented 5 years ago

This is a general problem. You have HTML5-invalid markup and want it to be HTML5 parsed (hence the HTML5 error correction steps in and takes over). I think there are at least 2 ways out:

Do some preprocessing and potentially postprocessing (e.g., strip them out / replace them with a comment field, and later put them back in)
Change the syntax to be compatible with HTML5, e.g., using custom components

Not sure if the latter is possible (seems like these are some fixed constructs).

Maybe we could also hack in (optionally available) processing instructions into AngleSharp. They would be disabled by default.

Happy to receive PRs on the topic!
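
As an illustration of the first option (preprocess, then postprocess), here is a minimal C# sketch. It is not code from AngleSharp or Statiq: the `Protect` and `Restore` helpers and the `<!--shortcode:N-->` placeholder format are made up for this example, and it assumes the HTML processing step leaves ordinary comments in place.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: swap each <?# ... ?> shortcode for a numbered HTML
// comment placeholder and remember the original text in `store`.
static string Protect(string html, List<string> store) =>
    Regex.Replace(html, @"<\?#.*?\?>", m =>
    {
        store.Add(m.Value);
        return $"<!--shortcode:{store.Count - 1}-->";
    }, RegexOptions.Singleline);

// Hypothetical helper: after HTML processing, put the original shortcodes back.
static string Restore(string html, List<string> store) =>
    Regex.Replace(html, @"<!--shortcode:(\d+)-->",
        m => store[int.Parse(m.Groups[1].Value)]);

var saved = new List<string>();
string input = "<body><?# foo /?></body>";

string safe = Protect(input, saved);
// ... parse and serialize `safe` with an HTML5 parser here ...
string output = Restore(safe, saved);

Console.WriteLine(safe);   // <body><!--shortcode:0--></body>
Console.WriteLine(output); // <body><?# foo /?></body>
```

Because the placeholder is a plain comment, an HTML5 parser has nothing to correct, and the shortcode text itself never passes through the parser.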

daveaglick commented 5 years ago

Thanks for the quick response @FlorianRappl! The behavior makes sense now that I understand what's going on. Even though they're valid SGML and XML, processing instructions aren't indicated in the HTML5 spec. The syntax is actually arbitrary - I chose one that looks like processing instructions because it needs to "fall through" various template engines and it's enough of a gray area standards-wise that specs like CommonMark even have specific rules about them.

I also agree with your mitigation suggestions.

The easiest thing to do would be to perform a simple text replace after processing, swapping <!--? and ?--> for <? and ?>. That seems relatively risk-free since I don't envision that syntax showing up legitimately (but I guess you never know). I can also implement it quickly, which is important since this is out in the wild now. Maybe a good first step until a more correct solution is ready.
I considered using something that looks more like HTML initially but discounted it because I was worried one or another template engine would get a little handsy with the syntax. I also didn't want shortcodes as a concept to be too tightly tied to HTML - they can be used outside HTML (such as in plain text documents) and aesthetically the processing instruction syntax looks more general-purpose.
I think a better long-term solution is to build in support for parsing processing instructions within AngleSharp. Even though they're not technically valid HTML, I think as a special case they're ambiguous enough that it makes sense to have that as an option. And more pragmatically, they're used in various web publishing pipelines where AngleSharp might be used, and being able to pass them through adds value in these scenarios. For example, you could pre-process PHP files with AngleSharp if this were available. I'll take a look at implementing this. I see there's already a ProcessingInstruction node, and I can create and add processing instructions to the document with IHtmlDocument.CreateProcessingInstruction() (which, interestingly, are output uncommented), so hopefully it's just a matter of adding parsing support with an option to turn it on.

I'll create a new issue in AngleSharp as a feature request to document that I'm working on it.
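
For reference, the simple text replace from the first point could look roughly like the sketch below. `RestoreShortcodes` is a hypothetical helper name, not something that exists in Statiq or AngleSharp, and the input string is the serialized output shown earlier in this thread.

```csharp
using System;

// Hypothetical helper: undo the comment wrapping that the HTML5 serializer
// applies to processing-instruction-style shortcodes.
static string RestoreShortcodes(string html) =>
    html.Replace("<!--?", "<?").Replace("?-->", "?>");

string serialized = "<html><head></head><body><!--?# foo /?--></body></html>";
Console.WriteLine(RestoreShortcodes(serialized));
// prints: <html><head></head><body><?# foo /?></body></html>
```

The trade-off is the one already noted: any genuine comment that happens to start or end with a question mark would be rewritten as well.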

statiqdev / Statiq.Docs

Shortcodes/XML processing instructions escaped when code is on the page in docs recipe #19