Open daveaglick opened 5 years ago
Turns out this wasn't Markdig at all. Looks like the problem is isolated to the docs recipe and is caused by AutoLink
(so possibly AngleSharp related).
Using AngleSharp, this code:
HtmlParser parser = new HtmlParser();
using (Stream stream = new MemoryStream(Encoding.UTF8.GetBytes(@"<html><head></head><body><?# foo /?></body></html>")))
{
    IHtmlDocument htmlDocument = parser.Parse(stream);
    using (StringWriter writer = new StringWriter())
    {
        htmlDocument.ToHtml(writer, HtmlMarkupFormatter.Instance);
        writer.Flush();
        writer.ToString().Dump();
    }
}
produces this output:
<html><head></head><body><!--?# foo /?--></body></html>
Looks like this behavior is related to https://github.com/AngleSharp/AngleSharp/issues/609
Specifically, @FlorianRappl's comment, which relates to IE conditional comments but probably also applies to XML processing instructions in the HTML:
Conditional comments are IE only constructs and not specified. As such fully HTML5 compliant parsers will parse them like that.
Which makes sense standards-wise but doesn't help get the shortcodes through template processing. Going to need to figure out a way to preserve them when doing HTML manipulation with AngleSharp.
This is a general problem. You have HTML5-invalid markup and want it to be HTML5 parsed (hence the HTML5 error correction steps in and takes over). I think there are at least 2 ways out:
Not sure if the latter is possible (seems like these are some fixed constructs).
Maybe we could also hack in (optionally available) processing instructions into AngleSharp. They would be disabled by default.
Happy to receive PRs on the topic!
Thanks for the quick response @FlorianRappl! The behavior makes sense now that I understand what's going on. Even though they're valid SGML and XML, processing instructions aren't indicated in the HTML5 spec. The syntax is actually arbitrary - I chose one that looks like processing instructions because it needs to "fall through" various template engines and it's enough of a gray area standards-wise that specs like CommonMark even have specific rules about them.
I also agree with your mitigation suggestions.
Post-process the output to swap <!--? and ?--> for <? and ?>. That seems relatively risk-free since I don't envision that syntax showing up legitimately (but I guess you never know). I can also implement it quickly, which is important since this is out in the wild now. Maybe a good first step until a more correct solution is ready.
It looks like AngleSharp already has a ProcessingInstruction node and I can create and add processing instructions to the document with IHtmlDocument.CreateProcessingInstruction() (which interestingly output uncommented), so hopefully it's just a matter of adding parsing support with an option to turn on.
I'll create a new issue in AngleSharp as a feature request to document that I'm working on it.
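For reference, here is a minimal sketch of the quick post-processing workaround, assuming it runs on the serialized HTML string after AngleSharp has written it out. The ShortcodeRestorer class and its Restore method are hypothetical names used only for illustration; they are not part of AngleSharp or of the docs recipe.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not an AngleSharp API): undoes the comment wrapping
// that the HTML5 parser applies to <? ... ?> constructs.
public static class ShortcodeRestorer
{
    // Matches comments of the form <!--? ... ?-->. The lazy group captures
    // everything between the opening "<!--?" and the first closing "?-->".
    private static readonly Regex CommentedShortcode =
        new Regex(@"<!--\?(.*?)\?-->", RegexOptions.Singleline);

    // Rewrites each <!--? ... ?--> back to <? ... ?> and leaves
    // ordinary comments untouched.
    public static string Restore(string html)
    {
        return CommentedShortcode.Replace(html, "<?$1?>");
    }
}
```

Calling ShortcodeRestorer.Restore on the output shown earlier, <html><head></head><body><!--?# foo /?--></body></html>, gives back <html><head></head><body><?# foo /?></body></html>. One caveat: an HTML5 parser ends the generated comment at the first > it encounters, so a shortcode that itself contains a > would likely be split before this step runs and would not be restored correctly by this sketch.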
Essentially makes shortcodes impossible to use on a page with either code fences or inline backticks