quotes not rendered correctly in excerpt

michaelburch commented 2 years ago

When using @Html.Raw(document.GetString("Excerpt")) to display excerpt content on an archive page, as in the simple-archive example, quotes in the HTML are displayed as their encoded value, &quot;.

This began with Statiq.Web 1.0.0-beta.35 and continues today with 1.0.0-beta.42.

Repro here: https://github.com/michaelburch/Statiq.Web

Example when using Statiq.Web 1.0.0-beta.34:

[Screenshot: simple-archive-beta 34]

Example when using Statiq.Web 1.0.0-beta.35+:

[Screenshot: simple-archive-beta 35]

daveaglick commented 2 years ago

Thanks for the repro - I can replicate on the Statiq examples page too as you noted. Looking into this now.

Note to self for context: this corresponds with Statiq Framework 1.0.0-beta.50, which is when all the Statiq.Html modules were moved into core. It's likely that some AngleSharp behavior regressed at that point (not in AngleSharp directly, more like in how it's being used).

daveaglick commented 2 years ago

This one is getting interesting. I actually can't reproduce with the raw GenerateExcerpt module - it's leaving the HTML in the excerpt exactly as it sees it, including quotes and other escapable content. Likewise, Html.Raw() is still working as expected too (i.e. it's not doing the escaping).

One possibility is that Markdig is actually doing the encoding before the excerpt is even generated. It was updated around the time this problem started. But that doesn't make complete sense either because even if that were the case, I'd expect the entity encoding just to flow right through and be rendered correctly in the browser. It's like it's being double-encoded (or at least I'll guess the ampersand is).

Still investigating, but the easiest answer, that it's the excerpt module, appears to be out. It's likely some combination of modules in the Statiq Web pipeline, so I'll need to do some integration testing to get to the bottom of it. More to come.

daveaglick commented 2 years ago

So my first hunch was correct, and is at least partially responsible - the RenderMarkdown module (and thus Markdig) is encoding the quotes when it renders Markdown content:

So when the document gets to the GenerateExcerpt module it's okay and contains encoded quotes, but that's valid HTML:

But then by the time AngleSharp has parsed the HTML content inside GenerateExcerpt to find the excerpt content, we've double-escaped the ampersand:
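To make the three stages above concrete, here is a minimal, language-neutral sketch in Python. It is not Statiq, Markdig, or AngleSharp code: html.escape and html.unescape from the Python standard library stand in for the Markdown renderer, the DOM serializer, and the browser, and the sample sentence is made up for illustration.

```python
import html

# Hypothetical sample content; not taken from the repro site.
markdown_text = 'She said "hello"'

# Stage 1: the Markdown renderer encodes quotes as character references.
# This is valid HTML and would display correctly on its own.
rendered = html.escape(markdown_text, quote=True)

# Stage 2: a DOM pass that treats the character reference as plain text
# has to escape its "&" when serializing, which double-encodes it.
reserialized = html.escape(rendered, quote=False)

# Stage 3: the browser decodes character references exactly once,
# so the reader sees the literal text "&quot;" instead of a quote.
displayed = html.unescape(reserialized)

print(rendered)      # She said &quot;hello&quot;
print(reserialized)  # She said &amp;quot;hello&amp;quot;
print(displayed)     # She said &quot;hello&quot;
```

The last line matches the symptom in the original report: the page shows &quot; where a quotation mark was expected.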

Now that I know where the problem is, it should be fairly simple to fix.

daveaglick commented 2 years ago

...and now I know why it's happening and why it changed. This is an unfortunate regression caused by my attempts to deal with an annoying problem with @ encoding. The Razor engine uses @ as the delimiter for C# instructions, so sometimes we want a literal @ to reach Razor. But other times, like when I use @ inside a Markdown document for something like an email address or Twitter handle, we don't want a literal @, because when Razor gets it, it'll interpret it as an instruction delimiter. So in those cases we have to encode the @. And there's the problem: some @ are encoded and others aren't.

To preserve which is which when we need to do DOM processing with AngleSharp (like getting an excerpt), I told AngleSharp not to "consume" character references and to treat them like text. But then AngleSharp gets all smart and sees the & of a character reference, says "oh, this was just text so I need to encode that &", and does so. And so we end up with double encoding.
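The trade-off described here can be sketched with a toy model in Python. Everything in it is an assumption for illustration: parse_consuming, parse_preserving, and serialize are hypothetical stand-ins for the two parser modes and the serializer, not AngleSharp APIs, and the sample line with an encoded email @ next to a Razor-style @Model.Title is invented.

```python
import html

# Hypothetical input: the email "@" is encoded so Razor ignores it,
# while the "@" in "@Model.Title" is meant to be a Razor delimiter.
source = 'Mail me&#64;example.com, title is @Model.Title'


def parse_consuming(text: str) -> str:
    """Toy parser that decodes character references into characters."""
    return html.unescape(text)


def parse_preserving(text: str) -> str:
    """Toy parser that leaves character references as plain text."""
    return text


def serialize(text_node: str) -> str:
    """Toy serializer: escapes &, < and > in text, as HTML output requires."""
    return html.escape(text_node, quote=False)


# Consuming references: both "@" come out literal, so Razor can no
# longer tell the email address from a real instruction.
consumed = serialize(parse_consuming(source))

# Preserving references: the distinction survives, but the "&" of the
# reference is escaped again, which is the double encoding in this issue.
preserved = serialize(parse_preserving(source))

print(consumed)   # Mail me@example.com, title is @Model.Title
print(preserved)  # Mail me&amp;#64;example.com, title is @Model.Title
```

Neither mode is right on its own in this model, which is why the regression was easy to introduce while fixing the @ problem.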

(BTW - I know that was a lot, just wanted to document what's going on in case I ever end up back here)

daveaglick commented 2 years ago

Fix confirmed:

I'll get a release out sometime this weekend. Thanks again for reporting this, turns out to have been a pretty major bug lurking around in the background!

statiqdev / Statiq.Web

quotes not rendered correctly in excerpt #981