Closed michaelburch closed 2 years ago
Thanks for the repro - I can replicate on the Statiq examples page too as you noted. Looking into this now.
Note to self for context: this corresponds with Statiq Framework 1.0.0-beta.50, which is when all the Statiq.Html modules were moved into core. It's likely some behavior of AngleSharp may have regressed at that point (not in AngleSharp directly, more like in how it's being used).
This one is getting interesting. I actually can't reproduce with the raw GenerateExcerpt module - it's leaving the HTML in the excerpt exactly as it sees it, including quotes and other escapable content. Likewise, Html.Raw() is still working as expected too (i.e. it's not doing the escaping).
One possibility is that Markdig is actually doing the encoding before the excerpt is even generated. It was updated around the time this problem started. But that doesn't make complete sense either, because even if that were the case, I'd expect the entity encoding to just flow right through and be rendered correctly in the browser. It's like it's being double-encoded (or at least I'd guess the ampersand is).
Still investigating, but the easiest answer, that it's the excerpt module, appears to be out. It's likely some combination of modules in the Statiq Web pipeline, so I'll need to do some integration testing to get to the bottom of it. More to come.
So my first hunch was correct, and is at least partially responsible - the RenderMarkdown module (and thus Markdig) is encoding the quotes when it renders Markdown content:
So when the document gets to the GenerateExcerpt module it's okay and contains encoded quotes, but that's valid HTML:
But then by the time AngleSharp has parsed the HTML content inside GenerateExcerpt to find the excerpt content, we've double-escaped the ampersand:
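For anyone following along, here is a minimal, language-neutral sketch of those three stages in Python. It is not Statiq, Markdig, or AngleSharp code; the standard-library `html.escape` and `html.unescape` calls stand in for the renderer, the re-encoding step, and the browser, and the sample sentence is made up.

```python
import html

markdown_text = 'She said "hello"'

# Stage 1: the Markdown renderer entity-encodes the quotes (still valid HTML).
rendered = html.escape(markdown_text, quote=True)

# Stage 2: a later step treats that HTML as plain text and encodes it again,
# so the ampersand that starts each entity is itself turned into &amp;.
double_encoded = html.escape(rendered, quote=True)

# A browser decodes character references exactly once when displaying text.
shown_single = html.unescape(rendered)
shown_double = html.unescape(double_encoded)

print(rendered)        # markup after one round of encoding
print(double_encoded)  # markup after two rounds of encoding
print(shown_single)    # what the reader sees for single-encoded markup
print(shown_double)    # what the reader sees for double-encoded markup
```

Single-encoded markup displays as the original quotes, while double-encoded markup displays the literal entity text, which matches the symptom in the report.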
Now that I know where the problem is, it should be fairly simple to fix.
...and now I know why it's happening and why it changed. This is an unfortunate regression caused by my attempts to deal with an annoying problem with @ encoding. The Razor engine uses @ as the delimiter for C# instructions. So sometimes we want @ to be a literal. But other times, like when I use @ inside a Markdown document for something like an email address or Twitter handle, we don't want @ to be a literal because when Razor gets it, it'll interpret that as an instruction delimiter. So in those cases we have to encode the @. And there's the problem: some @ are encoded and others aren't. To preserve which is which when we need to do DOM processing with AngleSharp (like getting an excerpt), I told AngleSharp not to "consume" character references and to treat them like text. But then AngleSharp gets all smart and sees the & of a character reference, says "oh, this was just text so I need to encode that &", and does so. And so we end up with double encoding.
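A toy model of that trade-off, again in Python rather than the real AngleSharp API: `round_trip` and its `consume_character_references` flag are hypothetical names invented for this sketch, and the fragment is assumed to be text-only so the "parser" has nothing to do except decide whether to decode character references.

```python
import html


def round_trip(fragment: str, consume_character_references: bool) -> str:
    """Parse a text-only HTML fragment into a text node, then serialize it."""
    if consume_character_references:
        # Normal parsing: character references are decoded into characters.
        text_node = html.unescape(fragment)
    else:
        # References are kept verbatim, so the text node contains a raw '&'.
        text_node = fragment
    # Serializing a text node escapes '&', '<' and '>' (quotes are left alone).
    return html.escape(text_node, quote=False)


fragment = 'me&#64;example.com said &quot;hi&quot;'

consumed = round_trip(fragment, consume_character_references=True)
preserved = round_trip(fragment, consume_character_references=False)

print(consumed)   # the encoded @ has become a literal @ that Razor would see
print(preserved)  # every reference now starts with &amp;
```

With consumption on, the deliberately encoded @ is lost; with it off, the serializer escapes the leading & of every reference. Under these assumptions, neither setting alone preserves the markup as written.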
(BTW - I know that was a lot, just wanted to document what's going on in case I ever end up back here)
Fix confirmed:
I'll get a release out sometime this weekend. Thanks again for reporting this, turns out to have been a pretty major bug lurking around in the background!
When using @Html.Raw(document.GetString("Excerpt")) to display excerpt content on an archive page, as in the simple-archive example, HTML quotes are displayed as their encoded value, &quot;. This began with Statiq.Web 1.0.0-beta.35 and continues today with 1.0.0-beta.42.
Repro here: https://github.com/michaelburch/Statiq.Web
Example when using Statiq.Web 1.0.0-beta.34:
Example when using Statiq.Web 1.0.0-beta.35+: