poppastring / dasblog-core

The original DasBlog reimagined with ASP.NET Core
MIT License
473 stars 199 forks

HTML Content not being displayed correctly in blog post #544

Closed manzanotti closed 3 years ago

manzanotti commented 3 years ago

I've created a blog post using OLW (Open Live Writer), and it contained some HTML for rendering code in a pleasing manner.

I originally guessed that, because post content is stored as XML with the HTML encoded to make that possible, converting the post for display on the page also decodes any encoded HTML within it.

However, debugging through the code, this isn't the case. The content is being correctly decoded, and I can see that encoded HTML in my post is still in that state (e.g. &lt;div). But when I view the page, the div isn't in the code block, and if I look at the page source, it appears as <div. Additionally, if I look at the DOM in the Dev Tools, it has turned that div into an HTML element.

To make things even more confusing, if I edit the post, TinyMCE both shows the HTML correctly, and has it as &lt;div in its View Source mode.

Any ideas as to what is happening here, and how I can fix it?

manzanotti commented 3 years ago

Ok, done a bit more digging, and ended up in the PostContentTagHelper class, and the first thing that class does is:

HttpUtility.HtmlDecode(Post.Content);

which, obviously, takes any HTML-encoded text and turns it into HTML elements. This is what is turning my HTML-encoded code into actual HTML.

I must admit to not understanding why this is being done. Post.Content already appears to have decoded the content from the XML file, so it looks like valid HTML to me.

I'm assuming that there's a good reason for this, but I'll admit that it's eluding me at the moment. I've tried removing the HtmlDecode, and the blog post renders correctly (and hasn't decoded my intentionally HTML-encoded text).

So, the question is, why is the post content being HTML-decoded here?

poppastring commented 3 years ago

@manzanotti

Just so that I am clear it sounds like you are attempting to display sample code, possibly HTML, CSS or some other language and are looking to ensure that when it is saved and subsequently displayed it does not get translated by the parser so that it just becomes regular HTML text.

Ok, so the fundamental premise of dasblog is that we store everything in XML format, which means that we have to ensure that all text is XML safe. For example, "<" gets translated to &lt;. Unfortunately, that applies to everything, including HTML you intend to simply output as is, and this tends to make HTML, CSS, JavaScript, etc. really hard to output as non-HTML text.
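As a rough illustration of that XML-safe storage step (a sketch only, not DasBlog's actual serialiser), Python's standard `xml.etree.ElementTree` shows the same behaviour: markup placed in an element's text is escaped on the way out and restored on the way back in. The `Content` element name here is a hypothetical stand-in, not taken from the real dayentry schema.

```python
import xml.etree.ElementTree as ET

# Post HTML as the author wrote it: real markup plus one intentionally
# escaped "<div" that should be shown to readers as text.
post_html = '<pre><span>&lt;div</span></pre>'

# "Content" is a hypothetical element name used only for this sketch.
entry = ET.Element("Content")
entry.text = post_html

# Serialising escapes every "<", ">" and "&" so the text is XML safe.
xml_text = ET.tostring(entry, encoding="unicode")

# Deserialising undoes exactly one level of escaping.
restored = ET.fromstring(xml_text).text

print(xml_text)
print(restored)
```

The round trip is lossless on its own: the intentionally escaped `&lt;div` becomes `&amp;lt;div` in the file and comes back as `&lt;div`.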

Most blogging systems, including dasblog-core, expect devs to use a combination of <pre> and <code> tags (along with a good dose of supporting CSS and JavaScript) to get what you want in this situation.

I have been checking out EnlighterJS as a great option (although I write less code these days): https://github.com/EnlighterJS/EnlighterJS/

EnlighterJS is an open source syntax highlighter written in JavaScript. Basically, it lets you write a blog post and mark the text that you want treated as code, whether HTML, C++ or otherwise. No matter what translation dasblog-core does, it will look for anything in the <pre> and <code> blocks and attempt to make it look like the code you want.

I hope this helps.

manzanotti commented 3 years ago

Apologies, there was an intermediate post that I deleted once I worked that last post out, and I realise that I missed out a bunch of information!

Yes, I am attempting to show some HTML in my blog post. I'm using a site called http://hilite.me/ to turn my html into highlighted html that can be displayed on a web page. Here's an example:

<!-- HTML generated using hilite.me -->
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
    <pre style="margin: 0; line-height: 125%">
        <span style="color: #007700">&lt;div</span>
        <span style="color: #0000CC"> id=</span>
        <span style="background-color: #fff0f0">&quot;testId&quot;</span>
        <span style="color: #007700">&gt;</span>
        This is a test<span style="color: #007700">&lt;/div&gt;</span>
    </pre>
</div>

You can see that it is using a pre tag here. This gets HTML-encoded when serialised to XML:

&lt;!-- HTML generated using hilite.me --&gt;&lt;div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"&gt;&lt;pre style="margin: 0; line-height: 125%"&gt;&lt;span style="color: #007700"&gt;&amp;lt;div&lt;/span&gt; &lt;span style="color: #0000CC"&gt;id=&lt;/span&gt;&lt;span style="background-color: #fff0f0"&gt;&amp;quot;testId&amp;quot;&lt;/span&gt;&lt;span style="color: #007700"&gt;&amp;gt;&lt;/span&gt;This is a test&lt;span style="color: #007700"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

So my HTML-encoded &lt;div is serialised as &amp;lt;div. When I put a breakpoint in the PostContentTagHelper class and load the post, Post.Content looks like this:

<!-- HTML generated using hilite.me -->
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
    <pre style="margin: 0; line-height: 125%">
        <span style="color: #007700">&lt;div</span>
        <span style="color: #0000CC"> id=</span>
        <span style="background-color: #fff0f0">&quot;testId&quot;</span>
        <span style="color: #007700">&gt;</span>
        This is a test<span style="color: #007700">&lt;/div&gt;</span>
    </pre>
</div>

So you can see that HTML content that I want displayed is still HTML-encoded.

The method then continues, and puts Post.Content through the HttpUtility.HtmlDecode method, after which the content looks like this:

<!-- HTML generated using hilite.me -->
<div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;">
    <pre style="margin: 0; line-height: 125%">
        <span style="color: #007700"><div</span>
        <span style="color: #0000CC"> id=</span>
        <span style="background-color: #fff0f0">"testId"</span>
        <span style="color: #007700">></span>
        This is a test<span style="color: #007700"></div></span>
    </pre>
</div>

And now my HTML-encoded text has been decoded into HTML, and the browser treats it as an html element.

So, to me, it looks like the sequence of events is:

  1. &lt;div - original input
  2. &amp;lt;div - serialised into XML
  3. &lt;div - deserialised from XML
  4. <div - second Html-decode

It's point 4 that appears to be causing the problem. As I said, I couldn't work out what that call to HtmlDecode is actually there for, as Post.Content has already been Html-decoded when deserialising the XML.
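The four steps can be reproduced outside DasBlog. This is only a sketch: Python's `html.escape` and `html.unescape` stand in for the XML serialiser and for `HttpUtility.HtmlDecode`, on the assumption that they behave the same way for the `<`, `>` and `&` characters involved here.

```python
import html

# 1. What the author typed into the post source: an escaped "<div".
original_input = "&lt;div"

# 2. Serialising to XML escapes the "&" again.
serialised = html.escape(original_input, quote=False)

# 3. Deserialising from XML removes that one level of escaping.
deserialised = html.unescape(serialised)

# 4. The extra decode at render time removes the author's own escaping.
second_decode = html.unescape(deserialised)

for step, value in enumerate(
    [original_input, serialised, deserialised, second_decode], start=1
):
    print(step, value)
```

Steps 2 and 3 cancel out, so the value at step 3 matches the original input; only step 4 changes what reaches the browser.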

I had started checking out Prism.js, but I'll have a look at Enlighter too. However, it'd be great to know if that HtmlDecode is truly needed, or whether it's a bug.

Hope that all makes sense now!

poppastring commented 3 years ago

@manzanotti I have looked at this from multiple angles and I believe you are correct. It appears that I am HtmlDecoding twice, which, for posts that contain HTML, can create problems and cause the page rendering to fail.

I am going to go ahead and create this PR as I want to test it today and roll it back if I find any issues.

@shanselman FYI on this thread just in case I have forgotten something fundamental that we inherited from DasBlog

manzanotti commented 3 years ago

Glad to help out, and not to be going mad!

poppastring commented 3 years ago

I plan to add an option that continues to allow double decoding (not default). This may be one of those scenarios where our obligation to the past and desire for the future requires work to bring into alignment.

This will require a site.config and associated classes, and in turn updates to the Admin page.

shanselman commented 3 years ago

@poppastring let's test this in hanselman-prod next week when we meet? (unless you want to meet sooner) as this feels weird to me, and given I use precode in wlw and this has and does work TODAY I'm very concerned this will break something.

poppastring commented 3 years ago

@shanselman My results are certainly mixed, and I have no intention of changing 15-year-old blog posts. I am fine with reverting this; our obligation would then be to create guidance on a good way for folks to add lines of code in a blog post.

manzanotti commented 3 years ago

@shanselman You're right, this issue feels wrong. If you get a chance, could you possibly attach an xml file of one of your posts with code in it? I can then run that locally, and just see if I'm catching some weird edge case.

shanselman commented 3 years ago

Here you go

content.zip

manzanotti commented 3 years ago

@shanselman Thanks for those, however this issue is purely around displaying HTML in a post, rather than C# and Javascript.

I've searched your blog for a suitable example, and is there any chance you could send me the XML file for this post?

https://www.hanselman.com/blog/email-signature-etiquette-with-outlook-2007-appropriate-flair

It was posted 2007-04-10, if that makes it easier to find?

shanselman commented 3 years ago

Here you go 2007-04-10.dayentry.xml.zip

poppastring commented 3 years ago

FYI: @shanselman @manzanotti

Just to bring this to a close I have rolled back the option to remove Double Decoding.

The recommendation is to use a tool like EnlighterJS or SyntaxHighlighter and let it do the highlighting for you. This will help reduce the complexity of attempting to mix HTML and non-HTML inside a <pre> or <code>.

manzanotti commented 3 years ago

@poppastring Sorry, I have been looking into the issue, but the kids are off school for the Easter holiday, and I haven't been able to get to the bottom of it.

I've taken the XML content from @shanselman, and it does not display correctly for me, because of the double decoding issue. On the one hand, that makes me happy, as I have no idea how any HTML could be displayed as code with the double decoding going on. But on the other hand, @shanselman does not have this issue on his site.

Scott's:

[Image: Hanselman blog render]

From the page source:

<pre>&lt;a href="<a href="http://feeds.feedburner.com/ScottHanselman&quot;">http://feeds.feedburner.com/ScottHanselman"</a>&gt;<br>  &lt;img border="0" alt="Scott Hanselman's Blog" <br>   src="<a href="http://feeds.feedburner.com/ScottHanselman.gif&quot;&gt;">http://feeds.feedburner.com/ScottHanselman.gif"&gt;</a><br>&lt;/a&gt;</pre>

Mine:

[Image: My blog render - double decode]

<pre><a href="<a href="http://feeds.feedburner.com/ScottHanselman"">http://feeds.feedburner.com/ScottHanselman"</a>><br>  <img border="0" alt="Scott Hanselman's Blog" <br>   src="<a href="http://feeds.feedburner.com/ScottHanselman.gif">">http://feeds.feedburner.com/ScottHanselman.gif"></a><br></a></pre>

Somehow, on Scott's site the encoded HTML gets through the double decoding still as encoded HTML, whereas the same XML doesn't on mine.

So, I'm left with the conclusion that there is something strange with my setup, though given that I get the issue both on my dev laptop and on my hosted blog, I'm at a complete loss as to how I go about finding what that issue is!

Any suggestions as to what I can do to investigate this discrepancy further? It's really piqued my interest now (well, and is annoying me), so I'd really love to discover what is going on with it!

poppastring commented 3 years ago

@manzanotti Scott relies on a specific code highlighter that is able to handle inline <br> tags, so I am not sure that is a good example.

Scott and I sat down for a chat and I get how we historically ended up here and it kind of makes sense now. This weekend I updated the wiki to hopefully make it a little easier to include your own code in a blog post. Hope this helps.

manzanotti commented 3 years ago

Sorry to keep going on about this, but I still don't understand what is going on here, which means that this has the potential to be a learning experience!

I've looked at the example that you've posted on the wiki, and what I don't understand is that Scott's XML post does not contain the double HTML-encoded text that your suggested method would result in.

With regard to that being down to the code highlighter Scott is using, you could be right about that. However, I've been looking at the old SyntaxHighlighter site (https://web.archive.org/web/20180105064708/http://alexgorbatchev.com/SyntaxHighlighter/manual/installation.html) using the Wayback Machine, and it states that SyntaxHighlighter can only work with escaped HTML. With the double-decode, I am at a loss to explain how Scott's blog is generating escaped HTML from that XML.

Are you able to expand on what the double decode in the codebase is meant to achieve, given that you now have the historical context? So far, admittedly, all the evidence I've accrued suggests that the double-decode shouldn't be needed, so I've clearly missed something!

To be clear, this isn't a big issue. I'm happy to just remove the double-decode from my version of the code (having to manually double-encode HTML would get frustrating pretty quickly), but it would be great to get a proper understanding of the situation!
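For reference, the manual double-encoding mentioned above could be automated with a small helper. This is a sketch in Python rather than anything from the DasBlog codebase; `encode_for_double_decode` is a hypothetical name, and `html.unescape` stands in for the single `HtmlDecode` call made at render time.

```python
import html


def encode_for_double_decode(snippet: str) -> str:
    """Escape a code snippet twice so it survives one extra HTML decode."""
    once = html.escape(snippet, quote=False)
    return html.escape(once, quote=False)


snippet = '<div id="testId">This is a test</div>'

# What would be pasted into the post source, inside the <pre> block.
post_content = encode_for_double_decode(snippet)

# The render-time decode strips one level, leaving normally escaped HTML
# in the page source, which the browser then displays as text.
page_source = html.unescape(post_content)

print(post_content)
print(page_source)
```

If the extra decode were ever removed, content prepared this way would show a visible `&lt;` in the browser, which is the compatibility concern raised earlier in the thread.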

shanselman commented 3 years ago

The issue is that my stuff is inside a <pre> and the pre decodes it. If your stuff isn't in a pre then you'll have a weird experience. Pres are special.

https://stackoverflow.com/a/13010144

manzanotti commented 3 years ago

> The issue is that my stuff is inside a <pre> and the pre decodes it. If your stuff isn't in a pre then you'll have a weird experience. Pres are special. https://stackoverflow.com/a/13010144

I'm reasonably certain that's not it; I'm using pre tags, but the double-decode that Das Blog does of the post content means that there is no escaped HTML inside the pre tag to decode. It's just actual HTML at that point, and the Chromium engine parses it and turns it into elements in the DOM.

Using the example of the Stack Overflow answer you posted, if I put the content of the second example into the source of a blog post, the double-decode Das Blog does renders it to the page as the first example.

And this happens to the content you kindly zipped up and posted here, both on my dev laptop and my actual blog. But not on your blog. Hence my brain exploding! :)
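The difference between escaped and already-decoded content inside a pre tag can be checked with any HTML parser. The sketch below uses Python's standard `html.parser`, which is not a browser engine, so treat it as an approximation of what Chromium does; `TagCollector` and `parse` are illustrative names.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records which start tags and which text the parser sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags, "".join(collector.text)


# Escaped HTML inside <pre>: the parser sees text, not an element.
escaped = '<pre>&lt;div id="testId"&gt;This is a test&lt;/div&gt;</pre>'
# Already-decoded HTML inside <pre>: the parser sees a real <div> element.
decoded = '<pre><div id="testId">This is a test</div></pre>'

print(parse(escaped))
print(parse(decoded))
```

Only the escaped version keeps the `<div ...>` characters as visible text; once the content has been decoded before it reaches the page, the pre tag has nothing left to display as code.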

poppastring commented 3 years ago

@manzanotti

Are you saying the Publishing Code Snippets wiki page does not work? I ran through the steps with your span example above and it appears to work for me (my example is attached).

If, alternatively, you are saying that you are not sure how Scott's post is working, I would posit that it is due to Scott's syntax highlighter being much better than most I have seen. I actually used a version of that highlighter a long time ago; it understands <br> and all sorts of other things inside the <pre>, whereas most do not.

Either way this issue has been closed because we made a decision about what we would explicitly support going forward. Ultimately you can use any alternate method that works for you (including inline styles) but we will specifically test using this encoding method.

I really hope this helps.

2021-04-22.dayentry.zip