pote / planet.rb

A feed aggregator implementation intended to be used with Octopress
MIT License

Switch to `reverse_markdown` instead of `sanitize_html` #55

Open ryanakca opened 8 years ago

ryanakca commented 8 years ago

Using sanitize_html was causing formatting issues on imported content with non-trivial formatting. For example, simply dropping tags caused blocks of the form

<noscript>
    some blah blah blah
</noscript>

to be treated and formatted as code blocks, because Markdown parses indented content as a code block. Similarly, when a post's entire content is stored on a single line in the RSS feed, deleting the tags causes unusual rendering side effects, again due to Markdown's convention of treating continuous blocks of text as paragraphs.
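The indentation problem can be reproduced with a minimal sketch. The `strip_tags` helper below is a hypothetical stand-in for tag dropping in general, not the actual `sanitize_html` implementation, which is not shown in this thread:

```ruby
# Hypothetical stand-in for a sanitizer that simply drops tags.
# This is NOT the real sanitize_html code, only an illustration.
def strip_tags(html)
  html.gsub(/<[^>]+>/, "")
end

# The squiggly heredoc removes the common two-space indent, so the
# inner line keeps four leading spaces, as in the example above.
html = <<~HTML
  <noscript>
      some blah blah blah
  </noscript>
HTML

stripped = strip_tags(html)

# Markdown treats a line indented by four or more spaces (after a
# blank line) as an indented code block, so these lines would be
# rendered as code rather than as a paragraph.
code_block_lines = stripped.lines.select { |line| line.match?(/\A {4,}\S/) }
puts code_block_lines.inspect
```

The tags are gone, but the four-space indent survives, which is exactly what a Markdown renderer picks up as a code block.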

Instead, we use reverse_markdown to convert the HTML to Markdown, thereby preserving the content one cares about (bold/italic, images, tables, lists, links, etc.), while stripping out the undesirable content (scripts, CSS, etc.). This has so far proven to be more reliable than the previous HTML sanitization. It has the added benefit that the entire site, including the aggregated content, is styled identically.

sauron commented 8 years ago

Hey @ryanakca,

Thank you for the pull request. After reading it and taking a look, I must say that the only thing that worries me is that we are forcing Markdown. While it may sound like a good option given that most of us (programmers) are used to it, it may not be as universal as HTML. What do you think about adding it as an option?

ryanakca commented 8 years ago

Hi @sauron,

I'm not particularly sure why universality comes in, since nobody should be editing the Markdown or HTML in the post files planet spits out (these changes only affect files generated by planet). At least, as far as I can tell, any changes made to the files under posts_directory get overwritten by the next run of planet generate. And so the choice of Markdown vs HTML should really be thought of as a question of what storage format should be used for read-only content. And I would argue that Markdown is much more readable (should anybody want to read the generated post files) than spaghetti HTML.

However, if you'd like to make it an option in the name of backwards compatibility, please feel free :)

sauron commented 8 years ago

@ryanakca, you are completely right. It is only for internal use. I mentioned it because I've found myself verifying what was generated in the browser. But that shouldn't happen to anyone anymore once I finally push my branch with the whole test suite. As you mentioned, backward compatibility is the only drawback. I'll finalize the tests and release everything in version 1.0.

Can you create a gist with an example of a post that was being generated incorrectly? That will help me finish the tests with this scenario covered.

Thank you.