voussoir / timesearch

The subreddit archiver
BSD 3-Clause "New" or "Revised" License

Offline Reading <> Invalidates HTML #4

Closed: WAUthethird closed this issue 5 years ago

WAUthethird commented 5 years ago

Hey there, it's me again. When I compile a subreddit using the offline_reading functionality, there always seem to be errors whenever I merge all the files into one (using something like HTML Merge). Specifically, these errors refer to the < and > signs, as HTML uses them quite heavily.

Would it be possible for offline_reading to replace < and > with &lt; and &gt;? These codes do work correctly without errors, and would fix a lot of problems I've been having, along with keeping it WYSIWYG. (It would also help in cases where, for example, someone puts <i into their comment/submission and the HTML renderer renders the rest of the entire document in italics.)

Thanks!

voussoir commented 5 years ago

Hey again

Oops, I assumed the markdown library would escape the text content. I'm surprised but it's my fault for not testing that.

I took a look at some options for escaping the text. I saw Python has an html module, so I could html.escape the text before running the markdown, but that also escapes the & character, which breaks HTML entities that are okay to use, like &nbsp;. So for now let's try just replacing < and > with &lt; and &gt; and see if that's sufficient.
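To make that trade-off concrete, here is a minimal sketch. It is not the code from the commit: escape_angle_brackets is a hypothetical helper name and the sample string is made up for illustration. html.escape is the standard-library function.

```python
import html


def escape_angle_brackets(text):
    # Hypothetical helper for illustration, not the function used in timesearch.
    # Only < and > are replaced, so existing entities such as &nbsp; are left alone.
    return text.replace('<', '&lt;').replace('>', '&gt;')


# Made-up comment text containing a legitimate entity and a stray, unclosed tag.
sample = 'a&nbsp;b <i unclosed'

# html.escape also rewrites & as &amp;, which mangles the existing entity.
full = html.escape(sample)

# The narrower replacement neutralizes the stray tag but keeps &nbsp; intact.
partial = escape_angle_brackets(sample)

print(full)
print(partial)
```

The first line printed shows &nbsp; turned into &amp;nbsp;, which a browser would display as literal text. The second keeps the entity while still turning the stray <i into harmless text.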

Commit: https://github.com/voussoir/timesearch/commit/871a56dd81926502ac26e0bc40b4d627cda69b16

Thanks for bringing this up! You may close this issue if that fixes it for you.

WAUthethird commented 5 years ago

Thanks a lot, that works perfectly!

WAUthethird commented 5 years ago

The only problems left now are unsupported encoding errors - none caused by this software, of course. Amazon links seem to be the most troublesome when it comes to this, but it's not too hard of a fix.