[Discussion] Blog post contents' format

keianrao commented 4 years ago

As part of designing the models for our frontend, backend, and database tables, I've stumbled a lot on the issue of what format will be accepted for user-written content.

I quickly found out that there's basically no way to create a blog content management system without markup. My first two choices were HTML (a markup language, the 'ML') and Markdown. They are still the only options.

The choice I was going to settle on was HTML. It gives all the capabilities needed for a rich document: formatting, headers, images, links. To render it, my plan was to assign the content to element.innerHTML, which is obviously a huge injection risk, so I set out to research how to sanitise HTML (on the server side).

keianrao commented 4 years ago

However, when I did, I found the following:

A large number of ways to attempt an injection: https://stackoverflow.com/a/2702587
A section on MDN discussing the issue: https://developer.mozilla.org/en-US/docs/Web/API/Element/innerHTML#Security_considerations

Recommendations assume that you'd like to insert plaintext, in which case you can navigate to the DOM node and set its text content directly, or try to disable all HTML tags by replacing the tag opener < with an escaped form. But in our case, we need rich documents, so those aren't options.

There are HTML sanitiser libraries for Python, but any of them would be an additional external dependency, and there are murmurs that they aren't truly safe.

The truly safe way is of course to parse user content manually, but for a single person of my skill level, the format to parse will have to be simple.
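For reference, the plaintext escaping mentioned above can be sketched in a few lines of Python. This is only an illustration using the standard library's `html.escape`; the function name `render_plaintext_post` and the choice of wrapping paragraphs in `<p>` are assumptions, not part of the project.

```python
import html


def render_plaintext_post(body: str) -> str:
    """Escape HTML-significant characters and wrap each paragraph in <p>.

    Hypothetical helper: blank lines are treated as paragraph breaks.
    """
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    # html.escape replaces &, <, > and both quote characters with entities,
    # so nothing the user typed can open a tag or an attribute.
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
```

This makes injected tags inert, but it also discards all formatting, which is why it does not help when rich documents are required.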

keianrao commented 4 years ago

Side note: If we were using CGI scripts, or any server software that generates whole static pages, then the issue could be solved somewhat by disabling JavaScript using Content-Security-Policy (CSP) headers. But older browsers that don't know about CSP headers would still be vulnerable.

Furthermore, one of the points of making this program (read: experiment) is to implement the frontend as a JavaScript application, so even if that's normally the way to go, I won't do it.
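For the static-page case in the side note, the response header that tells a browser not to run scripts is `Content-Security-Policy`. A minimal sketch as a WSGI application follows; the function name `static_page_app` and the page body are placeholders, and this is not something the project implements.

```python
from typing import Callable, Iterable


def static_page_app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Serve one static page with script execution disabled via CSP."""
    body = b"<!DOCTYPE html><title>Post</title><p>Rendered post goes here.</p>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # script-src 'none' forbids external scripts, inline scripts and
        # inline event handlers in browsers that support CSP.
        ("Content-Security-Policy", "script-src 'none'"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Browsers that predate CSP ignore the header entirely, so this is a second line of defence rather than a replacement for sanitising.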

keianrao commented 4 years ago

I think a rudimentary Markdown-to-HTML translator is possible, using several regex replaces... The troublesome parts are links and images, but if you have capture groups and lookbehinds, then you can get past them as well.

But to use regex like that is lazy hacking, not what I would call robust. Looking at the CommonMark spec, there are a great many test cases for a parser to be checked against. I doubt a lazy regex-based replacer can pass many of them.
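To make the idea concrete, here is a rough sketch of such a regex-based translator in Python. It is an assumption-laden illustration, not CommonMark-compliant: it handles only bold, italics, links and images, the function name `tiny_markdown` is made up, and it only accepts http(s) URLs.

```python
import html
import re

# Only http(s) URLs are accepted, which keeps javascript: URLs out.
# '*' is excluded so the emphasis patterns below cannot match inside a URL.
_URL = r"(https?://[^\s)*]+)"


def tiny_markdown(text: str) -> str:
    """Translate a tiny, hypothetical Markdown subset to HTML."""
    # Escape first, so any raw HTML typed by the user becomes inert text.
    out = html.escape(text)
    # Images: ![alt](url), using capture groups for alt text and URL.
    out = re.sub(r"!\[([^\]*]*)\]\(" + _URL + r"\)",
                 r'<img alt="\1" src="\2">', out)
    # Links: [label](url); the lookbehind skips brackets preceded by '!'.
    out = re.sub(r"(?<!!)\[([^\]*]+)\]\(" + _URL + r"\)",
                 r'<a href="\2">\1</a>', out)
    # Bold before italics, so '**' is not consumed as two single '*'.
    out = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", out)
    out = re.sub(r"\*(.+?)\*", r"<em>\1</em>", out)
    return out
```

Nested emphasis, code spans, lists, block quotes and reference links are all missing, which is exactly the kind of gap the CommonMark test cases would expose.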

keianrao commented 4 years ago

The simplest format to parse is always DSV (delimiter-separated values).

It's actually used in one markup language: troff. If you use only basic macros or commands, then troff uncompromisingly makes you write every instruction (non-content) on its own line, with optional arguments following it, separated by spaces.

It doesn't make for something very readable in its unrendered form, but it does perform. And I can make an extremely simple clone of it, with two extra macros for links and images.
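A minimal sketch of what parsing such a clone could look like, in Python. The macro names (`.H`, `.P`, `.LINK`, `.IMG`) and the node shape are invented for illustration; the thread does not define them.

```python
from typing import Dict, List

# Hypothetical macro set: heading, paragraph break, link, image.
KNOWN_MACROS = {"H", "P", "LINK", "IMG"}


def parse_markup(source: str) -> List[Dict[str, object]]:
    """Split a troff-like document into macro nodes and text nodes."""
    nodes: List[Dict[str, object]] = []
    for line in source.splitlines():
        if line.startswith("."):
            # An instruction line: macro name, then space-separated arguments.
            parts = line[1:].split()
            if not parts or parts[0] not in KNOWN_MACROS:
                raise ValueError(f"unknown macro line: {line!r}")
            nodes.append({"macro": parts[0], "args": parts[1:]})
        elif line.strip():
            # Anything else is content and is never interpreted as markup.
            nodes.append({"text": line})
    return nodes
```

Because every instruction sits on its own line and unknown macros are rejected, the parser never has to reason about nesting or escaping inside running text.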

keianrao commented 4 years ago

Another thing to consider is that, even if we accept some markup format and then translate it to HTML, if we then insert that finished HTML into the page using .innerHTML, there is still the risk of injection. A safer way would be to insert each part of the user content one by one, through the DOM API. That may mean separate model classes, or it may mean a JSON array.
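One possible shape for the JSON array option, sketched on the server side in Python. The part types, field names and the function `serialise_post` are all assumptions. The idea is that the frontend would then create one element per part with `document.createElement` and assign the strings through `textContent` or attribute setters, never through `.innerHTML`.

```python
import json
from typing import Any, Dict, List

# Hypothetical whitelist: part type -> the only fields that are kept.
ALLOWED_FIELDS = {
    "heading": ("text",),
    "paragraph": ("text",),
    "link": ("text", "href"),
    "image": ("alt", "src"),
}


def serialise_post(parts: List[Dict[str, Any]]) -> str:
    """Validate content parts and serialise them as a JSON array."""
    clean = []
    for part in parts:
        kind = part.get("type")
        if kind not in ALLOWED_FIELDS:
            raise ValueError(f"unsupported part type: {kind!r}")
        # Copy only whitelisted fields; a missing field raises KeyError.
        fields = {name: str(part[name]) for name in ALLOWED_FIELDS[kind]}
        for key in ("href", "src"):
            # Reject javascript: and other non-http(s) URLs up front.
            if key in fields and not fields[key].startswith(("http://", "https://")):
                raise ValueError(f"unsupported URL in {key}: {fields[key]!r}")
        clean.append({"type": kind, **fields})
    return json.dumps(clean)
```

The text is deliberately left unescaped here, since a client that assigns it via `textContent` treats it as plain characters rather than markup.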

keianrao commented 4 years ago

If we want to make use of libraries, we could use https://github.com/commonmark/commonmark.js, the reference JS implementation by the people behind CommonMark. It is billed as a reference implementation, but they have no qualms showing how to import and use it, and they even have a 'safe' flag for free sanitisation of embedded HTML and some dangerous things.

It's not terribly expensive either, at 272K. That is still heavy for a page, but when browsing multiple pages, the browser will probably use a cached copy..?

keianrao commented 4 years ago

I think I will remove the requirement for rich text blog posts, and instead go for plain text blog posts..

I don't think rolling our own solution for markup is a wise decision. And even if I use a library to bring rich text to this app, users could only embed images by giving a URL to wherever the image is hosted, unless we also let them upload their images together with the blog post. I think the app would only shine if we implemented the latter as well.

Such things are a bit beyond the scope of this experiment, which is supposed to be about UI.

keianrao / KayaBlog1

[Discussion] Blog post contents' format #4