sfbrigade / support-sfusd

Support SF Schools web site (dev deployment at: https://support-sfusd.vercel.app)
https://supportsfschools.org

[Spike] for addressing HTML in DB #236

Open nickvisut opened 4 days ago

nickvisut commented 4 days ago

A recent change to the seed file includes values that mix HTML (including class names) with plain text. If we store data like this, especially if it becomes editable down the road (e.g. via a CMS), it could increase our attack surface.

Need to look into 1) best practice and 2) sanitizing or storing it in a different way.

See issue #222 for referenced code.


Original comment below:

@BeeSeeWhy @mattgianni @thomhickey might make sense to get this merged in despite my question above. Any recos on how to tackle HTML in our data, though? Is this fine?

Originally posted by @nickvisut in https://github.com/sfbrigade/support-sfusd/issues/222#issuecomment-2359857314

mattgianni commented 3 days ago

I did a little digging last night on this topic, but couldn't find anything that seemed like an authority on it.

The general consensus seems to be that storing HTML in a database isn't really a serious security issue in itself -- the trouble comes when you render HTML content. It doesn't matter whether it is stored in a DB, a filesystem or memory ... if it comes from an untrusted source, it's dangerous. There are issues that are specific to DBs (like SQL-injection), but these issues are independent of HTML/JS.

Generally speaking, validating/sanitizing HTML from untrusted sources doesn't seem practical. Web browsers are just too powerful and ever-changing. I've read that some people are using Markdown instead to reduce the risk ... but that seems like a big mistake to me. (It might be even easier to take advantage of bugs in open-source Markdown libraries ...)

If we are going to render HTML or JS on the site, whether we store it in GitHub or in Postgres at Vercel, it seems like we need to trust the authors.
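To make the "trouble comes when you render" point concrete, here is a minimal sketch of escaping a stored value at render time, so the browser treats it as text rather than markup. `escapeHtml` and the sample string are hypothetical and not taken from the repo. If the site renders through React (an assumption here), text interpolated in JSX is already escaped by default and this only matters where `dangerouslySetInnerHTML` or hand-built HTML strings are involved.

```typescript
// Map each HTML-significant character to its entity.
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: turn an untrusted string into inert text.
function escapeHtml(untrusted: string): string {
  return untrusted.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

// The same value is harmless sitting in a DB row or a file;
// it only becomes dangerous if it is rendered without escaping.
const stored = '<img src=x onerror="alert(1)">';
console.log(escapeHtml(stored));
// prints: &lt;img src=x onerror=&quot;alert(1)&quot;&gt;
```

The trade-off is that escaping also neutralizes the intentional markup (class names, etc.) in the seed data, so it only fits fields that are meant to be plain text.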

nickvisut commented 3 days ago

Good stuff, thanks for looking into it! How about forcing a subset of HTML (eg via a DSL like Markdown)?
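A rough sketch of what "forcing a subset" could look like: escape everything first, then generate only a fixed set of tags from a tiny markup. The mini-DSL here (`**bold**`, `*italic*`, blank line for a new paragraph) and the name `renderLimitedMarkup` are made up for illustration. It is not real Markdown and not a proposal for a specific library.

```typescript
// Escape every HTML-significant character so the input can never open a tag.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical mini-DSL renderer. The only tags that can appear in the
// output are the <p>, <strong>, and <em> that this function writes itself.
function renderLimitedMarkup(source: string): string {
  return source
    .split(/\n\s*\n/) // a blank line separates paragraphs
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0)
    .map((paragraph) => {
      const safe = escapeHtml(paragraph); // escape BEFORE adding any tags
      const formatted = safe
        .replace(/\*\*([^*]+)\*\*/g, "<strong>$1</strong>")
        .replace(/\*([^*]+)\*/g, "<em>$1</em>");
      return `<p>${formatted}</p>`;
    })
    .join("\n");
}

console.log(renderLimitedMarkup("**Volunteer** today <script>alert(1)</script>"));
// prints: <p><strong>Volunteer</strong> today &lt;script&gt;alert(1)&lt;/script&gt;</p>
```

The design choice is allowlist-by-construction: nothing from the author is ever passed through as markup, so there is no blocklist to keep up to date. The cost is that class names and arbitrary layout can no longer live in the data.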


nickvisut commented 3 days ago

Ah, I parsed your message a bit too quickly. What are your thoughts on having some protection vs none, however? I would think that, yes, it's an arms race, but covering the more obvious scenarios (like not outputting JS if it can be helped) would be feasible. As a rough and hyperbolic counterpoint, an analogous position would be that it's impossible to fully secure an OS b/c of zero-days, so any effort in that direction would be futile.


nickvisut commented 3 days ago

(wrt e.g. SQL injection, we could break that out into a sibling ticket or just rename this one to be more expansive)
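For whoever picks up the SQL-injection side, the usual defense is keeping values out of the query text entirely. A minimal sketch, assuming nothing about how the project talks to Postgres today: the `sql` tag, the `ParameterizedQuery` shape, and the `school` table are hypothetical. The `$1`-style placeholders are PostgreSQL's positional parameter syntax, and the `{ text, values }` shape is modeled on the query objects that drivers such as node-postgres accept.

```typescript
// Query text and values travel separately; the driver never splices
// a value into the SQL string.
interface ParameterizedQuery {
  text: string;
  values: unknown[];
}

// Hypothetical tagged template: each ${value} becomes $1, $2, ...
// in the text, and the value itself goes into the values array.
function sql(strings: TemplateStringsArray, ...values: unknown[]): ParameterizedQuery {
  const text = strings.reduce(
    (acc, part, i) => acc + part + (i < values.length ? `$${i + 1}` : ""),
    "",
  );
  return { text, values };
}

const name = "Lowell'; DROP TABLE school; --";
const query = sql`SELECT * FROM school WHERE name = ${name}`;

console.log(query.text);
// prints: SELECT * FROM school WHERE name = $1
console.log(query.values.length);
// prints: 1
```

If the project goes through an ORM or query builder, that layer most likely parameterizes already, and the thing to audit would be any raw-query escape hatches.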


mattgianni commented 1 day ago

I think it comes down to the use case. If the HTML/JS is coming from our team, I wouldn't be worried about it. Storing the HTML in a DB vs FS seems pretty similar.

If down the road we allow anonymous website users to post comments, etc., that use case would make me MUCH more nervous about user-submitted HTML of course.

(One crazy thought occurred to me though, and I'm not seriously suggesting it -- it seems like it would be possible to get one of these LLMs to review user-submitted HTML/JS for potential security problems during validation. I wonder how reliable something like that could be.)