googleinterns / step1-2020

Apache License 2.0
2 stars 4 forks source link

Sanitize and escape html #74

Closed AaronLopes closed 4 years ago

AaronLopes commented 4 years ago

Fixes #44 (review test files in #78)

anbenson commented 4 years ago

1) My understanding is that in general we will jsoup.clean (sanitize) but not escape. If you escape, you're going to see <code><p>some text</p><code> everywhere. 2) Why not the description field? 3) I notice the StackOverflow client is excluded here. Are there separate changes for those in another PR, or what's the status of that?

AaronLopes commented 4 years ago

3. I notice the StackOverflow client is excluded here. Are there separate changes for those in another PR, or what's the status of that?

  1. I don't remember there ever being nested tags in either W3/GeeksForGeeks in a non-html context (on queries tested so far/sifting through the site).
  2. Same here regarding tags in the descriptions.
  3. It was my understanding that escaping/stripping regarding SO was being done in #63, as per the latest comments.
anbenson commented 4 years ago

It's safe to clean/escape things that aren't HTML too. Whether you do so or not should depend on your intent - is the string you're considering untrusted input? In this case, all fields of the Card come from the wild (scraped websites), so they're untrusted, so you'll need to clean or escape everything, regardless of whether you expect there to be HTML or not. If you don't do so, then it's an avenue for a XSS attack. (Maybe you've never seen there being tags in the W3Schools description field, but if that causes you not to sanitize it, then a malicious actor with control of the W3Schools website could carry out an XSS attack on you.)

For SO - noted.

Still confused on why escaping is being done, though - I thought we would only do that if we know the field contains HTML code (as a hack to allow users to search for HTML code).

AaronLopes commented 4 years ago

It's safe to clean/escape things that aren't HTML too. Whether you do so or not should depend on your intent - is the string you're considering untrusted input? In this case, all fields of the Card come from the wild (scraped websites), so they're untrusted, so you'll need to clean or escape everything, regardless of whether you expect there to be HTML or not. If you don't do so, then it's an avenue for a XSS attack. (Maybe you've never seen there being tags in the W3Schools description field, but if that causes you not to sanitize it, then a malicious actor with control of the W3Schools website could carry out an XSS attack on you.)

For SO - noted.

Still confused on why escaping is being done, though - I thought we would only do that if we know the field contains HTML code (as a hack to allow users to search for HTML code).

Gotcha.

Regarding escaping - After our meeting, I was under the impression that detecting html before escaping was irrelevant and implies the use of regex, which we wanted to rule out (wasn't aware of that hack though, which is a good idea).

anbenson commented 4 years ago

I hope I wasn't misleading. Using regex to parse HTML is indeed imperfect, and I recommended not using it to determine whether to sanitize because 1) it's imperfect and 2) the regex engine is less efficient than just sanitizing it anyway, which is a simple (ish) O(n) sweep.

Yes, the hack I think we're doing by escaping HTML when we detect there may be HTML code (the simplest way to do this is probably just checking if the query contains "html"?) does open us up to XSS attacks and is Not Great. But the hope is to protect ourselves in most cases (e.g. when the query does not contain "html") and make sure the app is still usable for HTML queries.

kaylaleung commented 4 years ago

@ iris and I when ready for review!

anbenson commented 4 years ago

This PR contains the GeeksForGeeks test which doesn't seem intentional.

My understanding was that you would only escape the code if the query contained 'html' or whatever heuristic is being used, but I see that most fields are unconditionally both sanitized and escaped.

kaylaleung commented 4 years ago

Are you planning to add the other tests in a different PR?

AaronLopes commented 4 years ago

Are you planning to add the other tests in a different PR?

The GeeksForGeeks tests? They're in #78

kaylaleung commented 4 years ago

Are you planning to add the other tests in a different PR?

The GeeksForGeeks tests? They're in #78

I meant the escaping tests for W3Schools

AaronLopes commented 4 years ago

Are you planning to add the other tests in a different PR?

The GeeksForGeeks tests? They're in #78

I meant the escaping tests for W3Schools

Yes, these should be in a separate PR.

joanchenx commented 4 years ago

Are you planning to add the other tests in a different PR?

The GeeksForGeeks tests? They're in #78

I meant the escaping tests for W3Schools

Yes, these should be in a separate PR.

Can you create an issue for this?