jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

EscapeMode.none #411

Closed carlsonsantana closed 8 years ago

carlsonsantana commented 10 years ago

There is a need for an EscapeMode none, because there are situations where forced escaping is undesirable. In my case, I need to convert a document without escaping the JavaScript token "&&".

xeno6696 commented 9 years ago

Fully agree. I'm in a situation where we're using jsoup in unit tests for an XSS firewall implementation, and we have discovered that Jsoup has been nullifying our attack strings, so the validation framework pushes out false positives. The ability to deliberately choose to spit out raw data is extremely important.

jhy commented 9 years ago

I don't really understand the use case here. If there's no HTML escaping, the output generated will not be valid HTML. And that's the essential promise of jsoup. The contents of script tags can be escaped, and will be correctly un-escaped by any browser.

Can you clarify for me?
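
For attribute values, the round trip described above can be sketched with plain Java. This is only an illustration, not jsoup's implementation: the class and method names are hypothetical, and only four entities are handled. It shows "&&" being escaped for output and decoded back to the original text, which is what an HTML parser does when it reads the attribute.

```java
final class EscapeRoundTrip {
    // Escape the characters that are significant inside a double-quoted HTML attribute.
    // Hypothetical helper for illustration; not part of jsoup.
    static String escapeAttribute(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Reverse of escapeAttribute; decodes only the four entities produced above.
    static String unescapeAttribute(String escaped) {
        return escaped
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&amp;", "&"); // decoded last so "&amp;lt;" becomes "&lt;", not "<"
    }

    public static void main(String[] args) {
        String raw = "if (a && b) { run(); }";
        String escaped = escapeAttribute(raw);
        System.out.println(escaped);
        System.out.println(unescapeAttribute(escaped).equals(raw));
    }
}
```

The escaped form differs from the input text, but the decoded value is identical, so nothing is lost for a consumer that parses the HTML. A consumer that inspects the serialized string directly will see the difference.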

xeno6696 commented 9 years ago

Sure. In our XSS framework, we use AntiSamy as a whitelisting tool for various tags/attributes, etc. On the front end we used a rich text editor like TinyMCE. In our validation logic, to handle customized markdown processing we use an HTML parser, in this case Jsoup.

The problem we were running into with Jsoup is that the initial call to Jsoup.parse(String) was taking samples of deliberately invalid HTML with custom XSS attacks, and nullifying the XSS attacks before they were processed by AntiSamy. Basically, it was creating false negatives. Here's a concrete example:

<IMG SRC="<script>alert(1);</script>">

On the call to Jsoup.parse(String) it would nullify the attack attempt like this:

<IMG SRC="&lt;script&gt;alert(1);&lt;/script&gt;">

We wanted the original to simulate a MiTM attack. This version is harmless.

In other cases, say when we want to validate unbalanced tags, Jsoup puts them back. Ultimately, in the context of security scanning, you want to make sure you're not transforming the output before performing validation.

I would expect to be able to programmatically shut off the good-faith attempt to "fix" invalid HTML through some kind of config.

In the end we realized that at present the design philosophy of Jsoup was, as you said, to fix broken HTML so we had to go find another parser. But giving the programmer the ability to control when and where Jsoup fixes things would be a very democratic thing to do!
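
The point about validating before any transformation can be illustrated with a check that runs on the raw string. This is a minimal sketch, not AntiSamy's or jsoup's logic: the class name, method name, and regular expression are assumptions, and a regex like this is only a heuristic, not a real HTML tokenizer. It flags the IMG example above in its raw form and finds nothing in the escaped form.

```java
import java.util.regex.Pattern;

final class RawInputCheck {
    // Hypothetical heuristic: a '<' inside a double-quoted attribute value.
    private static final Pattern MARKUP_IN_ATTRIBUTE =
        Pattern.compile("=\\s*\"[^\"]*<[^\"]*\"");

    // Returns true when the raw text contains markup inside a quoted attribute value.
    static boolean looksLikeAttributeInjection(String html) {
        return MARKUP_IN_ATTRIBUTE.matcher(html).find();
    }

    public static void main(String[] args) {
        String raw = "<IMG SRC=\"<script>alert(1);</script>\">";
        String normalized = "<IMG SRC=\"&lt;script&gt;alert(1);&lt;/script&gt;\">";
        System.out.println(looksLikeAttributeInjection(raw));        // raw input is flagged
        System.out.println(looksLikeAttributeInjection(normalized)); // escaped form is not
    }
}
```

If the detector only ever sees the second string, the attack attempt goes unrecorded, which is the false negative described above.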

jhy commented 8 years ago

Given the number of people who have asked for this when they really just wanted to use ASCII mode instead, and the number of blown-off feet that would have led to -- I'm never going to implement this. There's democratic and there's dangerous, and this is the latter.

xeno6696 commented 8 years ago

@jhy, with all due respect I don't think you understand the use case.

Some people use Jsoup in markdown processing. Because JSOUP attempts to "fix" invalid HTML, it leads to unintended consequences when actively looking for XSS attacks designed to bypass filters. It transforms invalid input into valid input.

When using JSOUP in an HTML sanitizer this is inherently dangerous. If the library receives, for example, alert(1);</script>, JSOUP transforms this into <script>alert (1);</script>

So, something that should have been recognized as the filter evasion it was, was transformed into an input that was no longer a filter evasion. This blocks me from fine-grained analyses into whether or not I raise an Intrusion flag.

This functionality is input-destructive, and ultimately leads to false negatives in security-critical software. Inputs that were clearly attacks were transformed into "safe" inputs by virtue of this transformation.

JSOUP should allow a mode to shut off transformations like that. In any case, I have migrated all of my clients away from JSOUP.

Since this issue is closed, future developers can get this functionality with the Jericho library. Invalid HTML is far easier to identify with it.
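
A stray closing tag such as the one in the example above can be detected on the raw input, before any parser repairs it. The following is a rough sketch under stated assumptions: the class and method names are hypothetical, the tag pattern is a simplification that ignores comments, CDATA, and script contents, and it is not how Jericho or jsoup work internally.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrayCloseTagCheck {
    // Simplified tag pattern: group 1 is "/" for closing tags, group 2 is the tag name.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns true if a closing tag appears with no matching open tag before it.
    static boolean hasStrayClosingTag(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                open.push(name);
            } else if (!open.contains(name)) {
                return true; // closing tag that was never opened
            } else {
                // Discard unclosed elements (e.g. <br>) until the matching open tag is popped.
                while (!open.pop().equals(name)) {
                    // keep popping
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasStrayClosingTag("alert(1);</script>"));
        System.out.println(hasStrayClosingTag("<script>alert(1);</script>"));
    }
}
```

Run against the raw string, the first input is flagged and the second is not. Once a parser has inserted the missing opening tag, that distinction is gone.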

jhy commented 8 years ago

Thanks for the elaboration. I do understand the use case. I don't agree with making the library dangerous for users.


xeno6696 commented 8 years ago

Explain to me what could be more dangerous than creating false negatives in security critical software?

jhy commented 8 years ago

Users disabling escapes and letting everything through -- I have seen many cases of requests for this with people misunderstanding the way escapes work. If we supported this we would be enabling real security issues, let alone simply incorrect HTML generation.

The example you're coming from can be supported with other tools, as you've mentioned.

Having jsoup deliberately produce incorrect HTML is not something I'm ever going to support.

xeno6696 commented 8 years ago

> Users disabling escapes and letting everything through -- I have seen many cases of requests for this with people misunderstanding the way escapes work.

Then you make sure you've done your job to document that danger. That's what we do with ESAPI. As an API writer, it's not your job to prevent Joe Schmoe from cutting their own hands off. I can configure iptables to accept all incoming connections, but it isn't the responsibility of iptables developers to disallow that configuration--just as it isn't your responsibility to prevent users from configuring an EscapeMode.NONE. Use Javadocs to warn. Is one of JSOUP's features to protect idiots? I don't see that anywhere at JSOUP.org.

OWASP ESAPI lets you turn off nearly any feature you want. Because in a production environment, that's actually a requirement. What if an application using JSOUP sits behind a WAF designed to block exactly the kinds of attacks that you think JSOUP is preventing? Then the application is using cycles to duplicate work that doesn't need to be done. That's a different kind of use case: What if the application needs an HTML parser but doesn't need XSS protection?

If I seem passionate about this, it's because EscapeMode.NONE would have saved a project I work on 600+ man-hours spent converting to Jericho and on the pentesting that went along with it.