nahsra / antisamy

a library for performing fast, configurable cleansing of HTML coming from untrusted sources
BSD 3-Clause "New" or "Revised" License
191 stars 92 forks source link

<a/href=javascript:[1].find(alert)>CLICKHERE</a> does not return error #92

Closed R06ky closed 2 years ago

R06ky commented 3 years ago

The policy of tag a is below. The clean HTML removed the /href attribute without any error, could you help to have a look at it? How to return an error message for this case?

rev="1.6.3"

Thanks in advance!

         <tag name="a" action="validate">

            <!--  onInvalid="filterTag" has been removed as per suggestion at OWASP SJ 2007 - just "name" is valid -->
            <attribute name="href"/>
            <attribute name="onFocus"/>
            <attribute name="onBlur"/>
            <attribute name="nohref">
                <regexp-list>
                    <regexp name="anything"/>
                </regexp-list>
            </attribute>
            <attribute name="rel">
                <literal-list>
                    <literal value="nofollow"/>
                </literal-list>
            </attribute>
            <attribute name="name"/>

            <attribute name="target">
                <regexp-list>
                    <regexp value="[a-zA-Z0-9\-_\$]+"/>
                </regexp-list>
            </attribute>

        </tag>
davewichers commented 3 years ago

@spassarop - We've had similar issues like this in the past where things are cleaned with no warnings or errors. That's just how it works in many cases.

@R06ky - Are you saying this is cleaning the HTML properly? But you want AntiSamy to report that it did remove something? Or that you think AntiSamy shouldn't be stripping out /href?

R06ky commented 3 years ago

Thanks, @davewichers Yes, the clean HTML return <a>CLICKHERE</a>, it is expected. But I want to get an error message for this scan result. Do you know how to tweak the policy to get errorMessages for this case?

davewichers commented 3 years ago

See the last two comments on issue #47 and what we did about this same problem before. We didn't change the code. We updated the documentation to be more clear that error messages aren't always generated. So they only true way to know if something was 'bad' is to compare the original to the clean HTML.

@spassarop - Do you have any idea if AntiSamy might reformat something that is good? I.e., not strip anything, but still reformat the data, even slightly, so the 'clean' output is different from the original? If that is the case, then any difference between original and clean might not indicate 'badness', just that they are different.

R06ky commented 3 years ago

@davewichers

String html = "<a/href=javascript:[1].find(alert)>CLICKHERE</a>";
System.out.println(AntiSamyUtility.scanSB(html).getCleanHTML()); // output: <a>CLICKHERE</a>
System.out.println(AntiSamyUtility.scanSB(html).getErrorMessage("")); // output:

html = "<a href=javascript:[1].find(alert)>CLICKHERE</a>";
System.out.println(AntiSamyUtility.scanSB(html).getCleanHTML()); //output: <a>CLICKHERE</a>
System.out.println(AntiSamyUtility.scanSB(html).getErrorMessage("")); // output: The href attribute had a value of "javascript&#58;&#91;1&#93;&#46;find&#40;alert&#41;". This value could not be accepted for security reasons. We have chosen to remove this attribute from the tag and leave everything else in place so that we could process the input.

The two html should return the same error message, or the first one should return an error message as The /href attribute has been filtered out....

R06ky commented 3 years ago
html = "<a/id='@#'></a>";
Assert.assertEquals("<a></a>", AntiSamyUtility.scan(html).getCleanHTML()); // pass
Assert.assertEquals(1,AntiSamyUtility.scan(html).getNumberOfErrors() ); // fail
java.lang.AssertionError: 
Expected :1
Actual   :0
R06ky commented 3 years ago

It is my dependency.

<!--============ AntiSamy Dependencies ============== -->
        <!-- https://mvnrepository.com/artifact/org.owasp.antisamy/antisamy -->
        <dependency org="org.owasp.antisamy" name="antisamy" rev="1.6.3"/>
        <!-- https://mvnrepository.com/artifact/net.sourceforge.nekohtml/nekohtml -->
        <dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.22"/>
        <!-- https://mvnrepository.com/artifact/org.apache.xmlgraphics/batik-css -->
        <dependency org="org.apache.xmlgraphics" name="batik-css" rev="1.14"/>
        <!-- https://mvnrepository.com/artifact/org.apache.xmlgraphics/batik-i18n -->
        <dependency org="org.apache.xmlgraphics" name="batik-i18n" rev="1.9"/>
        <!-- https://mvnrepository.com/artifact/org.apache.xmlgraphics/batik-util -->
        <dependency org="org.apache.xmlgraphics" name="batik-util" rev="1.14"/>
        <!-- https://mvnrepository.com/artifact/org.apache.xmlgraphics/batik-shared-resources -->
        <dependency org="org.apache.xmlgraphics" name="batik-shared-resources" rev="1.14"/>
        <!-- https://mvnrepository.com/artifact/org.apache.xmlgraphics/batik-constants -->
        <dependency org="org.apache.xmlgraphics" name="batik-constants" rev="1.14"/>
        <!-- https://mvnrepository.com/artifact/xerces/xercesImpl -->
        <dependency org="xerces" name="xercesImpl" rev="2.8.1"/>
        <!-- https://mvnrepository.com/artifact/xml-apis/xml-apis -->
        <dependency org="xml-apis" name="xml-apis" rev="1.3.03"/>
        <!-- https://mvnrepository.com/artifact/xml-apis/xml-apis-ext -->
        <dependency org="xml-apis" name="xml-apis-ext" rev="1.3.04"/>
davewichers commented 3 years ago

@spassarop - given the above, any way we can improve AntiSamy error handling so the example that starts with: <a/href will also return an error?

R06ky commented 3 years ago

Hi @davewichers , comparing the input HTML and the clean HTML is not a solution. For example,

html = "<a id='1'></a>&nbsp";
Assert.assertEquals(html, AntiSamyUtility.scan(html).getCleanHTML());
org.junit.ComparisonFailure: 
Expected :<a id='1'></a>&nbsp
Actual   :<a id="1"></a> 
R06ky commented 3 years ago

Investigating it, it is an issue from org.cyberneko.html.HTMLScanner.ContentScanner#scan

public boolean scan(boolean complete) throws IOException {
...
    while(scan) {
            if (c == '/') {
            scanEndElement();
        }
    }
...    
}

'/' end the scan, so before the HTML is sending to AntiSamy, the HTML is parsed from<a/id='@#'></a> to <a></a>, the AntiSamy policy cannot filter the /+any attribute. out absolutely, since the /+any attribute is not existed anymore.

davewichers commented 3 years ago

Given the problem is inside cyberneko, I believe we have encountered issues like this before. As such, we may not be able to do anything to fix this without throwing cyberneko out and using a different parser or writing our own. And that's a big ask that we probably won't do. I'll leave it to @spassarop for his thoughts. He's far more expert on the internals of AntiSamy than me.

spassarop commented 3 years ago

I was waiting to be on my computer to verify that last observation by debugging. But that’s it, AntiSamy does not have the exact same HTML parsed by CyberNeko and something similar happens with the serializer output depending on factors like if HTML or XHTML is used.

From our side is very difficult to “fix” or accommodate to those input/output differences without messing up with the incoming/resulting strings in all cases. That’s the risk about depending on other libraries and sometimes we just have to adapt to their results, on every software component we don’t have absolute control :/

On Fri, 9 Jul 2021 at 13:21 R06ky @.***> wrote:

Investigating it, it is an issue from org.cyberneko.html.HTMLScanner.ContentScanner#scan

public boolean scan(boolean complete) throws IOException {... while(scan) { if (c == '/') { scanEndElement(); } }... }

'/' end the scan, so before the HTML is sending to AntiSamy, the HTML is parsed froma/id='@#' to , the AntiSamy policy cannot filter the /+any attribute. out absolutely, since the /+any attribute is not existed anymore.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/issues/92#issuecomment-877303273, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMILCH3PRV3MSUI2NZTTW4OZFANCNFSM5AC34L5Q .

spassarop commented 3 years ago

About the specific parser… changing CyberNeko would be an extreme change because AntiSamy relies on that library’s behavior and XML structure output. A different library will have its own different bugs or unresolved issues/features although it may solve some of the issues present in CyberNeko, it’s always a trade off.

Of course, if AntiSamy becomes a problem there are alternatives. This project is one of them https://owasp.org/www-project-java-html-sanitizer/. The main conceptual difference that may difficult migration for testing is that the policy is built by code and not by configuration files (as far as I remember).

On Fri, 9 Jul 2021 at 13:33 Dave Wichers @.***> wrote:

Given the problem is inside cyberneko, I believe we have encountered issues like this before. As such, we may not be able to do anything to fix this without throwing cyberneko out and using a different parser or writing our own. And that's a big ask that we probably won't do. I'll leave it to @spassarop https://github.com/spassarop for his thoughts. He's far more expert on the internals of AntiSamy than me.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/issues/92#issuecomment-877310211, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMOTYVRHZS42J32HJPLTW4QEZANCNFSM5AC34L5Q .

davewichers commented 3 years ago

@R06ky - Given this is an issue we've run into before, and aren't prepared to address, I'm going to close this issue as "won't fix" unless you have any brilliant ideas, like getting CyberNeko to implement new features we can leverage.

R06ky commented 3 years ago

I have an idea. Before the input HTML(maybe need decode or something) is parsed by CyberNeko, we validate it by a regex(which is defined by the allowed tags) to find this case firstly and return a error. Since it seems / it is the only case which is skipped by policy so far.

        final String regex = "(?=/).\\w++(?<!input|a)";// input|a are the allowed tags in policy.
        final String string = "<input/> <input></input> <a/href=javascr&#x69;pt:[1].find(alert) /href=javascr&#x69;pt:[1].find(alert) >CLICKHERE</a>\n";
        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            String matcherString = matcher.group();
            System.out.println(matcherString); // give an error message.
        }
davewichers commented 2 years ago

@spassarop - Can you review the above idea provided by @R06ky? Does it help and is feasible?

spassarop commented 2 years ago

I don't see this feasible because this solution may adjust to @R06ky's problem but not for the general case. AntiSamy cannot rely on regexps to give an error beforehand and completely apart from the actual analysis made after parsing the HTML into objects. Also that approach must be done on the whole HTML string trying to evade comments and any other thing that may result in false positives or omissions, again, doesn't apply for the general case.

I'm glad the solution works for that specific policy and use case but in the whole library use, it just doesn't cover much. I would leave it as wontfix as it depends entirely on CyberNeko.

davewichers commented 2 years ago

OK. Closing this as a won't fix per Sebastian's explanation.