Closed R06ky closed 2 years ago
@spassarop - We've had similar issues like this in the past where things are cleaned with no warnings or errors. That's just how it works in many cases.
@R06ky - Are you saying this is cleaning the HTML properly? But you want AntiSamy to report that it did remove something? Or that you think AntiSamy shouldn't be stripping out /href?
Thanks, @davewichers
Yes, the clean HTML return <a>CLICKHERE</a>
, it is expected. But I want to get an error message for this scan result.
Do you know how to tweak the policy to get errorMessages for this case?
See the last two comments on issue #47 and what we did about this same problem before. We didn't change the code. We updated the documentation to be more clear that error messages aren't always generated. So they only true way to know if something was 'bad' is to compare the original to the clean HTML.
@spassarop - Do you have any idea if AntiSamy might reformat something that is good? I.e., not strip anything, but still reformat the data, even slightly, so the 'clean' output is different from the original? If that is the case, then any difference between original and clean might not indicate 'badness', just that they are different.
@davewichers
String html = "<a/href=javascript:[1].find(alert)>CLICKHERE</a>";
System.out.println(AntiSamyUtility.scanSB(html).getCleanHTML()); // output: <a>CLICKHERE</a>
System.out.println(AntiSamyUtility.scanSB(html).getErrorMessage("")); // output:
html = "<a href=javascript:[1].find(alert)>CLICKHERE</a>";
System.out.println(AntiSamyUtility.scanSB(html).getCleanHTML()); //output: <a>CLICKHERE</a>
System.out.println(AntiSamyUtility.scanSB(html).getErrorMessage("")); // output: The href attribute had a value of "javascript:[1].find(alert)". This value could not be accepted for security reasons. We have chosen to remove this attribute from the tag and leave everything else in place so that we could process the input.
The two html should return the same error message, or the first one should return an error message as The /href attribute has been filtered out....
html = "<a/id='@#'></a>";
Assert.assertEquals("<a></a>", AntiSamyUtility.scan(html).getCleanHTML()); // pass
Assert.assertEquals(1,AntiSamyUtility.scan(html).getNumberOfErrors() ); // fail
java.lang.AssertionError:
Expected :1
Actual :0
It is my dependency.
<!--============ AntiSamy Dependencies ============== -->
<!-- https://mvnrepository.com/artifact/org.owasp.antisamy/antisamy -->
<dependency org="org.owasp.antisamy" name="antisamy" rev="1.6.3"/>
<!-- https://mvnrepository.com/artifact/net.sourceforge.nekohtml/nekohtml -->
<dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.22"/>
<!-- https://mvnrepository.com/artifact/org.apache.xmlgraphics/batik-css -->
<dependency org="org.apache.xmlgraphics" name="batik-css" rev="1.14"/>
<!-- https://mvnrepository.com/artifact/org.apache.xmlgraphics/batik-i18n -->
<dependency org="org.apache.xmlgraphics" name="batik-i18n" rev="1.9"/>
<!-- https://mvnrepository.com/artifact/org.apache.xmlgraphics/batik-util -->
<dependency org="org.apache.xmlgraphics" name="batik-util" rev="1.14"/>
<!-- https://mvnrepository.com/artifact/org.apache.xmlgraphics/batik-shared-resources -->
<dependency org="org.apache.xmlgraphics" name="batik-shared-resources" rev="1.14"/>
<!-- https://mvnrepository.com/artifact/org.apache.xmlgraphics/batik-constants -->
<dependency org="org.apache.xmlgraphics" name="batik-constants" rev="1.14"/>
<!-- https://mvnrepository.com/artifact/xerces/xercesImpl -->
<dependency org="xerces" name="xercesImpl" rev="2.8.1"/>
<!-- https://mvnrepository.com/artifact/xml-apis/xml-apis -->
<dependency org="xml-apis" name="xml-apis" rev="1.3.03"/>
<!-- https://mvnrepository.com/artifact/xml-apis/xml-apis-ext -->
<dependency org="xml-apis" name="xml-apis-ext" rev="1.3.04"/>
@spassarop - given the above, any way we can improve AntiSamy error handling so the example that starts with: <a/href will also return an error?
Hi @davewichers , comparing the input HTML and the clean HTML is not a solution. For example,
html = "<a id='1'></a> ";
Assert.assertEquals(html, AntiSamyUtility.scan(html).getCleanHTML());
org.junit.ComparisonFailure:
Expected :<a id='1'></a> 
Actual :<a id="1"></a>
Investigating it, it is an issue from org.cyberneko.html.HTMLScanner.ContentScanner#scan
public boolean scan(boolean complete) throws IOException {
...
while(scan) {
if (c == '/') {
scanEndElement();
}
}
...
}
'/' end the scan, so before the HTML is sending to AntiSamy, the HTML is parsed from<a/id='@#'></a>
to <a></a>
, the AntiSamy policy cannot filter the /+any attribute. out absolutely, since the /+any attribute is not existed anymore.
Given the problem is inside cyberneko, I believe we have encountered issues like this before. As such, we may not be able to do anything to fix this without throwing cyberneko out and using a different parser or writing our own. And that's a big ask that we probably won't do. I'll leave it to @spassarop for his thoughts. He's far more expert on the internals of AntiSamy than me.
I was waiting to be on my computer to verify that last observation by debugging. But that’s it, AntiSamy does not have the exact same HTML parsed by CyberNeko and something similar happens with the serializer output depending on factors like if HTML or XHTML is used.
From our side is very difficult to “fix” or accommodate to those input/output differences without messing up with the incoming/resulting strings in all cases. That’s the risk about depending on other libraries and sometimes we just have to adapt to their results, on every software component we don’t have absolute control :/
On Fri, 9 Jul 2021 at 13:21 R06ky @.***> wrote:
Investigating it, it is an issue from org.cyberneko.html.HTMLScanner.ContentScanner#scan
public boolean scan(boolean complete) throws IOException {... while(scan) { if (c == '/') { scanEndElement(); } }... }
'/' end the scan, so before the HTML is sending to AntiSamy, the HTML is parsed froma/id='@#' to , the AntiSamy policy cannot filter the /+any attribute. out absolutely, since the /+any attribute is not existed anymore.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/issues/92#issuecomment-877303273, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMILCH3PRV3MSUI2NZTTW4OZFANCNFSM5AC34L5Q .
About the specific parser… changing CyberNeko would be an extreme change because AntiSamy relies on that library’s behavior and XML structure output. A different library will have its own different bugs or unresolved issues/features although it may solve some of the issues present in CyberNeko, it’s always a trade off.
Of course, if AntiSamy becomes a problem there are alternatives. This project is one of them https://owasp.org/www-project-java-html-sanitizer/. The main conceptual difference that may difficult migration for testing is that the policy is built by code and not by configuration files (as far as I remember).
On Fri, 9 Jul 2021 at 13:33 Dave Wichers @.***> wrote:
Given the problem is inside cyberneko, I believe we have encountered issues like this before. As such, we may not be able to do anything to fix this without throwing cyberneko out and using a different parser or writing our own. And that's a big ask that we probably won't do. I'll leave it to @spassarop https://github.com/spassarop for his thoughts. He's far more expert on the internals of AntiSamy than me.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/issues/92#issuecomment-877310211, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMOTYVRHZS42J32HJPLTW4QEZANCNFSM5AC34L5Q .
@R06ky - Given this is an issue we've run into before, and aren't prepared to address, I'm going to close this issue as "won't fix" unless you have any brilliant ideas, like getting CyberNeko to implement new features we can leverage.
I have an idea. Before the input HTML(maybe need decode or something) is parsed by CyberNeko, we validate it by a regex(which is defined by the allowed tags) to find this case firstly and return a error. Since it seems / it is the only case which is skipped by policy so far.
final String regex = "(?=/).\\w++(?<!input|a)";// input|a are the allowed tags in policy.
final String string = "<input/> <input></input> <a/href=javascript:[1].find(alert) /href=javascript:[1].find(alert) >CLICKHERE</a>\n";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
String matcherString = matcher.group();
System.out.println(matcherString); // give an error message.
}
@spassarop - Can you review the above idea provided by @R06ky? Does it help and is feasible?
I don't see this feasible because this solution may adjust to @R06ky's problem but not for the general case. AntiSamy cannot rely on regexps to give an error beforehand and completely apart from the actual analysis made after parsing the HTML into objects. Also that approach must be done on the whole HTML string trying to evade comments and any other thing that may result in false positives or omissions, again, doesn't apply for the general case.
I'm glad the solution works for that specific policy and use case but in the whole library use, it just doesn't cover much. I would leave it as wontfix
as it depends entirely on CyberNeko.
OK. Closing this as a won't fix per Sebastian's explanation.
The policy of tag a is below. The clean HTML removed the /href attribute without any error, could you help to have a look at it? How to return an error message for this case?
rev="1.6.3"
Thanks in advance!