goodcui / owaspantisamy

Automatically exported from code.google.com/p/owaspantisamy

Appending newlines to every line in the clean HTML #107

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Input something like the following:
"<h1>Header</h1>
<p>Paragraph</p>"
2. Execute the following code:
Policy policy = Policy.getInstance(pathToPolicyFile);
AntiSamy as = new AntiSamy();
CleanResults cr = as.scan(dirtyHtml, policy);
String cleanHtml = cr.getCleanHTML();

What is the expected output? What do you see instead?
I would expect the output HTML to look exactly the same as the input HTML, 
but what I get instead is:
"<h1>Header</h1>

<p>Paragraph</p>

"

What version of the product are you using? On what operating system?
AntiSamy 1.4.4
Solaris 10 and Windows XP

Please provide any additional information below.
I had to download the antisamy-required-libs-1.2.zip in order to get AntiSamy 
to run, even though this file is deprecated.

Original issue reported on code.google.com by samjones...@gmail.com on 27 Apr 2011 at 5:15

GoogleCodeExporter commented 9 years ago
Can you look at the hex output of the AntiSamy getCleanHTML() call and confirm 
this? I just made a test case and it doesn't display this behavior. Here's the 
test:

// nl is the newline string used by the test; it is not defined in this excerpt
StringBuilder sb = new StringBuilder();
String header = "<h1>Header</h1>";
String para = "<p>Paragraph</p>";
sb.append(header);
sb.append(nl);
sb.append(para);

String crDom = as.scan(sb.toString(), policy, AntiSamy.DOM).getCleanHTML();
String crSax = as.scan(sb.toString(), policy, AntiSamy.SAX).getCleanHTML();

/* Make sure only 1 newline appears */
assertTrue(crDom.lastIndexOf(nl) == crDom.indexOf(nl));
assertTrue(crSax.lastIndexOf(nl) == crSax.indexOf(nl));

int expectedLoc = header.length() + 1;
int actualLoc = crSax.indexOf(nl);
assertTrue(expectedLoc == actualLoc);

actualLoc = crDom.indexOf(nl);
// account for line separator length difference
assertTrue(expectedLoc == actualLoc || expectedLoc == actualLoc+1);

Original comment by arshan.d...@gmail.com on 7 Jun 2011 at 9:08
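
One way to inspect the hex output requested above is a small dump helper. This is only a sketch: the class name HexDump and the method toHex are hypothetical, and the sample string stands in for the result of getCleanHTML(), mirroring the output described in the original report.

```java
import java.nio.charset.StandardCharsets;

class HexDump {
    /** Returns the UTF-8 bytes of s as space-separated two-digit hex values. */
    static String toHex(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            // Mask to an unsigned value so bytes above 0x7f print as two digits.
            out.append(String.format("%02x", b & 0xff));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for cr.getCleanHTML(); a real check would pass the scan result here.
        String cleanHtml = "<h1>Header</h1>\n\n<p>Paragraph</p>";
        System.out.println(toHex(cleanHtml));
    }
}
```

In the dump, a line feed shows up as 0a and a carriage return as 0d, so an added "\n" or "\r\n" is easy to tell apart from the one that was in the input.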

GoogleCodeExporter commented 9 years ago
Arshan,

I have a couple of things to mention regarding this issue.  First off, I am 
assuming you are trying with 1.4.4, and not with the current trunk, right?

Assuming we are still testing 1.4.4, I noticed a couple of things.  I think the 
people who have experienced this issue have been using the default scan method 
which uses DOM.  Running your test, I notice that the SAX implementation (at 
least for me) does not add the extra newline characters.

One slight difference from what the issue reporter describes: I only see the 
second additional newline when there is text after the second set of HTML 
tags.  For example, "<h1>Header</h1><p>Paragraph</p>test" produces two added 
newlines; without the "test" on the end, just one.  That brings up another 
observation: I see the newlines added regardless of whether newlines existed 
prior to the cleaning.

One more thing to mention: your last two tests (DOM and SAX) don't really 
prove much.  Even if a second newline were being added, you would *expect* to 
see the first newline in the same place that you added it, unless those tests 
were created to show that the newlines weren't being removed by the cleaning.

Original comment by tad...@gmail.com on 10 Jun 2011 at 2:27
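
A check that compares newline counts before and after cleaning is more direct than a position check. The following is a minimal sketch, not part of the AntiSamy test suite: NewlineCount and count are hypothetical names, nl is assumed to be "\n", and the two strings are the input and the output reported in this thread rather than the result of a real scan.

```java
class NewlineCount {
    /** Counts non-overlapping occurrences of needle in haystack. */
    static int count(String haystack, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        int n = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            n++;
            from += needle.length();
        }
        return n;
    }

    public static void main(String[] args) {
        String nl = "\n"; // assumption: the test's newline string
        String input = "<h1>Header</h1>" + nl + "<p>Paragraph</p>";
        // Output as reported for the default (DOM) scan in this thread.
        String reported = "<h1>Header</h1>" + nl + nl + "<p>Paragraph</p>";
        System.out.println("input newlines:    " + count(input, nl));
        System.out.println("reported newlines: " + count(reported, nl));
    }
}
```

In a real test, the second string would come from getCleanHTML(), and the assertion would be that both counts are equal.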

GoogleCodeExporter commented 9 years ago
I have finally gotten a chance to get back to this, and I am seeing the same 
behavior as the previous commenter. I get extra newlines when I do not specify 
the implementation. When I use the SAX implementation, my problems go away.

Unlike the previous commenter, I see this behavior whether or not there is 
extra text after the HTML. Curiously, AntiSamy does not add an extra newline to 
the very last line. So I was incorrect in my original post. It goes from being

(I'll add '\n' to make it more clear)

"<h1>Header</h1>\n
<p>Paragraph</p>"

to

"<h1>Header</h1>\n
\n
<p>Paragraph</p>"

I also noticed that if there are two tags on one line, AntiSamy puts a newline 
between them. For example, if I have "<h1>Header</h1><p>Paragraph</p>" then 
AntiSamy gives me

"<h1>Header</h1>\n
<p>Paragraph</p>"

Original comment by samjones...@gmail.com on 15 Jun 2011 at 6:00

GoogleCodeExporter commented 9 years ago
An update to this issue...

It appears that updating my policy with the line 
"<directive name="formatOutput" value="false"/>" prevents the newlines from 
being added.  Formatting is turned on by default, but turning it off seems to 
do what I want.

Arshan, were you using the defaults when trying to reproduce?

Sam, perhaps you can try this as well and see if it will be an acceptable 
workaround.  Let me know if you see any issues with this (I haven't tested it a 
whole lot).

-Troy

Original comment by tad...@gmail.com on 13 Jul 2011 at 9:59
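
For reference, a fragment showing where that directive would sit in a policy file. This assumes the usual AntiSamy policy layout, in which directives are grouped under a <directives> element; the rest of the policy is omitted.

```xml
<directives>
    <!-- Reported in this thread to stop newlines from being added to the clean HTML -->
    <directive name="formatOutput" value="false"/>
</directives>
```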

GoogleCodeExporter commented 9 years ago
Indeed, having AntiSamy format your output will result in whitespace 
modification. I did notice discrepancies in the newline behavior, so I have made 
some changes to HEAD.

Original comment by arshan.d...@gmail.com on 15 Sep 2011 at 8:13