nahsra / antisamy

a library for performing fast, configurable cleansing of HTML coming from untrusted sources
BSD 3-Clause "New" or "Revised" License
190 stars 92 forks source link

AntiSamy converting single quotes to double quotes for font-family which is causing issue while rendering #200

Closed Gouravmoy closed 2 years ago

Gouravmoy commented 2 years ago

Hi,

I am using AntiSamy in my project. I am having an issue where AntiSamy is automatically converting single quotes to double quotes. This creates issues in case of rendering font family. Below is an example:

String dirtyContent = "<p>Test <span style=\"font-family: 'comic sans ms', sans-serif; color: #ba372a;\"><sup>~Nalla</sup></span></p>";
CleanResults cr = as.scan(dirtyContent, policy);
System.out.println("Dirty : " + dirtyContent);
System.out.println("Clean : " + cr.getCleanHTML());

Below is the response:

Dirty : <p>Test <span style="font-family: 'comic sans ms', sans-serif; color: #ba372a;"><sup>~Nalla</sup></span></p>
Clean : <p>Test <span style="font-family: &quot;comic sans ms&quot; , sans-serif;color: rgb(186,55,42);"><sup>~Nalla</sup></span></p>

The problem happens because of when I unescape this and send it to the UI, it becomes something like this <p>Test <span style="font-family: "comic sans ms" , sans-serif;color: rgb(186,55,42);"><sup>~Nalla</sup></span></p>

Now the UI is not able to read the font family correctly because of the double quotes. It terminates after <span style="font-family: ". This is causing issue.

I had asked the same question in SO some time ago - Link

The solution seems to be to either keep the font family in single quotes or keep the font family in double and whole style in single quotes. Reference

Please let me know if you need any more information from my side. Also thanks in advance for looking into the issue.

davewichers commented 2 years ago

@spassarop - Any ideas on whether we can fix this?

spassarop commented 2 years ago

Hi everyone!

I debugged to detect where those quotes are added. When transforming CSS "lexical units" to CSS properties in the final string buffer that is the actual output, this particular case of a name between double quotes is taken as a SAC_IDENT or string identifier. This automatically adds " to the identifier when it has spaces (like comic sans ms):

case LexicalUnit.SAC_IDENT:
    // just a string/identifier
    String stringValue = lu.getStringValue();
    if(stringValue.indexOf(" ") != -1)
        stringValue = "\""+stringValue+"\"";
    return stringValue;

A possible fix is to switch " to '. I ran all tests and only one fails because it expects " but would work anyway. I also tried changing to this: <span style='font-family: "comic sans ms", sans-serif; color: #ba372a;'> inverting the quotes, and the output is forced to have " on the outside and ' on the inside. So it works anyway.

Thinking about differences between being an inline style (as a HTML attribute) or being inside a whole style tag, I only could thought about having additional quotes surrounding the styles.

I could just adapt the only failing test case and apply the proposed fix. If being more complex, I could internally pass the info that the tag is inline or not. Then, if it is inline, surround with '. If it is not inline, surround with ". I don't think it's really necessary to do all that but if anyone can come up with a scenario where it is relevant, let me know here.

spassarop commented 2 years ago

@nahsra do you foresee any kind of conflict about this change?

dplovelybuddy commented 2 years ago

@spassarop any idea when this change is going to be published?

davewichers commented 2 years ago

@Gouravmoy @dplovelybuddy - This fix was just included in the v1.7.1 release that just went out.

dplovelybuddy commented 2 years ago

@davewichers Thanks for letting us know :)