nahsra / antisamy

a library for performing fast, configurable cleansing of HTML coming from untrusted sources
BSD 3-Clause "New" or "Revised" License
186 stars 89 forks source link

Selfclosing Break Line Tag <br /> tag in html content is converted into <br> open tag. #484

Open BloodDrag0n opened 1 month ago

BloodDrag0n commented 1 month ago

I am using Antisamy to sanitize HTML contents. During the parsing of the HTML Data, the
(self-closing) tag is converted into
(open tag). Is there any specific reason behind this behavior change? Is there any way to retain the
tags?

Eg Input Html Data: <p>this is para data</p><br/><p>this is para data 2</p>

Eg Output Html Data: <p>this is para data</p><br><p>this is para data 2</p>

HtmlUnit-Neko version using - 3.11.2 Antisamy version using - 1.7.5

davewichers commented 1 month ago

@spassarop @rbri - Can you look into this?

rbri commented 1 month ago
  @Test
  public void issue484() throws ScanException, PolicyException {
    String s = "<p>this is para data</p><br/><p>this is para data 2</p>";
    CleanResults crDom = as.scan(s, policy, AntiSamy.DOM);
    CleanResults crSax = as.scan(s, policy, AntiSamy.SAX);
    String domValue = crDom.getCleanHTML();
    String saxValue = crSax.getCleanHTML();
    assertEquals("<p>this is para data</p>\n"
            + "<br>\n"
            + "<p>this is para data 2</p>", domValue);
    assertEquals("<p>this is para data</p>\n"
            + "<br>\n"
            + "<p>this is para data 2</p>", saxValue);
  }

Wrote a short test (also one for neko (https://github.com/HtmlUnit/htmlunit-neko/commit/1d19ec2b97f1279a709459b84592fced2fda4c62)) but can't see why anything should have changed.

rbri commented 1 month ago

@BloodDrag0n do you think the output was different with other versions?

BloodDrag0n commented 1 month ago

I have been using Antisamy-1.6.4 with nekohtml-1.9.22, it does not had any change like this in
tag. Recently we have switched to Antisamy-1.7.5 with HtmlUnit-Neko-3.11.2

Where this behavior was found. This behavior is also seen for <img> and <a> tags.

In 1.6.4 - self closing tags like is converted into seperare open and close tags but in 1.7.5 - is converted into .

This doesn't caused any parsing issue for us, but some of our integration products demands either selfclosing (
) or open-close (
) tags.

rbri commented 1 month ago

did a bunch of tests with old versions - looks like merging these two commits into the 1.7.0 version are the reason for that behavior change.

image

BloodDrag0n commented 1 month ago

So the removal of XHTML usage has caused this right, is there a way to achieve this in the latest version?

BloodDrag0n commented 1 month ago

@rbri , This behavior is also seen for <img> and <a> tags.

tw-mcummings commented 1 month ago

We are running into this issue too. We do use XHTML and when we updated antisamy to fix a cve, we ran into this. Any chance of bring back XHTML support? Our current workaround is to create a custom antisamy jar with the code for XHTML added back in.

davewichers commented 1 month ago

@spassarop - Can we fix this without adding XHTML support back? That's probably an @rbri question. And even if we can, what would you think about adding XHTML support back, but have it OFF by default, so groups like @tw-mcummings could enable it if they need it? Does that make any sense to you?

spassarop commented 1 month ago

I checked the behavior from XHTML back in 2022. It was deprecated as a project decision because at the time it was related to a vulnerability and also it did not make much sense to keep supporting that as it is not HTML by the spec, it enforces XML structure.

Checking old emails I found this comments:

Changes XHTML removal has two side effects:

  • A minimal one that makes tags that were like "<parm x=1 />" before, now are output like "<parm x=1>".
  • Many HTML tags that are not mandatory to close, are not. The output serializer does not close tags like LI, UL, OPTION, etc. So outputs like "<ul><li>a</li></ul>" are now "<ul><li>a</ul>".

That was inside the custom HTML serializer, the only one left as the XHTML one was in process of being deleted. What I added to the code was HTMLdtd.isOptionalClosing(rawName) in the last condition of the if-block below:

if (rawName == null || !HTMLdtd.isOnlyOpening(rawName) || HTMLdtd.isOptionalClosing(rawName)) {
   if (_indenting && !state.preserveSpace && state.afterElement)
      _printer.breakLine();
   // Must leave CData section first (Illegal in HTML, but still)
   if (state.inCData)
      _printer.printText("]]>");
      _printer.printText("</");
      _printer.printText(state.rawName);
      _printer.printText('>');
}

To make it a bit more compatible and adapted the tests later. Those are @rbri marked commits on the picture. However, checking the XHTML serializer again, it shared a line with _printer.printText(" />"); right before the code block that decides the closing logic after writing the opening tag, but on XHTML is like that and in HTML is _printer.printText(">"). Also, in the HTML serializer code there is no condition to execute that line that considers if the tag requires closing or it is allowed to be empty. These are the code fragments on each serializer to print that closing characters:

// HTML serializer
if (state.empty) _printer.printText('>');
// instant closing tag logic assuming no content inside the tags
// ...
// XHTML serializer
if (state.empty && isAllowedEmptyTag(rawName) && !requiresClosingTag(rawName)) { 
  _printer.printText(" />");
} else {
  if (state.empty) _printer.printText('>');
  // instant closing tag logic assuming no content inside the tags
  //...

That is a big difference. What I did then, was copy the code and behavior to use the XHTML serializer fragment on the HTML serializer code. Then ran the tests to see if the ones associated with the vulnerabilities failed. They did not fail, which is good. The ones that failed did it due to the /> difference which I inserted on the change. Fixing them means reverting the tests to their 2022-state.

However, there are some considerations: 1- I don't see the difference of having the space in " />" and not in '>'. It seems to me the space can both be or not be there and it would work anyway. 2- I have no idea if this change has greater impact. I only ran against the whole test base to verify what I know. Also many tests check the absence or presence of strings rather than checking a strict HTML, format may vary and end up in invalid HTML. 3- One of the tests (testGithubIssue453) is significantly different as it expects to contain "<body> <table> <select name=\"Lang\"> <option value=\"da\">Dansk</option> " but it has <body xmlns="http://www.w3.org/1999/xhtml"> <table> <select name="Lang"> <option value="da">Dansk</option>. Maybe neko is adding that namespace? @rbri is there anything we can do about that?

Any comments? I mean, is looks like a good workaround that solves the initial issue, but with those small details.

rbri commented 1 month ago

@spassarop can you please make a PR out of your changes - that will it make easier to follow and to test

spassarop commented 1 month ago

Done. I was about to push adapted tests but I found out DOM parser is ignoring my change, it only affects SAX. I've been debugging a while and could not find out a way to make it work when invoking the serializer on AntiSamyDOMScanner.

So I only added a modified version of the test @rbri wrote to reflect that. At the moment, my change gives a partial solution and people would need to use SAX if they want the closing tag behavior as it is requested. If we leave it that way, some tests would need to be adapted to have the SAX and DOM version closing the tags differently. Even though, I don't understand why the DOM parser behaves like that if it did not before with the custom logic I added... maybe a serializer dependency changed on the last two years?

FYI: Deprecating DOM parser was on the table for a while. This might be a new reason to do so.

davewichers commented 1 month ago

@spassarop - I'll let you and @rbri figure out the best way to handle all this as you two are the experts on this stuff.

rbri commented 4 weeks ago

i think we have to rename 'issue484()' to 'testGithubIssue484()'

and i'm still a bit confused - issue484() passes for sax and dom - @spassarop i can't see the problem with the dom handling

spassarop commented 4 weeks ago

It’s not that the test fails because I just copied it. The thing is, for DOM it does not auto close single tags like
, SAX does with what I added. But when they used XHTML in the past they both behaved the same, adding /> and not just >. The discrepancy is what I see as a problem.

On Thu, 15 Aug 2024 at 13:09 RBRi @.***> wrote:

i think we have to rename 'issue484()' to 'testGithubIssue484()'

and i'm still a bit confused - issue484() passes for sax and dom - @spassarop https://github.com/spassarop i can't see the problem with the dom handling

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/issues/484#issuecomment-2291616112, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMIRDLIJQZCQBZ6S6TLZRTHDXAVCNFSM6AAAAABK6LKR32VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJRGYYTMMJRGI . You are receiving this because you were mentioned.Message ID: @.***>

rbri commented 4 weeks ago

@spassarop oh i was blind... sorry for the confusion

rbri commented 4 weeks ago

@spassarop

The trick is in class HTMLSerializer - they use an optimization when serializing dom nodes.

/**
 * Called to serialize a DOM element. Equivalent to calling {@link
 * #startElement}, {@link #endElement} and serializing everything
 * inbetween, but better optimized.
 */
protected void serializeElement( Element elem )
    throws IOException
{
.....

Instead of calling startElement/endElement they call this serializeElement method; therefor your patch for the endElement does not work (for the dom serialization).

You can copy the code of the method to our ASHTMLSerializer and massage the code a bit

  /**
   * Called to serialize a DOM element. Equivalent to calling {@link
   * #startElement}, {@link #endElement} and serializing everything
   * inbetween, but better optimized.
   */
  @Override
  protected void serializeElement( Element elem )
      throws IOException
  {
      Attr         attr;
      NamedNodeMap attrMap;
      int          i;
      Node         child;
      ElementState state;
      boolean      preserveSpace;
      String       name;
      String       value;
      String       tagName;

      tagName = elem.getTagName();
      state = getElementState();
      if ( isDocumentState() ) {
          // If this is the root element handle it differently.
          // If the first root element in the document, serialize
          // the document's DOCTYPE. Space preserving defaults
          // to that of the output format.
          if ( ! _started )
              startDocument( tagName );
      } else {
          // For any other element, if first in parent, then
          // close parent's opening tag and use the parnet's
          // space preserving.
          if ( state.empty )
              _printer.printText( '>' );
          // Indent this element on a new line if the first
          // content of the parent element or immediately
          // following an element.
          if ( _indenting && ! state.preserveSpace &&
               ( state.empty || state.afterElement ) )
              _printer.breakLine();
      }
      preserveSpace = state.preserveSpace;

      // Do not change the current element state yet.
      // This only happens in endElement().

      // XHTML: element names are lower case, DOM will be different
      _printer.printText( '<' );
      _printer.printText( tagName );
      _printer.indent();

      // Lookup the element's attribute, but only print specified
      // attributes. (Unspecified attributes are derived from the DTD.
      // For each attribute print it's name and value as one part,
      // separated with a space so the element can be broken on
      // multiple lines.
      attrMap = elem.getAttributes();
      if ( attrMap != null ) {
          for ( i = 0 ; i < attrMap.getLength() ; ++i ) {
              attr = (Attr) attrMap.item( i );
              name = attr.getName().toLowerCase(Locale.ENGLISH);
              value = attr.getValue();
              if ( attr.getSpecified() ) {
                  _printer.printSpace();
                  // HTML: Empty values print as attribute name, no value.
                  // HTML: URI attributes will print unescaped
                  if ( value == null ) {
                      value = "";
                  }
                  if ( !_format.getPreserveEmptyAttributes() && value.length() == 0 )
                      _printer.printText( name );
                  else if ( HTMLdtd.isURI( tagName, name ) ) {
                      _printer.printText( name );
                      _printer.printText( "=\"" );
                      _printer.printText( escapeURI( value ) );
                      _printer.printText( '"' );
                  } else if ( HTMLdtd.isBoolean( tagName, name ) )
                      _printer.printText( name );
                  else {
                      _printer.printText( name );
                      _printer.printText( "=\"" );
                      printEscaped( value );
                      _printer.printText( '"' );
                  }
              }
          }
      }
      if ( HTMLdtd.isPreserveSpace( tagName ) )
          preserveSpace = true;

      // If element has children, or if element is not an empty tag,
      // serialize an opening tag.
      if ( elem.hasChildNodes() || ! HTMLdtd.isEmptyTag( tagName ) ) {
          // Enter an element state, and serialize the children
          // one by one. Finally, end the element.
          state = enterElementState( null, null, tagName, preserveSpace );

          // Prevents line breaks inside A/TD
          if ( tagName.equalsIgnoreCase( "A" ) || tagName.equalsIgnoreCase( "TD" ) ) {
              state.empty = false;
              _printer.printText( '>' );
          }

          // Handle SCRIPT and STYLE specifically by changing the
          // state of the current element to CDATA (XHTML) or
          // unescaped (HTML).
          if ( tagName.equalsIgnoreCase( "SCRIPT" ) ||
               tagName.equalsIgnoreCase( "STYLE" ) ) {
              // HTML: Print contents unescaped
              state.unescaped = true;
          }
          child = elem.getFirstChild();
          while ( child != null ) {
              serializeNode( child );
              child = child.getNextSibling();
          }
          endElementIO( null, null, tagName );
      } else {
          _printer.unindent();
          // XHTML: Close empty tag with ' />' so it's XML and HTML compatible.
          // HTML: Empty tags are defined as such in DTD no in document.
          if (!elem.hasChildNodes() && isAllowedEmptyTag(tagName) && !requiresClosingTag(tagName))
              _printer.printText( "/>" );
          else
              _printer.printText( '>' );
          // After element but parent element is no longer empty.
          state.afterElement = true;
          state.empty = false;
          if ( isDocumentState() )
              _printer.flush();
      }
  }

This fixes at least the current problem.

rbri commented 4 weeks ago

We really have to think about using our own independent serializer - i guess it might be possible to carve out the serializing code from xerces using some hours.

But i have no idea what does that mean regarding copyright etc.

spassarop commented 3 weeks ago

I applied your suggestions to the current PR, everything seems consistent and working now.

Regarding having an AntiSamy serializer, of course it would be the best option in terms of controlling the output shape. The cons are the one you say about legal stuff, and that other issues may arise with time as it happens with Neko and we may not be aware. Someone should be checking periodically if the reused code is affected and make fixes fit. Trade offs I guess...