radkovo / jStyleParser

jStyleParser is a CSS parser written in Java. It has its own API, designed for efficient CSS processing in Java and for mapping CSS values to Java data types. It parses CSS 2.1 style sheets into structures that can be efficiently assigned to DOM elements. It is intended to be the primary CSS parser for the CSSBox library. When handling errors, it behaves as a conforming user agent according to the CSS specification.
http://cssbox.sourceforge.net/jstyleparser/
GNU Lesser General Public License v3.0

StackOverflowError using 3.0 #74

Open 942u3895hjf opened 7 years ago

942u3895hjf commented 7 years ago

We're integrating jStyleParser, and when running hundreds of HTML parse tests, I get a StackOverflowError when analyzing a specific stylesheet. I have the HTML but cannot attach it to this issue, so I can send it on request.

Stack trace:

    java.lang.StackOverflowError
        at __randomizedtesting.SeedInfo.seed([CDDE6B651FEC8B45:8F931EECCE17E8E5]:0)
        at java.lang.String.indexOf(String.java:1503)
        at java.lang.String.split(String.java:2338)
        at java.lang.String.split(String.java:2422)
        at cz.vutbr.web.csskit.ElementMatcherSafeCS.elementClasses(ElementMatcherSafeCS.java:46)
        at cz.vutbr.web.domassign.Analyzer.assignDeclarationsToElement(Analyzer.java:256)
        at cz.vutbr.web.domassign.Analyzer$2.processNode(Analyzer.java:214)
        at cz.vutbr.web.domassign.Analyzer$2.processNode(Analyzer.java:211)
        at cz.vutbr.web.domassign.Traversal.levelTraversal(Traversal.java:53)
        at cz.vutbr.web.domassign.Traversal.levelTraversal(Traversal.java:58)
        at cz.vutbr.web.domassign.Traversal.levelTraversal(Traversal.java:58)
        .....

I narrowed it down to two stylesheets on the specific page, but when loading those stylesheets in other HTML pages, the error didn't show up.

radkovo commented 6 years ago

I may take a look. Would you send the relevant HTML to radkovo@users.sf.net? Thanks!

942u3895hjf commented 6 years ago

Thanks, I have sent the file.

radkovo commented 6 years ago

From your e-mail:

I have reduced the reproducing code to just CSSFactory:

    try {
      StyleSheet stylesheet = CSSFactory.parse("/path/to/file.html", "UTF-8");
      Analyzer stylesheetAnalyzer = new Analyzer(stylesheet);
      return stylesheetAnalyzer.evaluateDOM(super.document, CSSFactory.getAutoImportMedia(), true);
    } catch (Exception e) {}

It seems to me that you are trying to parse an HTML document with a CSS parser. You need to parse the HTML file into a DOM first and then call evaluateDOM() on that DOM (I am not sure what super.document contains in your example). See the relevant tests for an example.

942u3895hjf commented 6 years ago

I am sorry, that was indeed the wrong code. This is the right one:

    try {
      return CSSFactory.assignDOM(super.document, "UTF-8", new URL("https://example.org/"), CSSFactory.getAutoImportMedia(), true);
    } catch (Exception e) {
      LOG.error("Could not evaluate DOM because: " + e.getMessage(), e);
    }

super.document is the org.w3c.dom.Document representation of the file I sent you. All of the other 800+ tests run fine, though.

radkovo commented 6 years ago

I used the DOMSource for obtaining the DOM but I am not able to reproduce the error. I obtain the style map correctly. How do you obtain the DOM? Are you able to do some more debugging? E.g. obtaining the argument values of the split() call?

942u3895hjf commented 6 years ago

Hello,

I use a very simple SAX ContentHandler to build a DOM structure, but I could try letting a SAX parser do it for me if it isn't too much of a performance drag.

Which split() call are you referring to? I don't explicitly execute a split() call myself.

Thanks!

radkovo commented 6 years ago

I mean the fourth line in the stack trace. It seems that the stack overflow occurs during the String split() operation. So it would be interesting to know what string we are actually trying to split. Moreover, I have no idea what __randomizedtesting is in the stack trace. Does it correspond to anything in your project?
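
The trace shows Traversal.levelTraversal calling itself repeatedly, so one thing worth ruling out is that the DOM handed to the analyzer is unusually deep. The sketch below is a diagnostic aid and not part of jStyleParser: the class DomDepth and its methods parse and maxDepth are hypothetical names, and it assumes the DOM is a proper tree of org.w3c.dom.Node objects. It measures nesting depth with an explicit stack, so the check itself cannot overflow.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for debugging; not part of jStyleParser.
class DomDepth {

    /** Parses well-formed XML/XHTML into a DOM; checked exceptions are wrapped. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    /** Returns the deepest nesting level below root, counting root itself as level 0. */
    static int maxDepth(Node root) {
        // Two parallel stacks: the node to visit and the depth at which it sits.
        Deque<Node> nodes = new ArrayDeque<>();
        Deque<Integer> depths = new ArrayDeque<>();
        nodes.push(root);
        depths.push(0);
        int max = 0;
        while (!nodes.isEmpty()) {
            Node node = nodes.pop();
            int depth = depths.pop();
            max = Math.max(max, depth);
            for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
                nodes.push(child);
                depths.push(depth + 1);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><div class=\"a b\"><p>text</p></div></body></html>");
        System.out.println("max depth: " + maxDepth(doc));
    }
}
```

Running maxDepth on the failing document and on one of the documents that works would show whether nesting depth alone separates the two cases.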

942u3895hjf commented 6 years ago

The __randomizedtesting package just wraps our tests and randomizes some settings; it does not affect this test. Early next week I'll compile the sources and report back what is passed to the split() call.

Thanks! Have a nice weekend!

942u3895hjf commented 6 years ago

Hello, I added some logging just before the split, simply System.out.println(classNames); It turns out that split() is called over 3390 times! I think the parser ends up in endless recursion. The class names below form a set that is repeated 200 times before the exception occurs.

right reviews_image rev_name review_content reviews_profile rev_client rev_profile toplink toplinktext clearer rev_shadow review_right rev_shadow_right clearer reviews_item review_left reviews_box
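
A class-name sequence that repeats about 200 times is consistent with a page that really nests or repeats that block many times, but also with a DOM whose child or sibling links loop back on themselves. Because the DOM here is built by a custom ContentHandler, the second case may be worth checking. The sketch below is only a diagnostic idea under that assumption: DomCycleCheck, parse and firstRevisited are hypothetical names, not jStyleParser API. It walks first-child and next-sibling links iteratively and reports the first node it reaches twice.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for debugging; not part of jStyleParser.
class DomCycleCheck {

    /** Parses well-formed XML/XHTML into a DOM; checked exceptions are wrapped. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    /**
     * Follows first-child and next-sibling links below root and returns the
     * first node that is reached a second time, or null if every node is
     * reached exactly once (which is the case for a well-formed tree).
     */
    static Node firstRevisited(Node root) {
        // Identity-based set: nodes are compared by reference, not equals().
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> pending = new ArrayDeque<>();
        seen.add(root);
        if (root.getFirstChild() != null) {
            pending.push(root.getFirstChild());
        }
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            if (!seen.add(node)) {
                return node; // reached twice: the links do not form a tree
            }
            if (node.getNextSibling() != null) {
                pending.push(node.getNextSibling());
            }
            if (node.getFirstChild() != null) {
                pending.push(node.getFirstChild());
            }
        }
        return null;
    }
}
```

If firstRevisited(document) returns a node for the hand-built DOM but null for a DOM produced by a standard parser from the same file, that would point at the DOM construction rather than at the stylesheet.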