radkovo / jStyleParser

jStyleParser is a CSS parser written in Java. It has its own application interface that is designed to allow an efficient CSS processing in Java and mapping the values to the Java data types. It parses CSS 2.1 style sheets into structures that can be efficiently assigned to DOM elements. It is intended be the primary CSS parser for the CSSBox library. While handling errors, it is user agent conforming according to the CSS specification.
http://cssbox.sourceforge.net/jstyleparser/
GNU Lesser General Public License v3.0
92 stars 49 forks source link

Consider standardizing on pre-conversion to lower case strings where possible #29

Open hrj opened 9 years ago

hrj commented 9 years ago

The method String.toLowerCase is used at multiple places in the code, and gets called a lot of times.

Sometimes, these calls are avoidable. For example, when a selector stores a class name, it can convert the class name to lower-case when the string is stored, rather than when the string is being matched.

Caveats

We have to be careful in this conversion, because the compiler can't check whether we are doing the right thing. In a way, the current implementation is more robust because it always calls .toLowerCase() and the logic is easy to understand.

One way out of this could be to follow a naming convention. Whenever a string is stored as lower-case, the field name can indicate that. For example, instead of defining a field called className we can define it as classNameLC. (LC == Lower Case)

Side-note

The attributes in DOM specs are case-sensitive, while the selector names in CSS spec are insensitive in quirks mode. In fully-standard compliant mode, the selectors in CSS are supposed to be case-sensitive.

It might be possible to take a stand and make the entire library fully-standards compliant, and thus avoid the conversions to lower case entirely.

Side-note 2

There is a new selector in Selector Level 4 Spec, which explicitly defines a case-sensitive selector (even if the User agent is in quirks mode).

hrj commented 9 years ago

This has blossomed into a bigger issue, and I have lots more thoughts about it.

Firstly, apart from the side-note about quirks mode in browsers, there is another consideration: In XHTML, the element tag names are case-sensitive, but not in HTML. This is causing many failures in our test-suite.

Secondly, I have a better idea than storing differently-cased copies of strings. I have tried a quick POC and it gives a huge performance boost, but it will require a big change in API.

The idea is to enrich the w3c interfaces to add more methods that are relevant to jStyleParser. Imagine an interface like this:

interface EnrichedElement extends org.w3c.dom.Element {
  public boolean matchTagName(String name);
}

Throughout the jStyleParser code, the EnrichedElement interface would be used instead of Element, and its methods will be called to do the actual matching. This will allow the element to determine the best way to match the name, according to the doctype, quirks mode, etc.

The same concept would be applied to attributes, nodes, documents, etc.


If this is too big a change, then a workaround is to use the standard Element everywhere, but typecast it to EnrichedElement after an instance of check. I already have a proof-of-concept of this approach. It gives a performance boost owing to reduced number of .toLowerCase() calls, but is slower than optimal, because of the instance of checks and type casts.

radkovo commented 8 years ago

I have been thinking about this a bit and it seems to me that we need different matching strategies depending on the used DOM implementation and the document being processed. I would like to preserve the possibility of processing a DOM coming from any source (parser). The idea of EnrichedElement is nice but not compatible with the DOM standard. On the other hand, different parsers allow different optimizations. E.g. the NekoHTML parser used by CSSBox is able to convert all the element and attribute names to lowercase; assuming this would simplify the matching.

Currently, most of the related code is concentrated in the static ElementUtil class. So would it make sense to make this class configurable? E.g. implement something like CSSFactory.registerElementMatcher() similarly to other configurable implementations.

This would allow to provide an optimized implementation of the matching routines according to the used DOM source. We could provide a standard (bulletproof) DOM support (both case-sensitive and case-insensitive) as well as a simplified version assuming lowercase names and even the version with the EnrichedElement support if some custom matching is required.

What's your opinion?

hrj commented 8 years ago

So would it make sense to make this class configurable? E.g. implement something like CSSFactory.registerElementMatcher() similarly to other configurable implementations.

Yes, this sounds like a good solution. :+1:

The drawback is that it would require typecasts in the enriched matcher's implementation, but the typecasts don't need an instanceof check and the cost for typecasts is much less than .toLowerCase().

Let's go for it.

radkovo commented 8 years ago

I have pushed a first implementation to the matching branch. I have also implemented a few different matchers (case sensitive and insensitive). It's prepared for further performance experiments.

hrj commented 8 years ago

@radkovo Thanks; it looks awesome. I am trying to benchmark the branch right now. Some quick thoughts:

hrj commented 8 years ago

One more thought:

radkovo commented 8 years ago

Can the contract for nullability be defined on the arguments? For example, in ElementMatcher.matchesName(Element, String) the second argument seems to be nullable. This requires an additional check in the code, so it would be nice to be either specified or eliminated.

Ok, I couldn't find any reason why null is being checked here. I have removed the null check.

The ElementMatcher needs to be specified globally currently. But if the browser is having different contexts each needing different matchers, it will cause problems. I don't know what would be the optimal point to specify it though; perhaps at the Analyser instance?

I agree. However, this requires to pass the matcher to the selectors when they are being matched. I have added the same mechanism that is used for MatchCondition passing (for matching pseudo classes). There is a new Analyzer.registerElementMatcher() method for that.

Actually the ElementMatcher and MatchCondition seem to be very similar mechanisms now. It makes me think whether we should join them to a signle one (e.g. to move the pseudo class matching to ElementMatcher and remove the MatchCondition completely). On the other hand, the pseudo class matching typically changes dynamically where the element matching remains stable for a single displayed page. So probably it makes sense to keep them separated.

ElementClasses.elementClasses() could return an Iterator instead of Collection since it is being used only for iteration. An Iterator can be implemented more efficiently.

You're right. On the other hand, since there are typically a few classes (up to 4) specified for the element, I mean the performance gain wouldn't be even measurable. And I find the Collection interface a bit nicer (more general). However, if you think the performance gain would be significant for some reason, we can do that.