Open hrj opened 9 years ago
This has blossomed into a bigger issue, and I have lots more thoughts about it.
Firstly, apart from the side-note about quirks mode in browsers, there is another consideration: In XHTML, the element tag names are case-sensitive, but not in HTML. This is causing many failures in our test-suite.
Secondly, I have a better idea than storing differently-cased copies of strings. I have tried a quick POC and it gives a huge performance boost, but it will require a big change in API.
The idea is to enrich the w3c interfaces to add more methods that are relevant to jStyleParser. Imagine an interface like this:
interface EnrichedElement extends org.w3c.dom.Element {
public boolean matchTagName(String name);
}
Throughout the jStyleParser code, the EnrichedElement
interface would be used instead of Element
, and its methods will be called to do the actual matching. This will allow the element to determine the best way to match the name, according to the doctype, quirks mode, etc.
The same concept would be applied to attributes, nodes, documents, etc.
If this is too big a change, then a workaround is to use the standard Element
everywhere, but typecast it to EnrichedElement
after an instance of
check. I already have a proof-of-concept of this approach. It gives a performance boost owing to reduced number of .toLowerCase()
calls, but is slower than optimal, because of the instance of
checks and type casts.
I have been thinking about this a bit and it seems to me that we need different matching strategies depending on the used DOM implementation and the document being processed. I would like to preserve the possibility of processing a DOM coming from any source (parser). The idea of EnrichedElement
is nice but not compatible with the DOM standard. On the other hand, different parsers allow different optimizations. E.g. the NekoHTML parser used by CSSBox is able to convert all the element and attribute names to lowercase; assuming this would simplify the matching.
Currently, most of the related code is concentrated in the static ElementUtil class. So would it make sense to make this class configurable? E.g. implement something like CSSFactory.registerElementMatcher()
similarly to other configurable implementations.
This would allow to provide an optimized implementation of the matching routines according to the used DOM source. We could provide a standard (bulletproof) DOM support (both case-sensitive and case-insensitive) as well as a simplified version assuming lowercase names and even the version with the EnrichedElement
support if some custom matching is required.
What's your opinion?
So would it make sense to make this class configurable? E.g. implement something like CSSFactory.registerElementMatcher() similarly to other configurable implementations.
Yes, this sounds like a good solution. :+1:
The drawback is that it would require typecasts in the enriched matcher's implementation, but the typecasts don't need an instanceof check and the cost for typecasts is much less than .toLowerCase()
.
Let's go for it.
I have pushed a first implementation to the matching branch. I have also implemented a few different matchers (case sensitive and insensitive). It's prepared for further performance experiments.
@radkovo Thanks; it looks awesome. I am trying to benchmark the branch right now. Some quick thoughts:
ElementMatcher.matchesName(Element, String)
the second argument seems to be nullable. This requires an additional check in the code, so it would be nice to be either specified or eliminated.One more thought:
ElementClasses.elementClasses()
could return an Iterator
instead of Collection
since it is being used only for iteration. An Iterator
can be implemented more efficiently.Can the contract for nullability be defined on the arguments? For example, in
ElementMatcher.matchesName(Element, String)
the second argument seems to be nullable. This requires an additional check in the code, so it would be nice to be either specified or eliminated.
Ok, I couldn't find any reason why null
is being checked here. I have removed the null
check.
The ElementMatcher needs to be specified globally currently. But if the browser is having different contexts each needing different matchers, it will cause problems. I don't know what would be the optimal point to specify it though; perhaps at the Analyser instance?
I agree. However, this requires to pass the matcher to the selectors when they are being matched. I have added the same mechanism that is used for MatchCondition
passing (for matching pseudo classes). There is a new Analyzer.registerElementMatcher()
method for that.
Actually the ElementMatcher
and MatchCondition
seem to be very similar mechanisms now. It makes me think whether we should join them to a signle one (e.g. to move the pseudo class matching to ElementMatcher
and remove the MatchCondition
completely). On the other hand, the pseudo class matching typically changes dynamically where the element matching remains stable for a single displayed page. So probably it makes sense to keep them separated.
ElementClasses.elementClasses()
could return anIterator
instead ofCollection
since it is being used only for iteration. AnIterator
can be implemented more efficiently.
You're right. On the other hand, since there are typically a few classes (up to 4) specified for the element, I mean the performance gain wouldn't be even measurable. And I find the Collection
interface a bit nicer (more general). However, if you think the performance gain would be significant for some reason, we can do that.
The method
String.toLowerCase
is used at multiple places in the code, and gets called a lot of times.Sometimes, these calls are avoidable. For example, when a selector stores a class name, it can convert the class name to lower-case when the string is stored, rather than when the string is being matched.
Caveats
We have to be careful in this conversion, because the compiler can't check whether we are doing the right thing. In a way, the current implementation is more robust because it always calls
.toLowerCase()
and the logic is easy to understand.One way out of this could be to follow a naming convention. Whenever a string is stored as lower-case, the field name can indicate that. For example, instead of defining a field called
className
we can define it asclassNameLC
. (LC == Lower Case)Side-note
The attributes in DOM specs are case-sensitive, while the selector names in CSS spec are insensitive in quirks mode. In fully-standard compliant mode, the selectors in CSS are supposed to be case-sensitive.
It might be possible to take a stand and make the entire library fully-standards compliant, and thus avoid the conversions to lower case entirely.
Side-note 2
There is a new selector in Selector Level 4 Spec, which explicitly defines a case-sensitive selector (even if the User agent is in quirks mode).