Closed betehess closed 10 years ago
I guess you can already imagine I'd think this would be a really good idea. But for the record, note that the SAX parser that validator.nu uses is not strictly part of the validator.nu code and is not closely bound to any other part of it. It's separate, discrete code (physically separate, in a completely different version-control repo).
I've been experimenting with the validator.nu SAX parser [0]. However I can't get my head around the SAX API.
Should I be making an implementation of org.xml.sax.ContentHandler, and then using that to iterate through each element whilst extracting the elements I need? This is a very different method from how I've used the last two HTML parsers. I think it would require a redesign of the i18n-checker's parsing component.
I don't have a lot of experience with these things, so perhaps I'm missing something.
[0] Hopefully we're both referring to nu.validator.htmlparser.sax.HtmlParser, which is mirrored here: http://github.com/validator/htmlparser/blob/master/src/nu/validator/htmlparser/sax/HtmlParser.java
@JosephJShort Yes, I think you'd want to be making an implementation of org.xml.sax.ContentHandler and using that to consume the elements. As I guess you can tell, it's a streaming event-based API that doesn't do anything more than expose the parse events to you, and leaves it up to you what to do with those in your application.
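To make that concrete, here is a minimal sketch of the handler pattern. It is only an illustration, not code from the i18n-checker: the class name LangAttributeCollector and the choice to collect lang attributes are made up for the example, and it uses the JDK's built-in XML SAX parser on well-formed input so that it runs without extra dependencies. As far as I know, nu.validator.htmlparser.sax.HtmlParser implements org.xml.sax.XMLReader, so it should be possible to swap it in where the reader is created, but treat that as an assumption to check against the parser's own documentation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records every element that carries a lang attribute.
// DefaultHandler implements ContentHandler with empty methods, so only the
// events of interest need to be overridden.
class LangAttributeCollector extends DefaultHandler {
    private final List<String> found = new ArrayList<>();

    // Called by the parser once for every start tag it encounters.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String lang = atts.getValue("lang");
        if (lang != null) {
            found.add(qName + "=" + lang);
        }
    }

    // Parses the markup and returns the "element=lang" pairs in document order.
    static List<String> collect(String markup) throws Exception {
        // Stand-in reader: the JDK's XML parser (assumption: an HTML-aware
        // XMLReader could be created here instead).
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LangAttributeCollector handler = new LangAttributeCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.found;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html lang=\"en\"><body><p lang=\"fr\">Bonjour</p><p>Hello</p></body></html>";
        System.out.println(collect(doc));
    }
}
```

The point of the design is that the handler never sees a tree: the parser pushes events at it, and any state you need (here, the list of matches) is accumulated by the handler itself as the document streams past.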
As for the existing parsing component in the i18n-checker, I've never looked at it before, but taking a quick glance at the code in your link, I see net.htmlparser.jericho, which I've never heard of before. And looking through the net.htmlparser.jericho.StartTag stuff in the code, it looks really.. odd. It doesn't seem to be based on any kind of standard or widely-used API (the W3C DOM or whatever) that I'm aware of.
I really don't know what advantages that net.htmlparser.jericho API is meant to have, but I believe that replacing it with the nu.validator.htmlparser.sax.HtmlParser would absolutely be an improvement regardless.
BTW, the canonical home for the nu.validator.htmlparser parser is at http://hg.mozilla.org/projects/htmlparser
@sideshowbarker
you'd want to be making an implementation of org.xml.sax.ContentHandler [...] it's a streaming event-based API that doesn't do anything more than just expose the parse events to you
Thanks very much for the pointers. I'll think about the new design.
I really don't know what advantages that net.htmlparser.jericho API is meant to have
When I was programming the checker, Jericho was the only HTML parser I could find that returns verbatim contexts (rather than 'cleaned up' versions of tags). It also appears to do its job correctly. But I won't defend it beyond that. I'm new to HTML parsing and I don't claim to know what's best.
but I believe that replacing it with the nu.validator.htmlparser.sax.HtmlParser would absolutely be an improvement regardless.
Indeed!
FYI I'm buried in university work right now. I'm sad to say that this issue may have to wait for a couple of weeks before receiving my attention. I hope this isn't a bar to anybody's progress.
Joe
@JosephJShort there's no urgency on my part at least. Whenever you do have time to get back to it, I'm happy to help with answering questions when you need it.
[Project was discontinued.]
validator.nu offers a SAX parser for HTML documents. It follows the specification very closely and has good performance. It would be nice to use it instead of the current one.