Investigate using validator.nu SAX parser

betehess commented 11 years ago

validator.nu offers a SAX parser for HTML documents. It follows the specification very closely and has good performance. It would be nice to use it instead of the current one.

sideshowbarker commented 11 years ago

I guess you can already imagine I'd think this is a really good idea. But for the record, note that the SAX parser that validator.nu uses is not strictly part of the validator.nu code and is not closely bound to any other part of it. It's separate, discrete code (physically separate, in a completely different version-control repo).

JosephJShort commented 11 years ago

I've been experimenting with the validator.nu SAX parser [0]. However, I can't get my head around the SAX API.

Should I be making an implementation of org.xml.sax.ContentHandler, and then using that to iterate through each element whilst extracting the elements I need? This is a very different method from how I've used the last two HTML parsers, and I think it would require a redesign of the i18n-checker's parsing component.

I don't have a lot of experience with these things, so perhaps I'm missing something.

[0] Hopefully we're both referring to nu.validator.htmlparser.sax.HtmlParser, which is mirrored here: http://github.com/validator/htmlparser/blob/master/src/nu/validator/htmlparser/sax/HtmlParser.java

sideshowbarker commented 11 years ago

@JosephJShort Yes, I think you'd want to be making an implementation of org.xml.sax.ContentHandler and using that to consume the elements. As I guess you can tell, it's a streaming event-based API that doesn't do anything more than expose the parse events to you, and it leaves it up to you what to do with those in your application.
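
For illustration, here is a minimal sketch of that handler pattern. Everything specific in it is an assumption made for the example: the class name LangAttributeCollector is hypothetical, and collecting lang attributes is just a plausible stand-in for whatever the checker actually needs. To keep the sketch runnable with the JDK alone, it drives the handler with the JDK's built-in SAX XMLReader, which only accepts well-formed XML. As far as I know, nu.validator.htmlparser.sax.HtmlParser also implements org.xml.sax.XMLReader, so it should be possible to substitute it for the reader created in collect() in order to handle real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records "element=lang" for every start tag that
// carries a lang attribute. DefaultHandler implements ContentHandler with
// empty methods, so only the events of interest need overriding.
class LangAttributeCollector extends DefaultHandler {
    final List<String> found = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Called once per start tag, in document order.
        String lang = atts.getValue("lang");
        if (lang != null) {
            found.add(qName + "=" + lang);
        }
    }

    // Parses the markup and returns what the handler collected.
    static List<String> collect(String markup) throws Exception {
        // Assumption: the JDK's XML reader is used here as a stand-in;
        // any other XMLReader implementation could be plugged in instead.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LangAttributeCollector handler = new LangAttributeCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.found;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html lang=\"en\"><body><p lang=\"fr\">Bonjour</p><p>Hi</p></body></html>";
        System.out.println(collect(page)); // prints [html=en, p=fr]
    }
}
```

The point of the design is that the parser pushes events at the handler rather than handing back a tree to walk, so any state the application needs (here, the found list) has to be accumulated inside the handler as the events arrive.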

As for the existing parsing component in the i18n-checker, I've never looked at it before, but taking a quick glance at the code in your link, I see net.htmlparser.jericho, which I've never heard of before. And looking through the net.htmlparser.jericho.StartTag stuff in the code, it looks really... odd. It doesn't seem to be based on any kind of standard or widely used API (the W3C DOM or whatever) that I'm aware of.

I really don't know what advantages that net.htmlparser.jericho API is meant to have, but I believe that replacing it with the nu.validator.htmlparser.sax.HtmlParser would absolutely be an improvement regardless.

BTW, the canonical home for the nu.validator.htmlparser parser is at http://hg.mozilla.org/projects/htmlparser

JosephJShort commented 11 years ago

@sideshowbarker

> you'd want to be making an implementation of org.xml.sax.ContentHandler [...] it's a streaming event-based API that doesn't do anything more than just expose the parse events to you

Thanks very much for the pointers. I'll think about the new design.

> I really don't know what advantages that net.htmlparser.jericho API is meant to have

When I was programming the checker, Jericho was the only HTML parser I could find that returns verbatim contexts (rather than 'cleaned up' versions of tags). It also appears to do its job correctly. But I won't defend it beyond that. I'm new to HTML parsing and I don't claim to know what's best.

> but I believe that replacing it with the nu.validator.htmlparser.sax.HtmlParser would absolutely be an improvement regardless.

Indeed!

JosephJShort commented 11 years ago

FYI I'm buried in university work right now. I'm sad to say that this issue may have to wait for a couple of weeks before receiving my attention. I hope this isn't a bar to anybody's progress.

Joe

sideshowbarker commented 11 years ago

@JosephJShort there's no urgency on my part at least. Whenever you do have time to get back to it, I'm happy to help with answering questions when you need it.

JosephJShort commented 10 years ago

[Project was discontinued.]

w3c / i18n-checker-java