soulcutter / saxerator

A SAX-based XML parser for parsing large files into manageable chunks
MIT License
128 stars 19 forks source link

Ox SAX Parser not Raising Errors #55

Closed mgeyer closed 7 years ago

mgeyer commented 7 years ago

Hi There,

I was debugging some fault XML (invalid byte sequence for UTF-8) and noticed that Saxerator just ignored the error and stopped processing. It would process all nodes successfully up to that point, and then stop finding more elements.

At first I thought that was Saxerator having an issue, but upon closer inspection it turns out that is how the Ox SAX parser is configured by default. In short, unless an error method is defined for the Ox SAX parser it will silently close the remaining tags. You can check out (https://github.com/ohler55/ox/issues/166) to get more details.

How do people feel about adding the error method and raising an exception when an error is found by the Ox parser?

fanantoxa commented 7 years ago

@mgeyer I'll research about that for OX, but also take a look how Rexml and Nokogiri will work on this kind of situation. Response with proposals after. Shouldn't take much time.

@soulcutter What do you think about it?

mgeyer commented 7 years ago

@fanantoxa Thanks for taking a look :) I would hope that both Rexml and Nokogiri at least have support for throwing errors on parsing errors. If they do then we should make sure that either all adapters throw errors or none do.

soulcutter commented 7 years ago

I think it makes sense for Saxerator to implement the #error method and raise an error.

fanantoxa commented 7 years ago

@soulcutter, @mgeyer Nokogiri and Ox have a error calbek. I have to test manualy with REXML, since error calback not documenter. Same for Added recently oga parser

soulcutter commented 7 years ago

I'm not -too- concerned that all parsers raise errors (or none of them) - I think the expectation is that different parsers may have slightly different behavior based on what they provide. If we can raise an error, I would like to, though, since I think it's a better practice than being silent.

mgeyer commented 7 years ago

Agreed, or at least a way to configure the parsers to raise errors.

fanantoxa commented 7 years ago

@mgeyer , @soulcutter Yep. Same for me. As I understand Oga raise and exception on error. Didn't check REXML yet, but it might be the same. I think the best idea here would be to catch this exceptions (or add error callback for first 2 parser) and raise our exception like Saxerator::DataCorruptedException. In this case any parser on error would behave the same. How you look on that?

fanantoxa commented 7 years ago

@mgeyer, @soulcutter Finished in master

mgeyer commented 7 years ago

Thanks for the hard work @fanantoxa :)