soulcutter / saxerator

A SAX-based XML parser for parsing large files into manageable chunks
MIT License

Implement Ox support #2

Closed soulcutter closed 8 years ago

soulcutter commented 11 years ago

It would be nice to support parsers other than Nokogiri. Ox in particular is supposed to have great performance, and so would be a good first candidate.

jalberto commented 10 years ago

This looks like a very interesting feature. After some testing, Ox is way faster than Nokogiri but less feature-complete, especially when it comes to XPath. Its SAX interface, however, looks quite similar to Nokogiri's.

Is there somebody working on this already?

soulcutter commented 10 years ago

No, this has been on the backburner for a while - I just haven't had a reason to revisit it, though it's pretty much what's holding up a 1.0 release

jalberto commented 10 years ago

ox-mapper could be a good start:

https://github.com/take-five/ox-mapper

doomspork commented 9 years ago

I'm currently replacing sax-machine with saxerator in a project of mine and if all goes well I'd love to try my hand at this ox support.

soulcutter commented 9 years ago

:+1:

fanantoxa commented 8 years ago

@soulcutter, @jalberto Hi guys, I've started implementing an Ox parser for saxerator. You can find the commits here: https://github.com/fanantoxa/saxerator/commits/implementing-ox-parser This is a first rough draft and will have to be refactored later, but first I want to get the code working. Could you help me with it? Parsing strings works well now, but I have some problems with file parsing (it won't parse nested items). I'd be very happy if you could help with the implementation or just offer some suggestions.

soulcutter commented 8 years ago

#33 is closer to what I had in mind (although totally broken in its current state)

Rather than duplicating everything to get Ox in there, I took the approach of extracting the most basic interface that would work (https://github.com/soulcutter/saxerator/blob/extract-adapter/lib/saxerator/sax_handler.rb) and moving all references to Nokogiri into https://github.com/soulcutter/saxerator/blob/extract-adapter/lib/saxerator/adapters/nokogiri.rb

In its current state I have not yet even begun the ox handler, but I'm hoping it would be fairly simple (although I may have to tweak the SaxHandler interface depending on how it reads attributes, but one thing at a time)

soulcutter commented 8 years ago

Hey, I got the adapter working for nokogiri! Do you think that's a solid-enough basis for adding ox support?

fanantoxa commented 8 years ago

@soulcutter Looks good, but it might not be enough. Different parsers have different capabilities. Actually, Ox is lighter than Nokogiri. I'll take a look at the code tomorrow.

soulcutter commented 8 years ago

IINM, the tricky thing with Ox will be how it parses attributes, but I think there should be a way to collect those before triggering a start_element(name, attrs)
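One way to picture the attribute-collecting idea is the minimal sketch below. It assumes Ox-style callbacks (`start_element(name)`, `attr(name, value)`, `attrs_done`, `text`, `end_element`) and assumes `attrs_done` fires once the attributes of an element have been read. The class names `BufferingOxAdapter` and `RecordingHandler` are hypothetical and are not part of saxerator or Ox; the sketch does not require the ox gem and only simulates the callback sequence.

```ruby
# Hypothetical adapter (not saxerator's actual code): buffers Ox-style
# attribute callbacks so the downstream handler receives a single
# start_element(name, attrs) message.
class BufferingOxAdapter
  def initialize(handler)
    @handler = handler # responds to start_element/end_element/characters
    @pending_name = nil
    @pending_attrs = {}
  end

  # Ox-style callback: the element name arrives without its attributes.
  def start_element(name)
    @pending_name = name.to_s
    @pending_attrs = {}
  end

  # Ox-style callback: one call per attribute.
  def attr(name, value)
    return unless @pending_name # e.g. attributes of an XML declaration
    @pending_attrs[name.to_s] = value
  end

  # Assumed to fire after the last attribute; now the payload is complete.
  def attrs_done
    return unless @pending_name
    @handler.start_element(@pending_name, @pending_attrs)
    @pending_name = nil
  end

  def text(value)
    @handler.characters(value)
  end

  def end_element(name)
    @handler.end_element(name.to_s)
  end
end

# Hypothetical stand-in for the constrained handler interface.
class RecordingHandler
  attr_reader :events

  def initialize
    @events = []
  end

  def start_element(name, attrs)
    @events << [:start_element, name, attrs]
  end

  def end_element(name)
    @events << [:end_element, name]
  end

  def characters(string)
    @events << [:characters, string]
  end
end
```

Feeding the adapter the callback sequence for `<item id="1">hi</item>` (start_element, attr, attrs_done, text, end_element) would hand the recording handler one combined start_element message followed by the text and the end of the element.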

fanantoxa commented 8 years ago

@soulcutter Hi. Sorry for the delay, I've been a bit overloaded at my new job.

I've taken a look at your changes and they look cool. But as you mentioned, we have problems with Latches.

Ox has different callbacks with different numbers of params:

def instruct(target); end
def end_instruct(target); end
def attr(name, str); end
def attr_value(name, value); end
def attrs_done(); end
def doctype(str); end
def comment(str); end
def cdata(str); end
def text(str); end
def value(value); end
def start_element(name); end
def end_element(name); end

Instead of Nokogiri:

So I think we have to add a new abstraction for Latches here too.

soulcutter commented 8 years ago

I have a pretty good idea of how I can match ox callbacks to the SaxHandler api. I'll have something to show within a couple days.

fanantoxa commented 8 years ago

@soulcutter Cool)) If you don't have time, you can tell me what you want to change and I'll try to implement it)

fanantoxa commented 8 years ago

@soulcutter Also, I've done a bit of research on a few other parsers that you wanted to support too. I think we have to think a bit more about the abstraction, because they also have different callbacks:

Oga

on_document
on_doctype
on_cdata
on_comment
on_proc_ins
on_xml_decl
on_text
on_element
on_element_children
on_attribute
on_attributes
after_element

LibXML

on_cdata_block(cdata)
on_characters(chars)
on_comment(msg)
on_end_document()
on_end_element_ns(name, prefix, uri)
on_error(msg)
on_external_subset(name, external_id, system_id)
on_has_external_subset()
on_has_internal_subset()
on_internal_subset(name, external_id, system_id)
on_is_standalone()
on_processing_instruction(target, data)
on_reference(name)
on_start_document()
on_start_element_ns(name, attributes, prefix, uri, namespaces)

So it would be great if we had the ability to change the callback names.
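As a rough illustration of remapping callback names, the sketch below translates three of the LibXML-style callbacks listed above into a smaller common interface. The signatures are taken from the list in this comment. The class names `LibxmlStyleAdapter` and `EventRecorder`, and the common `start_element` / `end_element` / `characters` interface, are assumptions made for the example. It also assumes the attributes argument is hash-like.

```ruby
# Hypothetical adapter: renames LibXML-style callbacks to a common interface
# and drops the arguments that interface does not need.
class LibxmlStyleAdapter
  def initialize(handler)
    @handler = handler
  end

  def on_start_element_ns(name, attributes, prefix, uri, namespaces)
    # prefix, uri, and namespaces are ignored in this sketch
    @handler.start_element(name, attributes.to_h)
  end

  def on_end_element_ns(name, prefix, uri)
    @handler.end_element(name)
  end

  def on_characters(chars)
    @handler.characters(chars)
  end

  # Treating CDATA like ordinary text is a choice made for this sketch.
  def on_cdata_block(cdata)
    @handler.characters(cdata)
  end
end

# Hypothetical stand-in for the common handler interface.
class EventRecorder
  attr_reader :events

  def initialize
    @events = []
  end

  def start_element(name, attrs)
    @events << [:start_element, name, attrs]
  end

  def end_element(name)
    @events << [:end_element, name]
  end

  def characters(string)
    @events << [:characters, string]
  end
end
```

Each parser would get its own small class like this, so the rest of the library only ever sees the common callback names.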

soulcutter commented 8 years ago

The different APIs are the reason behind the adapter extraction. Each adapter will take the messages sent by the parser and translate them to the constrained API which we define (which happens to be implemented through delegation because it's the first implementation, and because it seemed like a sensible enough interface).

For example, oga will not send a start_element until it gets a message other than attribute-related ones. Then it will have stored the element name and all its attributes, so it's a complete payload for our interface.
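The deferral described here can be sketched roughly as follows: the adapter remembers the element name and attributes, and only forwards a start_element once a message that is not attribute-related arrives. The incoming callback names (`element_name`, `attribute`, `text`, `element_end`) and the class names are hypothetical placeholders for whatever a given parser actually sends; this is not the code from the extract-adapter branch.

```ruby
# Hypothetical adapter: holds back start_element until the first
# non-attribute message, so name and attributes go out as one payload.
class DeferredStartAdapter
  def initialize(handler)
    @handler = handler
    @pending = nil # [name, attrs] for an element not yet forwarded
  end

  def element_name(name)
    flush_pending # a child element also ends the parent's attribute list
    @pending = [name, {}]
  end

  def attribute(name, value)
    @pending[1][name] = value if @pending
  end

  def text(string)
    flush_pending
    @handler.characters(string)
  end

  def element_end(name)
    flush_pending
    @handler.end_element(name)
  end

  private

  # Forward the stored element, if any, as a complete start_element message.
  def flush_pending
    return unless @pending
    @handler.start_element(*@pending)
    @pending = nil
  end
end

# Hypothetical stand-in for the constrained handler interface.
class CollectingHandler
  attr_reader :events

  def initialize
    @events = []
  end

  def start_element(name, attrs)
    @events << [:start_element, name, attrs]
  end

  def end_element(name)
    @events << [:end_element, name]
  end

  def characters(string)
    @events << [:characters, string]
  end
end
```

The trade-off in this variant is that every non-attribute callback has to remember to flush first, which is why the flush lives in one private method.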

Bradley Schaefer

On Jul 15, 2016, at 6:21 PM, fanantoxa notifications@github.com wrote:


soulcutter commented 8 years ago

#34 outlines sorta what I had in mind

fanantoxa commented 8 years ago

@soulcutter I've created a pull request against the ox-adapter branch: #35

soulcutter commented 8 years ago

This has been resolved in #38 and #39