soulcutter / saxerator

A SAX-based XML parser for parsing large files into manageable chunks
MIT License
128 stars 19 forks source link

Space in node parsed as empty hash? Can this be changed to string, or nil? #61

Open twigbranch opened 7 years ago

twigbranch commented 7 years ago

When parsing nodes like

<testnode> </testnode>

Saxerator (using ox) returns an empty hash. Is there a way to return a string, or nil? (instead of manually converting empty hashes to a string or nil? Other parsers I've used don't convert to an empty hash.

fanantoxa commented 7 years ago

@twigbranch Cloud you give me an example please? Because actually you should get Saxerator::Builder::StringElement instance, so if you need a String type you should call .to_s. (check last QA in FAQ) But if it's not, please, give me example and I'll check what is wrong here.

twigbranch commented 7 years ago

file at filepath:

<parent>
   <child1>child1 text</child1>
   <child2> </child2>
</parent>
parser = Saxerator.parser(File.new(filepath)) do |config|
      config.adapter = :ox
end

p=parser.for_tag('parent')

p.first
=> {"child1"=>"child1 text", "child2"=>{}}

p.first['child1'].class
=> Saxerator::Builder::StringElement

p.first['child2'].class
=> Saxerator::Builder::EmptyElement
fanantoxa commented 7 years ago

@twigbranch You probably using a v0.9.8 or older. There was a changes related to that: Here So Saxerator::Builder::EmptyElement doesn't exist in master. Could you try to use master branch? (we going to release version 1.0.0 soon, but 2 pull requests required)

fanantoxa commented 7 years ago

@twigbranch Did you have a time to check?

twigbranch commented 7 years ago

I installed via:

gem 'saxerator', :git => 'https://github.com/soulcutter/saxerator.git'
gem 'ox'

bundle install

Here's what I'm reportedly using now: Using saxerator 0.9.8 from https://github.com/soulcutter/saxerator.git (at master)

I get:

p.first
=> {"child1"=>"child1 text", "child2"=>{}}

p.first['child1'].class
=> Saxerator::Builder::StringElement

p.first['child2'].class
=> Saxerator::Builder::HashElement

So child2 is now ::HashElement (as opposed to EmptyElement), but why isn't it ::StringElement? Isn't it standard for parsers to return a string there?

twigbranch commented 7 years ago

Here's the nokogiri output - both are text nodes:

doc = File.open(filepath) { |f| Nokogiri::XML(f) }
=> #<Nokogiri::XML::Document:0x3fd290d339a0 name="document" children=[#<Nokogiri::XML::Element:0x3fd290d3366c name="parent" children=[#<Nokogiri::XML::Text:0x3fd290d3348c "\n  ">, #<Nokogiri::XML::Element:0x3fd290d333d8 name="child1" children=[#<Nokogiri::XML::Text:0x3fd290d331f8 "child1 text">]>, #<Nokogiri::XML::Text:0x3fd290d33054 "\n  ">, #<Nokogiri::XML::Element:0x3fd290d32fa0 name="child2" children=[#<Nokogiri::XML::Text:0x3fd290d32dc0 " ">]>, #<Nokogiri::XML::Text:0x3fd290d32c1c "\n">]>]>

Here's ox output:

xml = File.open(filepath).read
=> "<parent>\n  <child1>child1 text</child1>\n  <child2> </child2>\n</parent>\n"

doc = Ox.parse(xml)
=> #<Ox::Element:0x007fa52a201b70 @value="parent", @nodes=[#<Ox::Element:0x007fa52a201af8 @value="child1", @nodes=["child1 text"]>, #<Ox::Element:0x007fa52a201a58 @value="child2", @nodes=[" "]>]>
twigbranch commented 7 years ago

Wouldn't it make more sense to return <element />,<element></element>, <element> </element> as text, and <element><child /></element> and <element><child></child></element> as hash?

twigbranch commented 7 years ago

This issue is similar: https://github.com/soulcutter/saxerator/issues/12

twigbranch commented 7 years ago

Here's how Ruby on Rails ActiveSupport handles it:

Hash.from_xml(File.read(filepath))
=> {"parent"=>{"child1"=>"child1 text", "child2"=>" "}}

and for completely empty elements:

Hash.from_xml("<parent>\n  <child1>child1 text</child1>\n  <child2></child2>\n</parent>\n")
=> {"parent"=>{"child1"=>"child1 text", "child2"=>nil}}

and
Hash.from_xml("<parent>\n  <child1>child1 text</child1>\n  <child2 />\n</parent>\n")
=> {"parent"=>{"child1"=>"child1 text", "child2"=>nil}}

So no empty hash returned for empty, blank, or white space nodes - only string or nil.

fanantoxa commented 7 years ago

@twigbranch I think Saxerator::Builder::EmptyElement have to become s StringElement But something went wrong.

@soulcutter I saw your PR where you're removing EmptyElement. I think

def add_node(node)
        @text = true if node.is_a? String
        @children << node
      end

Doesn't get String type when the node contain a white space. What would you say about it? From my perspective I think it should handle it as white space.

twigbranch commented 7 years ago

Also, a completely empty element without even white space should be nil like ActiveSupport from_xml, no? (or at least "" text, but not a {})

soulcutter commented 7 years ago

I will take a look at this over the weekend. This sounds odd - I'm also curious if this is an ox-only thing (doesn't sound like it..).

This does illustrate that I need to release a new version - will probably wait to track this down to bump versions. If I understand correctly, I do believe returning a StringElement would be the proper behavior.

twigbranch commented 7 years ago

I tried it with the default rexml and got the same result: (::HashElement, not ::StringElement).

fanantoxa commented 7 years ago

@soulcutter ^ Would be nice if you find the issue and also finish one PR about return enumeration.

I'll also try to finish my changes about relapsing lambdas with abstractions. So we'll be release even first stable version of gem)

fanantoxa commented 6 years ago

@soulcutter Did you have a chanse to check what is wrong? I have some free time now, so if y'd explain me I can try to make a PR.