soulcutter / saxerator

A SAX-based XML parser for parsing large files into manageable chunks
MIT License

Attributes are lost on single child #17

Closed (mmonsta closed this 10 years ago)

mmonsta commented 10 years ago

I might be doing something wrong, but when a node that can have zero or more children has exactly one child, the parsed object comes back as an array and its attributes are no longer present.

For example, with the following XML, the children of the first node are returned as separate objects that inspect as {:N=>[{}, {}, {}]}, with .attributes['F'] returning 189 and 190 respectively. In the second case, an array like [:N, [{}, {}, {}]] is returned and the F attribute is no longer accessible (undefined method 'attributes' for Array).

<Root>
    <N F="1" T="t">
        <N K="p" T="n" V="4"/>
        <N K="t" T="t">
            <N F="189" T="t">
                <N K="i" T="s" V="data1"/>
                <N K="t" T="s" V="data2"/>
                <N K="n" T="s" V="data3"/>
            </N>
            <N F="190" T="t">
                <N K="i" T="s" V="data1"/>
                <N K="t" T="s" V="data2"/>
                <N K="n" T="s" V="data3"/>
            </N>
        </N>
    </N>
    <N F="2" T="t">
        <N K="p" T="n" V="8"/>
        <N K="t" T="t">
            <N F="195" T="t">
                <N K="i" T="s" V="data1"/>
                <N K="t" T="s" V="data2"/>
                <N K="n" T="s" V="data3"/>
            </N>
        </N>
    </N>
</Root>

soulcutter commented 10 years ago

Which parser are you using on this document? Your explanation is a bit hard for me to follow; it would help to be able to run the code and see for myself what's going on.

mmonsta commented 10 years ago

Probably Nokogiri; I am just instantiating it with Saxerator.parser.

Here is sample code to reproduce the issue. It is supposed to dump a tree from the XML above, but it fails at node <N F="195" T="t">:

require 'saxerator'

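# Recursively print each node, indented by its depth in the tree.
def dump_nodes(children, level)
  children.each do |data|
    puts "##{data.inspect}"
    label = data.attributes['K'] || data.attributes['F']
    if data.attributes['T'] == 't'
      # Container node: print its label, then descend into its <N> children.
      puts((' ' * level) + label)
      dump_nodes(data[:N], level + 1) if data[:N]
    else
      # Leaf node: print its label together with its value.
      puts((' ' * level) + label + ' = ' + data.attributes['V'])
    end
  end
end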

xml = Saxerator.parser(File.new('test.xml')) do |config|
  config.symbolize_keys!
end

dump_nodes(xml.at_depth(1),0)
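
One plausible reading of the symptom, sketched without Saxerator itself: if a lone child is stored as a bare hash-like element rather than as a one-element array, calling .each on it walks its key/value pairs, which gives exactly the [:N, [...]] shape reported above. The Node class and the children_of helper below are hypothetical stand-ins written for this illustration, not part of the library's API.

```ruby
# Hypothetical stand-in for a parsed element: a Hash that also carries
# XML attributes. It only mimics the shape described in this thread.
class Node < Hash
  attr_reader :attributes

  def initialize(attributes = {})
    super()
    @attributes = attributes
  end
end

# Hypothetical helper: always return a list of child elements.
# Array(value) is avoided on purpose, because it would turn a Hash
# into [key, value] pairs, the very shape we are trying to avoid.
def children_of(node, key)
  value = node[key]
  return [] if value.nil?

  value.is_a?(Array) ? value : [value]
end

# Mirror the second branch of the sample XML: one F="195" child
# that itself holds three leaf nodes.
single = Node.new({ 'F' => '195', 'T' => 't' })
single[:N] = [
  Node.new({ 'K' => 'i' }),
  Node.new({ 'K' => 't' }),
  Node.new({ 'K' => 'n' })
]

parent = Node.new({ 'K' => 't', 'T' => 't' })
parent[:N] = single # a lone child stored as a bare element, not an array

# Iterating the bare element yields its key/value pairs (Arrays).
bare = parent[:N].each.map { |item| item.class }

# Wrapping first yields the element itself, attributes intact.
wrapped = children_of(parent, :N).map { |node| node.attributes['F'] }
```

Under that assumption, replacing dump_nodes(data[:N], level+1) with a call that wraps the value first would let the traversal reach the F="195" node on versions that show this behaviour.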

soulcutter commented 10 years ago

Thanks, I'll take a look at this tomorrow

leifg commented 10 years ago

The problem persists though:

Saxerator.parser('<xml><tag attribute="content"/></xml>'){|c| c.put_attributes_in_hash!}.all["tag"]["attribute"]
=> nil

So it seems that if a tag has no children, its attributes are swallowed even with put_attributes_in_hash! enabled. Accessing them through .attributes is only a workaround.

soulcutter commented 10 years ago

Wow, I can't believe it's been 20 days. I will get back around to this; thanks for the reminder. PRs are welcome too if you have the time to take a look :)

leifg commented 10 years ago

Created a pull request

soulcutter commented 10 years ago

I just pushed version 0.9.4 with this fix. Thanks for your patience and help.