Provide option to strip attributes for empty tags

lukasbischof commented 2 years ago

Introduces a new option "strip_attributes_of_empty_tags" which can be used to ignore any attributes when an empty tag is found, in favour of consistent return types. This can be helpful if the parsed XML document mainly specifies namespaces or other attributes one might not be interested in to reduce the code needed during parsing.

My use case which motivated this feature was a bug I had when parsing a feed, which sometimes returned

<Street xmlns="...">MyStreet</Street>
<BuildingNumber xmlns="...">6</BuildingNumber>

but sometimes also

<Street xmlns="...">MyStreet 6</Street>
<BuildingNumber xmlns="..." />

The code which processed the parsed feed looked something like

Model.update(address: [parsed[:street], parsed[:building_number]].join(' '))

which (sometimes) produced addressed like MyStreet {:@xmlns=>"..."}.

Since I'm not interested in the attribute anyway, it would be easier to just strip them already while parsing. closes #97

pcai commented 2 years ago

Hi. Thanks for making this PR as we discussed. My understanding from reading #97 is you care about consistency and the problem was that the presence of an extra attribute without a value changes the shape of an empty tag (e.g., without child elements or inner text). I agree this is weird!

To expand more on the problem, it should be the case that

result = Nori.new(ignore_empty_attributes: true).parse('<foo bar />')
#=> {"foo"=>nil}

should give similar same result as

result = Nori.new.parse('<foo />')
#=> {"foo"=>nil}

because they are semantically equivalent, except that result['foo'].attributes would return {"bar" => ""} in the first case.

Is there a reason you took this current approach of stripping the attributes instead? My fear is that as-is...you are adding an additional feature that makes nori operate in a different, backwards-incompatible way, instead of "opt in to make the api more consistent"

lukasbischof commented 2 years ago

Hi, thanks for your review. I'm not sure if I understand your concern about backwards-incompatibility correctly. The stripping is deactivated by default, so omitting the option would not activate the feature thus behave as it used to. Hence, all tests still pass without me setting it explicitly to false.

The reason why I implemented stripping of all attributes of empty tags is because it solves my current issue I have in my project. There, I had parsing difficulties because the SOAP API I'm using returns data sometimes already concatenated and sometimes not (The problem I described in the description of this issue). I can fix my parsing problem also on application level where Savon first parses the request using Nori and I then have to take care of standardising the response so it results in consistent output. However, that's quite tedious since I have to recursively flatten the empty attributes. It would be easier if there was an option for that in the library itself, hence this feature.

I see that this may be quite specific and that the library may/should not be responsible for that or only be responsible for handling the <foo bar /> and <foo /> case. Unfortunately, that's not too easy to achieve, since <foo bar /> actually isn't valid XML. Nokogiri handles it still since in HTML it's allowed (Or at least tolerated, I'm not 100% if it's actually in the standard, haven't checked :)). But when using rexml, it even raises:

pry(main)> Nori.new(parser: :rexml).parse('<foo bar />')
REXML::ParseException: Missing attribute equal: <bar>
Line: 1
Position: 11
Last 80 unconsumed characters:

from [...]/rexml-3.2.5/lib/rexml/parsers/baseparser.rb:619:in `block in parse_attributes'

so being consistent would only work for Nokogiri. But then, would this be an opt-in feature just for Nokogiri for backwards-compatibility? Would you still expect the attribute to be present somewhere? If yes, wouldn't it be weird that (in the example you gave), result['foo'] is actually nil but result['foo'].attributes returns {'bar'=>''}? And if this behaviour is wanted, it would require extending NilClass which seems rather hacky to me... To circumvent this issue, one could implement only stripping of empty attributes in empty tags, but this would then mean that we're losing information again.

savonrb / nori

Provide option to strip attributes for empty tags #98