savonrb / nori

XML to Hash translator
MIT License

REXML parsing error #93

Closed jvnill closed 2 years ago

jvnill commented 2 years ago

With REXML as the parser, &lt; inside a CDATA section is converted to <. The Nokogiri parser leaves it unchanged.

irb(main):030:0> Nori.new(parser: :rexml).parse("<lixi_payload><![CDATA[<RealEstate Zoning=\"Residential (Development/Existing) &lt;=6 units/dwellings\"></RealEstate>]]></lixi_payload>")
=> {"lixi_payload"=>"<RealEstate Zoning=\"Residential (Development/Existing) <=6 units/dwellings\"></RealEstate>"}

irb(main):031:0> Nori.new(parser: :nokogiri).parse("<lixi_payload><![CDATA[<RealEstate Zoning=\"Residential (Development/Existing) &lt;=6 units/dwellings\"></RealEstate>]]></lixi_payload>")
=> {"lixi_payload"=>"<RealEstate Zoning=\"Residential (Development/Existing) &lt;=6 units/dwellings\"></RealEstate>"}
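For reference, XML treats the content of a CDATA section as literal character data, so entity references such as &lt; are not recognized there and the Nokogiri output is the expected one. A minimal check of that behaviour, using REXML's DOM API directly rather than Nori (the payload element and its content are made up for illustration, and the rexml gem is assumed to be installed):

```ruby
require "rexml/document"

# Illustrative document: a single CDATA section containing an entity-like sequence.
xml = "<payload><![CDATA[a &lt;= b]]></payload>"
doc = REXML::Document.new(xml)

# Element#cdatas returns the CData children of the root element.
cdata = doc.root.cdatas.first

# CData#value returns the section content as written, without entity expansion.
puts cdata.value
```

If the CDATA content comes back with &lt; intact here, the conversion seen through Nori has to happen somewhere after REXML hands over the data.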

olleolleolle commented 2 years ago

@jvnill Thanks for pointing out this difference.

This StackOverflow thread has a good discussion of the differences.

jvnill commented 2 years ago

Thanks for the prompt response, @olleolleolle. The issue is not actually in the REXML parser itself. It is in the Nori::Parser::REXML wrapper:

require "rexml/document"

class Nori
  module Parser

    # = Nori::Parser::REXML
    #
    # REXML pull parser.
    module REXML

      def self.parse(xml, options)
        stack = []
        parser = ::REXML::Parsers::BaseParser.new(xml)

        while true
          event = unnormalize(parser.pull)
          case event[0]
          when :end_document
            break
          when :end_doctype, :start_doctype
            # do nothing
          when :start_element
            stack.push Nori::XMLUtilityNode.new(options, event[1], event[2])
          when :end_element
            if stack.size > 1
              temp = stack.pop
              stack.last.add_node(temp)
            end
          when :text, :cdata
            stack.last.add_node(event[1]) unless event[1].strip.length == 0 || stack.empty?
          end
        end
        stack.length > 0 ? stack.pop.to_hash : {}
      end

      def self.unnormalize(event)
        event.map! do |el|
          if el.is_a?(String)
            ::REXML::Text.unnormalize(el)
          elsif el.is_a?(Hash)
            el.each {|k,v| el[k] = ::REXML::Text.unnormalize(v)}
          else
            el
          end
        end
      end

    end
  end
end

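unnormalize is applied to every event pulled from the parser, so it also unescapes entities in :cdata events. That is correct for :text events and attribute values, but text inside CDATA is literal and should be left untouched.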
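One way to express that is to skip unescaping when the event type is :cdata. The following is a minimal sketch under that assumption, not necessarily what the pull request does. The helper names unnormalize_event and character_data are hypothetical and not part of Nori. Only REXML::Parsers::BaseParser and REXML::Text.unnormalize are taken from the wrapper above.

```ruby
require "rexml/document"

# Hypothetical helper (not Nori's API): unescape entities in every event
# except :cdata, whose content is literal character data.
def unnormalize_event(event)
  return event if event[0] == :cdata

  event.map do |el|
    case el
    when String then REXML::Text.unnormalize(el)
    when Hash   then el.transform_values { |v| REXML::Text.unnormalize(v) }
    else el
    end
  end
end

# Hypothetical helper: collect the :text and :cdata payloads of a document,
# pulling events the same way the wrapper does.
def character_data(xml)
  parser = REXML::Parsers::BaseParser.new(xml)
  chunks = []
  loop do
    event = unnormalize_event(parser.pull)
    break if event[0] == :end_document
    chunks << event[1] if %i[text cdata].include?(event[0])
  end
  chunks
end

p character_data("<a><![CDATA[x &lt;= 6]]></a>")
p character_data("<a>x &lt;= 6</a>")
```

With this guard, the CDATA payload should keep &lt; as written, while ordinary text is still unescaped to <.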

jvnill commented 2 years ago

Submitted https://github.com/savonrb/nori/pull/94