Closed jvnill closed 2 years ago
@jvnill Thanks for pointing out this difference.
This StackOverflow thread has a good discussion of the differences.
Thanks for the prompt response @olleolleolle. It's not actually the REXML parser that is the issue. It's the Nori::Parser::REXML wrapper:
require "rexml/document"

class Nori
  module Parser

    # = Nori::Parser::REXML
    #
    # REXML pull parser.
    module REXML

      def self.parse(xml, options)
        stack = []
        parser = ::REXML::Parsers::BaseParser.new(xml)

        while true
          event = unnormalize(parser.pull)
          case event[0]
          when :end_document
            break
          when :end_doctype, :start_doctype
            # do nothing
          when :start_element
            stack.push Nori::XMLUtilityNode.new(options, event[1], event[2])
          when :end_element
            if stack.size > 1
              temp = stack.pop
              stack.last.add_node(temp)
            end
          when :text, :cdata
            stack.last.add_node(event[1]) unless event[1].strip.length == 0 || stack.empty?
          end
        end
        stack.length > 0 ? stack.pop.to_hash : {}
      end

      def self.unnormalize(event)
        event.map! do |el|
          if el.is_a?(String)
            ::REXML::Text.unnormalize(el)
          elsif el.is_a?(Hash)
            el.each {|k,v| el[k] = ::REXML::Text.unnormalize(v)}
          else
            el
          end
        end
      end

    end
  end
end
unnormalize converts escaped entities to unescaped characters in every event, but this shouldn't happen for text inside CDATA.
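One way to picture a fix is to skip unescaping for :cdata events and leave everything else alone. The following is only a rough sketch of that idea, not the change from the pull request: unnormalize_event is a hypothetical standalone helper, and it returns a new array instead of mutating the event in place as the wrapper does.

```ruby
require "rexml/document"

# Hypothetical helper (not part of Nori): unescape entities in a pull-parser
# event, but return :cdata events untouched.
def unnormalize_event(event)
  # CDATA content is literal text, so entity references in it stay as written.
  return event if event[0] == :cdata

  event.map do |el|
    case el
    when String then REXML::Text.unnormalize(el)
    when Hash   then el.transform_values { |v| REXML::Text.unnormalize(v) }
    else el # e.g. the leading event-type symbol
    end
  end
end

unnormalize_event([:text, "a &lt; b"])   # entities in regular text are unescaped
unnormalize_event([:cdata, "a &lt; b"])  # the CDATA event comes back unchanged
```

With a guard like this, regular text nodes and attribute values still get unescaped, so only the CDATA behaviour differs.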
submitted https://github.com/savonrb/nori/pull/94
Using REXML as the parser, &lt; inside CDATA is converted to <. Nokogiri does not have this issue.
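A minimal reproduction using only REXML (no Nori), assuming a made-up one-element document: the pull parser hands back the CDATA content as written, and it is the extra REXML::Text.unnormalize call that unescapes it.

```ruby
require "rexml/document"
require "rexml/parsers/baseparser"

# Sample document invented for illustration.
xml = "<root><![CDATA[a &lt; b]]></root>"
parser = REXML::Parsers::BaseParser.new(xml)

# Collect every pull event until the parser signals the end of the document.
events = []
loop do
  event = parser.pull
  events << event
  break if event[0] == :end_document
end

cdata_event = events.find { |e| e[0] == :cdata }
raw = cdata_event[1]                      # content as emitted by the parser
unescaped = REXML::Text.unnormalize(raw)  # what the wrapper ends up storing

puts raw        # a &lt; b
puts unescaped  # a < b
```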