qcam / saxy

Fast SAX parser and encoder for XML in Elixir
https://hexdocs.pm/saxy
MIT License

Saxy does not map \r to \n like SweetXml does - can we count on this continuing? #119

Closed cmarkle closed 1 year ago

cmarkle commented 1 year ago

We're trying to parse XML output from Amazon S3 APIs where the file/object name might end with a \n or \r character (inappropriate, but possible). Right now we are using SweetXml, but "xxx\r" is getting parsed to "xxx\n". If we sent that back to S3 as the name, it would not match the name we received in the first place.

Parsing \r | \n with SweetXml.parse:

iex(6)> SweetXml.parse("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx\r</testattr>") 
{:xmlElement, :testattr, :testattr, [], {:xmlNamespace, [], []}, [], 1, [],
 [{:xmlText, [testattr: 1], 1, [], 'xxx\n', :text}], [],
 '/Users/cmarkle/devel/elixir/xml_example', :undeclared}
iex(7)> SweetXml.parse("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx\n</testattr>")
{:xmlElement, :testattr, :testattr, [], {:xmlNamespace, [], []}, [], 1, [],
 [{:xmlText, [testattr: 1], 1, [], 'xxx\n', :text}], [],
 '/Users/cmarkle/devel/elixir/xml_example', :undeclared}
iex(8)> SweetXml.parse("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx&#13;</testattr>")
{:xmlElement, :testattr, :testattr, [], {:xmlNamespace, [], []}, [], 1, [],
 [{:xmlText, [testattr: 1], 1, [], 'xxx\n', :text}], [],
 '/Users/cmarkle/devel/elixir/xml_example', :undeclared}
iex(9)> SweetXml.parse("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx&#10;</testattr>")
{:xmlElement, :testattr, :testattr, [], {:xmlNamespace, [], []}, [], 1, [],
 [{:xmlText, [testattr: 1], 1, [], 'xxx\n', :text}], [],
 '/Users/cmarkle/devel/elixir/xml_example', :undeclared}

...basically, \n and \r are each mapped to \n, probably as intended by the XML spec.

I see that Saxy distinguishes between these two characters and \r is NOT mapped to \n.

Parsing \r | \n with Saxy.SimpleForm:

iex(2)> Saxy.SimpleForm.parse_string("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx\n</testattr>")
{:ok, {"testattr", [], ["xxx\n"]}}
iex(3)> Saxy.SimpleForm.parse_string("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx\r</testattr>")
{:ok, {"testattr", [], ["xxx\r"]}}
iex(4)> Saxy.SimpleForm.parse_string("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx&#13;</testattr>")
{:ok, {"testattr", [], ["xxx\r"]}}
iex(5)> Saxy.SimpleForm.parse_string("<?xml version='1.0' encoding='UTF-8'?><testattr>xxx&#10;</testattr>")
{:ok, {"testattr", [], ["xxx\n"]}}

...basically, \n and \r are treated distinctly, and each comes through unchanged in the result.

This behavior would be helpful to me in this case, but my question is: can we count on it staying this way in Saxy?

qcam commented 1 year ago

Hi, in general, Saxy treats every character matching CharData the same way. In this particular case, yes, you can count on \n always being emitted as \n, \r as \r, and so on.
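
For anyone who depends on this behavior, one option is to pin it with a small regression test so that a change would show up on upgrade. The following is only a sketch: it reuses the four inputs and results from the iex session above, the module name `LineEndingPinTest` and the helper `text_of/1` are hypothetical, and the `~> 1.5` version requirement passed to `Mix.install/2` is an assumption (in a Mix project, Saxy would normally be listed in mix.exs instead).

```elixir
# Assumption: Saxy is fetched with Mix.install/2 for a standalone script;
# the "~> 1.5" requirement is illustrative, not taken from the thread.
Mix.install([{:saxy, "~> 1.5"}])

ExUnit.start()

defmodule LineEndingPinTest do
  # Hypothetical test module that pins the behavior discussed in this issue.
  use ExUnit.Case, async: true

  @prolog "<?xml version='1.0' encoding='UTF-8'?>"

  # Hypothetical helper: parses a single <testattr> element and returns
  # its only text child, crashing on any other shape.
  defp text_of(xml) do
    {:ok, {"testattr", [], [text]}} = Saxy.SimpleForm.parse_string(@prolog <> xml)
    text
  end

  test "literal \\r and \\n are kept distinct" do
    assert text_of("<testattr>xxx\r</testattr>") == "xxx\r"
    assert text_of("<testattr>xxx\n</testattr>") == "xxx\n"
  end

  test "&#13; and &#10; character references are kept distinct" do
    assert text_of("<testattr>xxx&#13;</testattr>") == "xxx\r"
    assert text_of("<testattr>xxx&#10;</testattr>") == "xxx\n"
  end
end
```

Saved as a script (for example under a hypothetical name such as `line_ending_pin.exs`), this should run with `elixir line_ending_pin.exs`. The assertions only restate the results shown earlier in this thread.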

cmarkle commented 1 year ago

@qcam Thanks for the clarification. I am going to close this issue.