ruby / rexml

REXML is an XML toolkit for Ruby
BSD 2-Clause "Simplified" License
133 stars 61 forks source link

Behavior Change from 3.3.1 -> 3.3.2 #180

Closed tilo closed 1 week ago

tilo commented 1 month ago

TL,DR:

Not sure if you want to treat this as a regression.

Example

We found that some improper input would still be parsed correctly in 3.3.1, but breaks parsing in 3.3.2.

Some gem in combination with a VCR cassette containing proper XML produces poorly formatted output when doing response.to_s, and escapes what are the actual string delimiters " of the response data, and then appends a \n to it.

So "string" becomes `"\"string\"\n" 🤦

Clearly incorrect data 🤦 but 3.3.1 was robust against it.

pry(RestClient)> xml = response.to_s
=> "\"<?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\" standalone=\\\"yes\\\"?>\n  <SomeResponse>\n    <Purchase>\n      <Id>123</Id>\n      <SyncToken>0</SyncToken>\n      <MetaData>\n        <CreateTime>2021-11-23T04:32:03-08:00</CreateTime>\n        <LastUpdatedTime>2021-11-23T04:32:03-08:00</LastUpdatedTime>\n      </MetaData>\n      <TxnDate>2021-11-23</TxnDate>\n      <CurrencyRef>USD</CurrencyRef>\n      <Line>\n        <Id>1</Id>\n        <Description>some description</Description>\n        <Amount>123.45</Amount>\n        <DetailType>AccountBasedExpenseLineDetail</DetailType>\n        <AccountBasedExpenseLineDetail>\n          <CustomerRef>1</CustomerRef>\n          <AccountRef>12</AccountRef>\n          <BillableStatus>NotBillable</BillableStatus>\n          <TaxCodeRef>NON</TaxCodeRef>\n        </AccountBasedExpenseLineDetail>\n      </Line>\n      <AccountRef>34</AccountRef>\n      <PaymentType>CreditCard</PaymentType>\n      <EntityRef>1</EntityRef>\n      <Credit>false</Credit>\n      <TotalAmt>123.45</TotalAmt>\n      <PurchaseEx>\n        <NameValue>\n          <Name>TxnType</Name>\n          <Value>99</Value>\n        </NameValue>\n      </PurchaseEx>\n    </Purchase>\n  </SomeResponse>\"\n"

# with 3.3.1:

pry(main)> Hash.from_xml xml
=> {"SomeResponse"=>
  {"Purchase"=>
    {"Id"=>"123",
     "SyncToken"=>"0",
     "MetaData"=>{"CreateTime"=>"2021-11-23T04:32:03-08:00", "LastUpdatedTime"=>"2021-11-23T04:32:03-08:00"},
     "TxnDate"=>"2021-11-23",
     "CurrencyRef"=>"USD",
     "Line"=>{"Id"=>"1", "Description"=>"some transaction", "Amount"=>"123.45", "DetailType"=>"AccountBasedExpenseLineDetail", "AccountBasedExpenseLineDetail"=>{"CustomerRef"=>"1", "AccountRef"=>"12", "BillableStatus"=>"NotBillable", "TaxCodeRef"=>"NON"}},
     "AccountRef"=>"34",
     "PaymentType"=>"CreditCard",
     "EntityRef"=>"1",
     "Credit"=>"false",
     "TotalAmt"=>"123.45",
     "PurchaseEx"=>{"NameValue"=>{"Name"=>"TxnType", "Value"=>"99"}}}}}

# with 3.3.2:

pry(RestClient)> Hash.from_xml xml
REXML::ParseException: Malformed XML: Extra content at the end of the document (got '"
')
Line: 36
Position: 1176
Last 80 unconsumed characters:
MahtraDR commented 1 month ago

Just wanted to pipe in that our suite of projects maintaining a gaming scripting engine has this same issue. We're panicking a bit thinking we might have to re-write core functionality, or switch XML parsers because of this issue.

kou commented 1 month ago

The input XML is the following, right?

"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>
  <SomeResponse>
    <Purchase>
      <Id>123</Id>
      <SyncToken>0</SyncToken>
      <MetaData>
        <CreateTime>2021-11-23T04:32:03-08:00</CreateTime>
        <LastUpdatedTime>2021-11-23T04:32:03-08:00</LastUpdatedTime>
      </MetaData>
      <TxnDate>2021-11-23</TxnDate>
      <CurrencyRef>USD</CurrencyRef>
      <Line>
        <Id>1</Id>
        <Description>some description</Description>
        <Amount>123.45</Amount>
        <DetailType>AccountBasedExpenseLineDetail</DetailType>
        <AccountBasedExpenseLineDetail>
          <CustomerRef>1</CustomerRef>
          <AccountRef>12</AccountRef>
          <BillableStatus>NotBillable</BillableStatus>
          <TaxCodeRef>NON</TaxCodeRef>
        </AccountBasedExpenseLineDetail>
      </Line>
      <AccountRef>34</AccountRef>
      <PaymentType>CreditCard</PaymentType>
      <EntityRef>1</EntityRef>
      <Credit>false</Credit>
      <TotalAmt>123.45</TotalAmt>
      <PurchaseEx>
        <NameValue>
          <Name>TxnType</Name>
          <Value>99</Value>
        </NameValue>
      </PurchaseEx>
    </Purchase>
  </SomeResponse>"

This is an invalid XML but REXML 3.3.1 accepts it, right? And REXML 3.3.2 doesn't accept it, right?

kou commented 1 month ago

The XML isn't accepted with both of ActiveSupport::XmlMini.backend = "Nokogiri" and ActiveSupport::XmlMini.backend = "LibXML" but ActiveSupport::XmlMini.backend = "REXML" with REXML 3.3.1 accepts the XML, right?

MahtraDR commented 1 month ago

Mine is a bit different, in that I'm dealing with XML tags nested inside text. It is the same in that it's worked for years of REXML usage, and broke with 3.3.2 though. This is my offending line.

"<a exist=\"-10949883\" noun=\"Jespin\">He</a> is wearing a <a exist=\"76182094\" noun=\"jacket\">soft black suede jacket</a> with fitted sleeves, a <a exist=\"76182143\" noun=\"cufflinks\">pair of iron and onyx cufflinks</a>, a <a exist=\"76182144\" noun=\"pin\">small golden Paupers pin</a>, a <a exist=\"76182145\" noun=\"satchel\">finely tooled grey leather satchel</a>, a pumpkin-stitched bright <a exist=\"76182149\" noun=\"backpack\">orange backpack</a>, a <a exist=\"76182153\" noun=\"shirt\">pale white spidersilk shirt</a>, a <a exist=\"76182154\" noun=\"apron\">leather crafter's apron</a>, an <a exist=\"76182155\" noun=\"breastplate\">etched vultite metal breastplate</a>, a <a exist=\"76182156\" noun=\"gauntlets\">pair of thick leather gauntlets</a> with polished eonake studs, a <a exist=\"76182157\" noun=\"sack\">large black sack</a>, a <a exist=\"76182158\" noun=\"pouch\">deep indigo eel skin pouch</a>, a <a exist=\"76182196\" noun=\"forging-hammer\">perfect steel-handled mithril forging-hammer</a>, a <a exist=\"76182197\" noun=\"pants\">pair of crisp black linen pants</a>, <a exist=\"76182198\" noun=\"greaves\">some thick spiked ora leg greaves</a>, a <a exist=\"76182199\" noun=\"boots\">pair of thick eonake-toed leather boots</a>, and an <a exist=\"76182200\" noun=\"greatshield\">oval silver steel greatshield</a> lined with crimson plates.\r\n"

Would you prefer mine in a separate issue?

kou commented 1 month ago

Ah, yes, please.

tilo commented 1 month ago

@kou it is similar


xml_string = "\"<?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\" standalone=\\\"yes\\\"?>\n  <SomeResponse>\n    <Purchase>\n      <Id>123</Id>\n      <SyncToken>0</SyncToken>\n      <MetaData>\n        <CreateTime>2021-11-23T04:32:03-08:00</CreateTime>\n        <LastUpdatedTime>2021-11-23T04:32:03-08:00</LastUpdatedTime>\n      </MetaData>\n      <TxnDate>2021-11-23</TxnDate>\n      <CurrencyRef>USD</CurrencyRef>\n      <Line>\n        <Id>1</Id>\n        <Description>some description</Description>\n        <Amount>123.45</Amount>\n        <DetailType>AccountBasedExpenseLineDetail</DetailType>\n        <AccountBasedExpenseLineDetail>\n          <CustomerRef>1</CustomerRef>\n          <AccountRef>12</AccountRef>\n          <BillableStatus>NotBillable</BillableStatus>\n          <TaxCodeRef>NON</TaxCodeRef>\n        </AccountBasedExpenseLineDetail>\n      </Line>\n      <AccountRef>34</AccountRef>\n      <PaymentType>CreditCard</PaymentType>\n      <EntityRef>1</EntityRef>\n      <Credit>false</Credit>\n      <TotalAmt>123.45</TotalAmt>\n      <PurchaseEx>\n        <NameValue>\n          <Name>TxnType</Name>\n          <Value>99</Value>\n        </NameValue>\n      </PurchaseEx>\n    </Purchase>\n  </SomeResponse>\"\n"

best to cut+paste this into a variable to parse it.

It is similar to the XML you showed, but from what I can tell with incorrect quoting

kou commented 1 month ago

The XML in https://github.com/ruby/rexml/issues/180#issuecomment-2235239789 was generated by puts xml_string.

Could you confirm whether my understanding in the comment is correct or not?

scpike commented 1 month ago

Here's another example that parses under 3.3.1 but not 3.3.2:

<a>
    <?xml version="1.0" encoding="UTF-8"?>
</a>

I know it's not valid XML, but it was unexpected to see in a patch release.

kou commented 1 month ago

Hmm. Could you explain why do you want to parse the invalid XML?

scpike commented 1 month ago

It's an (abbreviated) version of a spec testing an API response from a third party service that started failing when I tried to upgrade the gem. I imagine it's just gone under the radar for years being silently swallowed.

kou commented 1 month ago

Could you ask the third party service whether it's an expected response or not?

Adsidera commented 1 month ago

Is there a solution or workaround available for this issue, other than just downgrading to 3.3.1?

kou commented 1 month ago

Could you share your use case? You are parsing invalid XML too?

Adsidera commented 1 month ago

Hello @kou, yes. We parse XML to create a downloadable document from a stored template. With v3.3.2, parsing XML raises the Malformed XML error for some documents that used to be just fine with previous v3.3.1. So, yes, parsing XML that now is considered invalid.

kou commented 1 month ago

Are these XMLs are really invalid? If they are invalid, can you fix them? If they are valid (but an error is reported), could you open a new issue with an XML that reproduces the problem?

Adsidera commented 1 month ago

Thanks for your reply, @kou. I'll do what you suggest and will let you know.

Adsidera commented 1 week ago

Hello @kou , the XML files were really invalid and could be fixed "manually". Thank you for your support!

kou commented 1 week ago

OK. Thanks for sharing your case.

It seems that no more problems around invalid XML will be reported. I close this.