Closed tilo closed 2 months ago
Just wanted to pipe in that our suite of projects maintaining a gaming scripting engine has this same issue. We're panicking a bit thinking we might have to re-write core functionality, or switch XML parsers because of this issue.
The input XML is the following, right?
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>
<SomeResponse>
<Purchase>
<Id>123</Id>
<SyncToken>0</SyncToken>
<MetaData>
<CreateTime>2021-11-23T04:32:03-08:00</CreateTime>
<LastUpdatedTime>2021-11-23T04:32:03-08:00</LastUpdatedTime>
</MetaData>
<TxnDate>2021-11-23</TxnDate>
<CurrencyRef>USD</CurrencyRef>
<Line>
<Id>1</Id>
<Description>some description</Description>
<Amount>123.45</Amount>
<DetailType>AccountBasedExpenseLineDetail</DetailType>
<AccountBasedExpenseLineDetail>
<CustomerRef>1</CustomerRef>
<AccountRef>12</AccountRef>
<BillableStatus>NotBillable</BillableStatus>
<TaxCodeRef>NON</TaxCodeRef>
</AccountBasedExpenseLineDetail>
</Line>
<AccountRef>34</AccountRef>
<PaymentType>CreditCard</PaymentType>
<EntityRef>1</EntityRef>
<Credit>false</Credit>
<TotalAmt>123.45</TotalAmt>
<PurchaseEx>
<NameValue>
<Name>TxnType</Name>
<Value>99</Value>
</NameValue>
</PurchaseEx>
</Purchase>
</SomeResponse>"
This is an invalid XML but REXML 3.3.1 accepts it, right? And REXML 3.3.2 doesn't accept it, right?
The XML isn't accepted with both of ActiveSupport::XmlMini.backend = "Nokogiri"
and ActiveSupport::XmlMini.backend = "LibXML"
but ActiveSupport::XmlMini.backend = "REXML"
with REXML 3.3.1 accepts the XML, right?
Mine is a bit different, in that I'm dealing with XML tags nested inside text. It is the same in that it's worked for years of REXML usage, and broke with 3.3.2 though. This is my offending line.
"<a exist=\"-10949883\" noun=\"Jespin\">He</a> is wearing a <a exist=\"76182094\" noun=\"jacket\">soft black suede jacket</a> with fitted sleeves, a <a exist=\"76182143\" noun=\"cufflinks\">pair of iron and onyx cufflinks</a>, a <a exist=\"76182144\" noun=\"pin\">small golden Paupers pin</a>, a <a exist=\"76182145\" noun=\"satchel\">finely tooled grey leather satchel</a>, a pumpkin-stitched bright <a exist=\"76182149\" noun=\"backpack\">orange backpack</a>, a <a exist=\"76182153\" noun=\"shirt\">pale white spidersilk shirt</a>, a <a exist=\"76182154\" noun=\"apron\">leather crafter's apron</a>, an <a exist=\"76182155\" noun=\"breastplate\">etched vultite metal breastplate</a>, a <a exist=\"76182156\" noun=\"gauntlets\">pair of thick leather gauntlets</a> with polished eonake studs, a <a exist=\"76182157\" noun=\"sack\">large black sack</a>, a <a exist=\"76182158\" noun=\"pouch\">deep indigo eel skin pouch</a>, a <a exist=\"76182196\" noun=\"forging-hammer\">perfect steel-handled mithril forging-hammer</a>, a <a exist=\"76182197\" noun=\"pants\">pair of crisp black linen pants</a>, <a exist=\"76182198\" noun=\"greaves\">some thick spiked ora leg greaves</a>, a <a exist=\"76182199\" noun=\"boots\">pair of thick eonake-toed leather boots</a>, and an <a exist=\"76182200\" noun=\"greatshield\">oval silver steel greatshield</a> lined with crimson plates.\r\n"
Would you prefer mine in a separate issue?
Ah, yes, please.
@kou it is similar
xml_string = "\"<?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\" standalone=\\\"yes\\\"?>\n <SomeResponse>\n <Purchase>\n <Id>123</Id>\n <SyncToken>0</SyncToken>\n <MetaData>\n <CreateTime>2021-11-23T04:32:03-08:00</CreateTime>\n <LastUpdatedTime>2021-11-23T04:32:03-08:00</LastUpdatedTime>\n </MetaData>\n <TxnDate>2021-11-23</TxnDate>\n <CurrencyRef>USD</CurrencyRef>\n <Line>\n <Id>1</Id>\n <Description>some description</Description>\n <Amount>123.45</Amount>\n <DetailType>AccountBasedExpenseLineDetail</DetailType>\n <AccountBasedExpenseLineDetail>\n <CustomerRef>1</CustomerRef>\n <AccountRef>12</AccountRef>\n <BillableStatus>NotBillable</BillableStatus>\n <TaxCodeRef>NON</TaxCodeRef>\n </AccountBasedExpenseLineDetail>\n </Line>\n <AccountRef>34</AccountRef>\n <PaymentType>CreditCard</PaymentType>\n <EntityRef>1</EntityRef>\n <Credit>false</Credit>\n <TotalAmt>123.45</TotalAmt>\n <PurchaseEx>\n <NameValue>\n <Name>TxnType</Name>\n <Value>99</Value>\n </NameValue>\n </PurchaseEx>\n </Purchase>\n </SomeResponse>\"\n"
best to cut+paste this into a variable to parse it.
It is similar to the XML you showed, but from what I can tell with incorrect quoting
The XML in https://github.com/ruby/rexml/issues/180#issuecomment-2235239789 was generated by puts xml_string
.
Could you confirm whether my understanding in the comment is correct or not?
Here's another example that parses under 3.3.1 but not 3.3.2:
<a>
<?xml version="1.0" encoding="UTF-8"?>
</a>
I know it's not valid XML, but it was unexpected to see in a patch release.
Hmm. Could you explain why do you want to parse the invalid XML?
It's an (abbreviated) version of a spec testing an API response from a third party service that started failing when I tried to upgrade the gem. I imagine it's just gone under the radar for years being silently swallowed.
Could you ask the third party service whether it's an expected response or not?
Is there a solution or workaround available for this issue, other than just downgrading to 3.3.1?
Could you share your use case? You are parsing invalid XML too?
Hello @kou, yes. We parse XML to create a downloadable document from a stored template. With v3.3.2, parsing XML raises the Malformed XML
error for some documents that used to be just fine with previous v3.3.1. So, yes, parsing XML that now is considered invalid.
Are these XMLs are really invalid? If they are invalid, can you fix them? If they are valid (but an error is reported), could you open a new issue with an XML that reproduces the problem?
Thanks for your reply, @kou. I'll do what you suggest and will let you know.
Hello @kou , the XML files were really invalid and could be fixed "manually". Thank you for your support!
OK. Thanks for sharing your case.
It seems that no more problems around invalid XML will be reported. I close this.
TL,DR:
3.3.1
version was more robust against bad input3.3.2
version breaks on bad inputNot sure if you want to treat this as a regression.
Example
We found that some improper input would still be parsed correctly in
3.3.1
, but breaks parsing in3.3.2
.Some gem in combination with a VCR cassette containing proper XML produces poorly formatted output when doing
response.to_s
, and escapes what are the actual string delimiters"
of the response data, and then appends a\n
to it.So
"string"
becomes `"\"string\"\n" 🤦Clearly incorrect data 🤦 but
3.3.1
was robust against it.