gnieh / fs2-data

streaming data parsing and transformation library
https://fs2-data.gnieh.org
Apache License 2.0
152 stars 27 forks source link

Support for non-UTF `encoding` in xml parser #332

Open armanbilge opened 2 years ago

armanbilge commented 2 years ago

Very excited about the enhanced XML support in 1.4.0 :) I've been experimenting with it in https://github.com/http4s/http4s-scala-xml/pull/25 and running into trouble with non UTF encodings. FTR I'm no expert in these things :)

For example this request:

Content-Type: application/xml

<?xml version="1.0" encoding="iso-8859-1"?><hello name="Günther"/>

as used in this test: https://github.com/http4s/http4s-scala-xml/blob/1ca64f2ab7ef500d384d2ec5f8caf88df600e6a6/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala#L198-L209

Furthermore the RFC specifies:

Since the charset parameter is not provided in the Content-Type
header and there is no overriding BOM, conformant XML processors must
treat the "iso-8859-1" encoding as authoritative.  Conformant XML-
unaware MIME processors should make no assumptions about the
character encoding of the XML MIME entity.

https://datatracker.ietf.org/doc/html/rfc7303#section-8.3

I'm not sure if there is a way to support this without an XML parser that operates directly on bytes instead of chars/strings 😕 any thoughts? Thanks!

satabin commented 2 years ago

Thanks for reporting. My parser indeed works on decoded characters, so you need to decode it in the right way before going through the XML event pipe. This can be done with fs2-data by using the appropriate import which brings the decoder in scope from Byte to something that can emit Chars.

The parser operates then on these bytes without changing the decoding if it does not match the one provided.

I remember thinking about implementing this behavior but I decided to stay with an implementation that is not aware of the encoding. It trusts blindly the strings that were decoded for it.

In the case of your issue, I am not entirely sure how the string is decoded from bytes, I should have a closer look at it.

armanbilge commented 2 years ago

Thanks for the response! I'm digging into this again since it's all rather confusing.

Currently looking at https://www.w3.org/TR/2008/REC-xml-20081126/#charencoding

Which says:

Each external parsed entity in an XML document may use a different encoding for its characters.

Which again suggests to me, that XML-parsing should operate directly on byte-streams, rather than decoded Strings.

satabin commented 2 years ago

Parsed entities are DTD related (you have internal and external ones). What this says is that every externally defined entity (i.e. in a DTD that is physically in another file) might use a different encoding.

satabin commented 2 years ago

Basically the approach taken in fs2-data is:

In your case, I would expect bodyText to decode text according to the declared charset (which was set to the source). I am not familiar with http4s types anymore, I need to dig a bit more to see where the problem would lie. If the text is actually latin9 encoded, and the message declares this charset, I would expect the string coming out of bodyText to be properly decoded, but I might be wrong, I need to check.

armanbilge commented 2 years ago

In your case, I would expect bodyText to decode text according to the declared charset

Right. I think this situation is testing when there is no declared charset, except for the encoding specified in the XML itself.

Still, I think are you right that this might be a problem to solve in http4s. Further reading:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.

https://www.w3.org/TR/2008/REC-xml-20081126/#charencoding

satabin commented 2 years ago

In the test, can you try making an implicit ISO-8859-1 charset available in scope of the test and see if it solves it? bodyText takes an implicit charset, which is UTF-8 by default.

armanbilge commented 2 years ago

Right, so the test is currently declared like this:

  test("parse omitted charset and 8-Bit MIME Entity") {
    // https://datatracker.ietf.org/doc/html/rfc7303#section-8.3
    encodingTest(
      Chunk.array(
        """<?xml version="1.0" encoding="iso-8859-1"?><hello name="Günther"/>""".getBytes(
          StandardCharsets.ISO_8859_1
        )
      ),
      "application/xml",
      "Günther",
    )
  }

That fails.

However, if I specify the charset like this, then it passes:

diff --git a/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala b/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
index 8739d94..cfd9622 100644
--- a/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
+++ b/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
@@ -203,7 +203,7 @@ class ScalaXmlSuite extends CatsEffectSuite with ScalaCheckSuite {
           StandardCharsets.ISO_8859_1
         )
       ),
-      "application/xml",
+      "application/xml; charset=iso-8859-1",
       "Günther",
     )
   }

But the whole point of that test is to pass without specifying the charset.

test("parse omitted charset and 8-Bit MIME Entity")
satabin commented 2 years ago

No I meant something adding implicit val charset = Charset.forName("iso.8859.15") or similar in the test itself. Since no encoding is provided in the answer, you need to specify how to decode the body.

armanbilge commented 2 years ago

Sorry, I guess I'm confused 😕 I understand the code changes you are suggesting (and I expect it will work), but I don't understand what it will demonstrate? Since in practice there would be no way to know what the implicit charset should be.

armanbilge commented 2 years ago

@rossabaker if you have a moment to weigh in here it would be appreciated 🙏

ybasket commented 1 year ago

@satabin I solved the problem with a small hack in http4s-fs2-data-xml, see https://github.com/http4s/http4s-fs2-data/blob/main/scala-xml/src/main/scala/org/http4s/scalaxml/ElemInstances.scala#L53-L134.

Would you see this as something we could pull into fs2-data (partially, of course we don't have HTTP headers, only the XML prolog)? Maybe as some kind of configurable behaviour? I'm a bit on the fence…it brings us closer to the XML spec, but would be one of the least elegant parts of this project as it goes against our abstractions like CharLikeChunks. Or maybe we find a reasonable way of back-channeling charset info which we can implement in 2.0.

If we don't pull it in or explore alternatives, I would close this issue as http4s has a way forward.

satabin commented 1 year ago

I am not a fan of what I did with the CharLikeChunks, it is not flexible enough, and I was thinking about a better approach for 2.0, that would allow for this kind of behavior.

The current approach in fs2-data 1.x is to have characters decoded outside of fs2-data, which does not work well with this kind of of format.

I would rather not integrate it currently and think about a better approach for this problem and abstraction and integrate it in 2.0. WDYT?