Open armanbilge opened 2 years ago
Thanks for reporting. My parser indeed works on decoded characters, so you need to decode it in the right way before going through the XML event pipe. This can be done with fs2-data by using the appropriate import which brings the decoder in scope from Byte
to something that can emit Char
s.
The parser operates then on these bytes without changing the decoding if it does not match the one provided.
I remember thinking about implementing this behavior but I decided to stay with an implementation that is not aware of the encoding. It trusts blindly the strings that were decoded for it.
In the case of your issue, I am not entirely sure how the string is decoded from bytes, I should have a closer look at it.
Thanks for the response! I'm digging into this again since it's all rather confusing.
Currently looking at https://www.w3.org/TR/2008/REC-xml-20081126/#charencoding
Which says:
Each external parsed entity in an XML document may use a different encoding for its characters.
Which again suggests to me, that XML-parsing should operate directly on byte-streams, rather than decoded String
s.
Parsed entities are DTD related (you have internal and external ones). What this says is that every externally defined entity (i.e. in a DTD that is physically in another file) might use a different encoding.
Basically the approach taken in fs2-data is:
In your case, I would expect bodyText
to decode text according to the declared charset (which was set to the source).
I am not familiar with http4s types anymore, I need to dig a bit more to see where the problem would lie. If the text is actually latin9 encoded, and the message declares this charset, I would expect the string coming out of bodyText
to be properly decoded, but I might be wrong, I need to check.
In your case, I would expect
bodyText
to decode text according to the declared charset
Right. I think this situation is testing when there is no declared charset, except for the encoding
specified in the XML itself.
Still, I think are you right that this might be a problem to solve in http4s. Further reading:
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.
In the test, can you try making an implicit ISO-8859-1 charset available in scope of the test and see if it solves it? bodyText
takes an implicit charset, which is UTF-8 by default.
Right, so the test is currently declared like this:
test("parse omitted charset and 8-Bit MIME Entity") {
// https://datatracker.ietf.org/doc/html/rfc7303#section-8.3
encodingTest(
Chunk.array(
"""<?xml version="1.0" encoding="iso-8859-1"?><hello name="Günther"/>""".getBytes(
StandardCharsets.ISO_8859_1
)
),
"application/xml",
"Günther",
)
}
That fails.
However, if I specify the charset like this, then it passes:
diff --git a/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala b/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
index 8739d94..cfd9622 100644
--- a/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
+++ b/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
@@ -203,7 +203,7 @@ class ScalaXmlSuite extends CatsEffectSuite with ScalaCheckSuite {
StandardCharsets.ISO_8859_1
)
),
- "application/xml",
+ "application/xml; charset=iso-8859-1",
"Günther",
)
}
But the whole point of that test is to pass without specifying the charset.
test("parse omitted charset and 8-Bit MIME Entity")
No I meant something adding implicit val charset = Charset.forName("iso.8859.15")
or similar in the test itself. Since no encoding is provided in the answer, you need to specify how to decode the body.
Sorry, I guess I'm confused 😕 I understand the code changes you are suggesting (and I expect it will work), but I don't understand what it will demonstrate? Since in practice there would be no way to know what the implicit
charset should be.
@rossabaker if you have a moment to weigh in here it would be appreciated 🙏
@satabin I solved the problem with a small hack in http4s-fs2-data-xml
, see https://github.com/http4s/http4s-fs2-data/blob/main/scala-xml/src/main/scala/org/http4s/scalaxml/ElemInstances.scala#L53-L134.
Would you see this as something we could pull into fs2-data
(partially, of course we don't have HTTP headers, only the XML prolog)? Maybe as some kind of configurable behaviour? I'm a bit on the fence…it brings us closer to the XML spec, but would be one of the least elegant parts of this project as it goes against our abstractions like CharLikeChunks
. Or maybe we find a reasonable way of back-channeling charset info which we can implement in 2.0.
If we don't pull it in or explore alternatives, I would close this issue as http4s has a way forward.
I am not a fan of what I did with the CharLikeChunks
, it is not flexible enough, and I was thinking about a better approach for 2.0, that would allow for this kind of behavior.
The current approach in fs2-data 1.x is to have characters decoded outside of fs2-data, which does not work well with this kind of of format.
I would rather not integrate it currently and think about a better approach for this problem and abstraction and integrate it in 2.0. WDYT?
Very excited about the enhanced XML support in 1.4.0 :) I've been experimenting with it in https://github.com/http4s/http4s-scala-xml/pull/25 and running into trouble with non UTF encodings. FTR I'm no expert in these things :)
For example this request:
as used in this test: https://github.com/http4s/http4s-scala-xml/blob/1ca64f2ab7ef500d384d2ec5f8caf88df600e6a6/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala#L198-L209
Furthermore the RFC specifies:
https://datatracker.ietf.org/doc/html/rfc7303#section-8.3
I'm not sure if there is a way to support this without an XML parser that operates directly on bytes instead of chars/strings 😕 any thoughts? Thanks!