tafia / quick-xml

Rust high performance xml reader and writer
MIT License

How would I parse character references as literal bytes and not codepoints? #667

Open Dekkonot opened 1 year ago

Dekkonot commented 1 year ago

I have an element like this:

<element>&#240;&#159;&#152;&#131;</element>

If those references are interpreted as literal bytes, they form the byte sequence f0 9f 98 83, which is the UTF-8 encoding of U+1F603, or 😃. Instead, each reference is expanded as a code point (U+00F0, U+009F, U+0098, U+0083), so the text becomes c3 b0 c2 9f c2 98 c2 83 in UTF-8 (this sequence is not printable, but you may inspect it here).
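To make the difference concrete, here is a minimal sketch using only the Rust standard library. The function names are hypothetical and are not part of quick-xml; they only contrast the two interpretations of the same four numbers.

```rust
/// Interpret each number as a Unicode code point, which is what the
/// XML specification prescribes for character references.
fn refs_as_code_points(values: &[u32]) -> Option<String> {
    // `char::from_u32` returns None for surrogates and out-of-range values.
    values.iter().map(|&v| char::from_u32(v)).collect()
}

/// Interpret each number as one byte of a UTF-8 sequence, which is what
/// the producer of the file apparently intended.
fn refs_as_utf8_bytes(values: &[u8]) -> Option<String> {
    // Fails (None) if the bytes do not form valid UTF-8.
    String::from_utf8(values.to_vec()).ok()
}
```

Called with 240, 159, 152, 131, the first function yields a four-character string whose UTF-8 bytes are c3 b0 c2 9f c2 98 c2 83, while the second yields the single character 😃.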

I am aware that this is exactly how it is meant to work. Unfortunately, I neither made the decision to write the file this way nor have any control over it. So I'd like to know whether there is an obvious way to change how character references are resolved, short of iterating through the bytes returned by a Text event myself.

Mingun commented 1 year ago

I may be wrong, but it seems that you should use

<element>&#x1F603;</element>

instead. Character references are supposed to refer to Unicode code points directly, not to bytes in some unspecified encoding. A non-normative confirmation of this can be found, for example, here (just the first site from Google): the HTML entity for U+1F603 is &#x1F603;

Dekkonot commented 1 year ago

Right, that is what I would do if given the opportunity. Unfortunately the program that generates these doesn't do it right and I'm left trying to parse it correctly.

I'm filing a bug report with them, but there is no telling how long a fix will take, if one ever comes, and in the meantime I still have to parse their files.

Mingun commented 1 year ago

Then it seems that it just writes UTF-8 encoded byte arrays for some characters, and those byte arrays are encoded as lists of character references. You have to decode the string yourself. Get the raw data using .into_inner() (note that these bytes may need to be decoded first using reader.decoder() if you use a non-UTF-8 encoding) and convert it to a string yourself. You will need to copy and modify the implementation of unescape.
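A rough sketch of what such a modified unescape could look like, using only the standard library. The function unescape_refs_as_bytes is a hypothetical helper, not quick-xml API; it assumes the raw text has already been decoded to a &str, treats numeric references up to 255 as raw bytes, and leaves named entities such as &amp;amp; untouched.

```rust
/// Hypothetical helper (not part of quick-xml): resolves numeric character
/// references in `raw` as raw bytes instead of code points, then validates
/// the result as UTF-8.
fn unescape_refs_as_bytes(raw: &str) -> Result<String, String> {
    let mut out: Vec<u8> = Vec::with_capacity(raw.len());
    let mut rest = raw;
    while let Some(start) = rest.find("&#") {
        // Copy everything before the reference unchanged.
        out.extend_from_slice(rest[..start].as_bytes());
        let after = &rest[start + 2..];
        let end = after
            .find(';')
            .ok_or_else(|| "unterminated character reference".to_string())?;
        let body = &after[..end];
        // `&#xHH;` is hexadecimal, `&#NNN;` is decimal.
        let parsed = match body.strip_prefix('x') {
            Some(hex) => u32::from_str_radix(hex, 16),
            None => body.parse::<u32>(),
        };
        let value = parsed.map_err(|e| format!("bad reference `{}`: {}", body, e))?;
        if value <= 0xFF {
            // Assumption: small values are bytes of a UTF-8 sequence.
            out.push(value as u8);
        } else {
            // Larger values can only be real code points; encode normally.
            let c = char::from_u32(value)
                .ok_or_else(|| format!("invalid code point {:#x}", value))?;
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
        }
        rest = &after[end + 1..];
    }
    // Copy the tail after the last reference.
    out.extend_from_slice(rest.as_bytes());
    String::from_utf8(out).map_err(|e| e.to_string())
}
```

Passing the text of the element from the original question, "&#240;&#159;&#152;&#131;", returns 😃. One trade-off of this approach is that a legitimate reference to a code point in the range U+0080 to U+00FF can no longer be distinguished from a raw byte, so it only makes sense for input known to come from the misbehaving producer.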

Mingun commented 4 months ago

After #766 is merged, you will be able to resolve character references as you wish (but only in text, not in attribute values).