morbac / xmltools

XML Tools plugin for Notepad++
GNU General Public License v3.0

XSL Transformation results in Mojibake and does not write out "encoding" to resulting XML file #164

Open clang88 opened 2 years ago

clang88 commented 2 years ago

I'm using the latest Notepad++ (8.4) with XML Tools 3.1.1.13.

My XSL starts like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
```

What I would expect is that my originally UTF-8 encoded XML is transformed with all special characters preserved and `encoding="UTF-8"` added to the declaration. However, what I get is this:

```xml
<?xml version="1.0"?>
9 Unspecified Definition--9:gekrümmtes Trassierungselement
```

Note that the "encoding" attribute is missing from the declaration. Additionally, the "ü" is displayed as the symbol "xFC" in Notepad++ and is converted back to "ü" when I copy and paste it into this window here.

Running the same XSL with Notepad++ 8.1.5 and XML Tools 3.1.1.6 results in the following file:

```xml
9 Unspecified Definition--9:gekrümmtes Trassierungselement
```

The umlaut is, in this version, irrevocably butchered, but the "encoding" attribute is written to the declaration. I believe this might be a bug, as I get the expected results when I use a different processing engine. Any ideas?
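For reference, the same transformation can be cross-checked outside XML Tools. Below is a minimal sketch using lxml's XSLT processor (my assumption; the thread does not say which other engine was used, and the file names are placeholders):

```python
# Cross-check sketch: run the stylesheet with lxml/libxslt instead of XML Tools.
# Assumptions: lxml is installed; "input.xml" and "stylesheet.xsl" stand in for
# the real files.
from lxml import etree

source = etree.parse("input.xml")                      # the UTF-8 source document
transform = etree.XSLT(etree.parse("stylesheet.xsl"))  # the stylesheet quoted above

result = transform(source)

# write_output() serializes according to the stylesheet's xsl:output settings,
# so a requested encoding="UTF-8" should appear in the XML declaration.
result.write_output("output.xml")
```

If this produces a correct UTF-8 file with the expected declaration, the problem likely lies in how the plugin serializes or re-displays the result rather than in the stylesheet itself.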
clang88 commented 2 years ago

Is no one else experiencing this issue? Unfortunately, this makes the XSL Transformation function almost unusable, because you never know in advance what will go wrong.

I'm not a seasoned developer and have no experience with C or C++, but if someone could point me to the XSLT code in the repo, I can try to find what might be causing this issue.

lowellstewart commented 1 year ago

In the past I haven't used the XSL feature much, but today I happened to need it and ran head-on into the same issue, @clang88! For my specific use case, the omission of the encoding in the XML header is not too bad... BUT the character encoding issues (in your example, "ü" displaying as "xFC") are show-stoppers for me.

What appears to be happening is that, even though the current document (the source for the transformation) has UTF-8 encoding, something somewhere gets converted into ANSI (Windows-1252) encoding. This is evidenced by your "ü" becoming xFC, which is its Windows-1252/ANSI encoded value. In my use case, my XML contains other punctuation characters (en-dashes, curly quotes, etc.), and these also come through the XSL transformation with their Windows-1252 single-byte representations. My en-dashes show up as x96, and curly apostrophes show up as x92; all of these are the single-byte ANSI encodings of those characters. The output file CLAIMS to be UTF-8, but that is exactly why we're seeing x96, x92, xFC, etc.: those lone bytes are not valid UTF-8 sequences.
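A quick illustration (mine, not from the plugin) of why exactly those byte values appear: the characters in question are single bytes in Windows-1252 but multi-byte sequences in UTF-8.

```python
# Byte values of the characters mentioned above, in both encodings.
for ch in ["ü", "–", "’"]:   # u-umlaut, en dash, curly apostrophe
    print(repr(ch),
          "cp1252:", ch.encode("cp1252").hex(),  # fc, 96, 92
          "utf-8:", ch.encode("utf-8").hex())    # c3bc, e28093, e28099
```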

Any chance someone would be willing to look into this? I will see if I can put together a simple test case.

lowellstewart commented 1 year ago

By the way... if anybody else runs into this... my "workaround" is to

  1. run the XSL transformation from XMLTools
  2. on the output file, choose Encoding > ANSI to re-interpret the current file as ANSI instead of UTF-8. (This makes x96 show up as an en-dash, etc., correctly.)
  3. THEN choose Encoding > Convert to UTF-8-BOM to ACTUALLY make the file have the desired encoding.
  4. If necessary, add or fix the encoding in the XML file header too.

The above only works for me because my example XML contains characters that exist in Windows-1252/ANSI but are encoded differently in UTF-8. If my source file contained other Unicode characters outside of Windows-1252, I don't know exactly what would happen, but the workaround obviously would not work in that case.
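Steps 2 and 3 can also be scripted outside Notepad++. A minimal sketch, assuming the transformation output really is Windows-1252 bytes as in the cases above (file names are placeholders):

```python
# Re-interpret the transformation output as Windows-1252 and rewrite it as
# UTF-8 with a BOM: the scripted equivalent of Encoding > ANSI followed by
# Encoding > Convert to UTF-8-BOM.
with open("transform_output.xml", "rb") as f:
    raw = f.read()

text = raw.decode("cp1252")        # step 2: treat the bytes as ANSI
with open("transform_output_utf8.xml", "w", encoding="utf-8-sig") as f:
    f.write(text)                  # step 3: write real UTF-8 with a BOM

# Step 4, adding or fixing the encoding attribute in the XML declaration, is
# still a separate edit.
```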