tklab-tud / uscxml

SCXML interpreter and transformer/compiler written in C/C++ with bindings to Java, C#, Python and Lua

Correct UTF-8 processing across whole infrastructure #144

Closed 1aNam closed 7 years ago

1aNam commented 7 years ago

Setting the value of a JSC datamodel variable to "ü" and calling the JSC DM equality operator <log expr="variable == 'ü'"/> yields a null value for the logger. When the SCXML DOM is processed during interpreter initialization, these expressions are stored in the platform's native one-byte ANSI encoding, at least on Windows. When the expressions are later evaluated using the method JSCDataModel::evalAsValue, and more precisely the method JSStringCreateWithUTF8CString, they fail to transfer into a JSString because the source encoding is not UTF-8. The JSC library throws an exception in this case.

I somehow managed to solve this problem by modifying the methods X(const XMLCh* const toTranscode) and X(const std::string& fromTranscode) in DOM.h with WIN32-specific behaviour: storing the content of the variable _localForm always as UTF-8 chars and _unicodeForm always as UTF-16 big endian. After doing so, I can successfully run the aforementioned DM evaluations and some other things. Please see the changes in the file DOM.h in my fork.
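
For reference, the WIN32-specific conversion described above could look roughly like this; a minimal sketch, assuming XMLCh maps to wchar_t on WIN32 (the helper name is hypothetical, not the actual code from the fork):

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical helper: convert Xerces' internal UTF-16 (XMLCh*) to UTF-8
// directly, instead of letting XMLString::transcode fall back to the
// system ANSI codepage.
static std::string xmlChToUtf8(const wchar_t* utf16) {
    if (utf16 == NULL || *utf16 == L'\0')
        return std::string();
    // First call computes the required buffer size in bytes (incl. NUL).
    int bytes = WideCharToMultiByte(CP_UTF8, 0, utf16, -1, NULL, 0, NULL, NULL);
    if (bytes <= 0)
        return std::string();
    std::string utf8(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16, -1, &utf8[0], bytes, NULL, NULL);
    utf8.resize(bytes - 1); // drop the trailing NUL written by the API
    return utf8;
}
#endif
```

In X(const XMLCh* const toTranscode), _localForm would then be assigned from such a helper instead of from XERCESC_NS::XMLString::transcode.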

This whole thing gets more complicated when interoperability with the bindings is needed. I partially solved this issue for the C# bindings; it can be found in this patch.

The test is also ready as a UTF-8 (with BOM) encoded SCXML file. But you already made one yourself.

A related issue that seems insoluble is correct UTF-8 output to std::cout from the StdOutLogger on Windows. But this seems to be caused by quirks of the Windows console.

sradomski commented 7 years ago

I did write a few UTF-8 tests in test/custom and test/src/test-utf8.cpp and they all work fine for me. I will have a look at your files when I have some more time.

sradomski commented 7 years ago

This one looks a lot like your attempts in test144.scxml; does it pass on your platform?

1aNam commented 7 years ago

Well, the condition "Var1===Var2" passes.

The condition using the literal "Var1==='ü'", on the other hand, fails. It passes when comparing an ASCII character variable against an ASCII literal, though.

Furthermore, during SCXML init these UTF-8 characters are passed into the DM as single-byte characters when a mapping for them exists in the system's default codepage. In my case it's the Czech Windows codepage 1250, and it can cover the ü character. See the attached screenshot from the VS debugger (utf8_var).

sradomski commented 7 years ago

Still passes with cond="Var1==='ü'" on MacOS. The XML preamble claims encoding="UTF-8"; maybe you are not actually using UTF-8 but codepage 1250? Can't you just use proper UTF-8 to author SCXML documents?

Otherwise I cannot see how I would debug this. My assumption is that I can use std::string, as it merely serves as a container for the encoded bytes, and ECMAScript uses UCS-2 anyway?

sradomski commented 7 years ago

If you replace both UTF-8 üs with your 0xFC character from the Windows-1250 codepage, does it no longer work?

1aNam commented 7 years ago

There is some kind of implicit conversion from the UTF-8 encoded SCXML into windows-1250 chars passed into the JS context. That's the 0xFC byte seen in the screenshot; it matches the windows-1250 mapping according to the wiki.

Now, when changing the SCXML file's encoding to windows-1250 (both at the byte level and in the preamble, i.e. <?xml version="1.0" encoding="windows-1250"?>), the same 0xFC character is passed into the JS context, with the same failure during the subsequent content comparison. This byte is definitely not a valid UTF-8 character, and who knows what comes of it in the JSC JSStringCreateWithUTF8CString method.

Subsequent attempts to get the data's content via dataModel.evalAsData("Var2") lead to a null result due to a JSC exception in the method JSEvaluateScript().
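
For what it's worth, that exception can be inspected directly. A minimal sketch using the JavaScriptCore C API (names chosen for illustration; assumed to be close to what evalAsData does internally):

```cpp
#include <JavaScriptCore/JavaScriptCore.h>
#include <cstdio>

// Evaluate an expression and print the JSC exception, if any, to see
// why evalAsData ends up returning a null result.
static void evalAndReport(JSGlobalContextRef ctx, const char* expr) {
    JSStringRef scriptJS = JSStringCreateWithUTF8CString(expr);
    JSValueRef exception = NULL;
    JSValueRef result = JSEvaluateScript(ctx, scriptJS, NULL, NULL, 0, &exception);
    JSStringRelease(scriptJS);
    if (result == NULL && exception != NULL) {
        JSStringRef msg = JSValueToStringCopy(ctx, exception, NULL);
        char buf[256];
        JSStringGetUTF8CString(msg, buf, sizeof(buf));
        JSStringRelease(msg);
        std::printf("JSC exception: %s\n", buf);
    }
}
```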

Even if changing the document's encoding to single-byte windows-1250 somehow helped, it would not be good enough due to the limited character coverage of such an encoding. For instance, such an encoding would not be capable of storing both the ü character and some character from the Greek alphabet.

I don't know the internals of MacOS, specifically its internal default encoding. Maybe even LLVM plays some role in the difference we both see. Could you run your test in debug mode and inspect the bytes being passed into the JSC DM when processing the ü character? Maybe they are proper UTF-8 bytes on your platform.

1aNam commented 7 years ago

I too think that std::string can serve as a universal container for bytes, even UTF-8. ECMA's internal encoding should be UTF-16, big endian presumably. The conversion should be done by the JSStringCreateWithUTF8CString you are using. So far, so good.

However, I'm not sure there is enough assurance that the bytes in the std::strings being passed into the JSC DM really are, without any doubt, UTF-8. The Xerces methods used in DOM construction during init seem insufficient to guarantee this on all platforms.
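
One way to make that assurance explicit would be to validate the bytes before handing them to the JSC DM. A minimal structural check could look like this (purely illustrative, not part of uscxml; a full validator would also reject overlong encodings and surrogates):

```cpp
#include <string>

// Returns true if the byte sequence is structurally valid UTF-8.
static bool looksLikeValidUtf8(const std::string& s) {
    size_t i = 0;
    while (i < s.size()) {
        unsigned char c = (unsigned char)s[i];
        size_t extra;
        if (c < 0x80)                extra = 0; // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1; // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2; // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3; // 4-byte sequence
        else return false; // invalid lead byte (e.g. a lone 0xFC)
        if (i + extra >= s.size())
            return false;  // truncated sequence
        for (size_t j = 1; j <= extra; ++j)
            if (((unsigned char)s[i + j] & 0xC0) != 0x80)
                return false; // continuation byte expected
        i += extra + 1;
    }
    return true;
}
```

A single windows-1250 byte like 0xFC fails this check immediately, which is exactly the situation described above.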

1aNam commented 7 years ago

One last discovery from the Win platform: when setting a variable's value to ü via the JSC Unicode UTF-16 escape notation <data id="Var1" expr="'\u00FC'"/>, the subsequent call dataModel.evalAsData("Var1") returns a std::string with the correct UTF-8 representation of the ü character, i.e. 0xC3 0xBC.

sradomski commented 7 years ago

I will try to force the Win7 installation in my VM to a German codepage and reproduce.

sradomski commented 7 years ago

Codepage 1250 is not supported by the xercesc parser by default. You can try to build xercesc with ICU and see whether this makes a difference, or you can use UTF-8 or UTF-16 to author SCXML documents. However, codepage 1252 is supposedly supported without ICU, and I'll try to see whether there is something missing to support it.

1aNam commented 7 years ago

The XML encoded in CP 1250 was only for testing. Normally I encode all XMLs in UTF-8 with BOM and with the <?xml version="1.0" encoding="utf-8"?> preamble.

Doing so, the debugger shows that Xerces internally stores chars from this UTF-8 source file as UTF-16. After all, Xerces' internal char processing uses the type XMLCh, which evaluates to wchar_t on WIN32. That's good; we don't lose any char coverage in the UTF-8 <-> UTF-16 conversion, IMO.

The char cropping (UTF-8/16 -> single-byte ANSI CP) seems to take place in the uscxml DOM. There is a method X(const XMLCh* const toTranscode) calling the function XERCESC_NS::XMLString::transcode and storing its result in the _localForm variable. On WIN32, XERCESC_NS::XMLString::transcode calls the Xerces method char* Win32LCPTranscoder::transcode(const XMLCh* const toTranscode, MemoryManager* const manager), which ultimately converts chars from UTF-16 to single-byte chars by means of the WIN32 call WideCharToMultiByte(CP_ACP, 0, (LPCWSTR)toTranscode, -1, retVal, (int)neededLen+1, NULL, NULL);

The symbol CP_ACP in its parameters stands for the system default Windows ANSI code page. According to the Windows documentation: "This value can be different on different computers, even on the same network. It can be changed on the same computer, leading to stored data becoming irrecoverably corrupted. This value is only intended for temporary use and permanent storage should use UTF-16 or UTF-8 if possible."

I did not inspect the DOM class code fully. Besides _localForm there is also the _unicodeForm variable. What's the purpose of this duality? Is _unicodeForm intended for manipulation on the Xerces level only? Perhaps the value of _localForm can be set to UTF-8 form by a custom conversion method, effectively bypassing Xerces' use of WideCharToMultiByte with CP_ACP, at least on the WIN32 platform.

sradomski commented 7 years ago

I did add some tests for UTF-8, ISO-8859-15 and WINDOWS-1252 and they pass on Windows 7 for me. Do they also pass on your platform? There are some issues with ISO-8859-15, as the XML encoding name differs; I will fix that later.

sradomski commented 7 years ago

Ok, I added tests for UTF-8, UTF-16, ISO-8859-1 and WINDOWS-1252 to the various W3C datamodel tests. Please see whether they pass on your end as well.

1aNam commented 7 years ago

I tried to run the following tests.

All of them fail on my system. In all cases the character is passed into the datamodel as a single 0xFC byte. I really think the whole issue is somehow related to Xerces. When running the test application test-utf8, where there is no XML parsing (data is passed into the DM via literals in the Data::fromJSON method), full UTF-8 bytes are passed into the DM.

I'll try to play a little with a Xerces build using ICU. I'm already using ICU for the JavaScriptCore build, so there would be no extra dependency for me.

1aNam commented 7 years ago

As for the tests, it seems a good idea to extend their test strings to a form containing characters from more than one single-byte codepage.

The reason is that the OS can provide a single-byte codepage covering the ü char, but the same CP cannot simultaneously cover, for example, the Arabic char ش (from CP WINDOWS-1256). A comparison of strings like üش would thus be more conclusive in proving that UTF-8 is really used for processing.
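
For illustration, the byte level makes this obvious; a hypothetical check, not one of the existing tests:

```cpp
#include <cassert>
#include <cstring>

int main() {
    // UTF-8 bytes for "ü" (0xC3 0xBC) followed by "ش" (0xD8 0xB4).
    // No single-byte Windows codepage covers both characters at once,
    // so a test string like this can only survive as genuine UTF-8.
    const char mixed[] = "\xC3\xBC\xD8\xB4";
    assert(std::strlen(mixed) == 4);
    return 0;
}
```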

sradomski commented 7 years ago

I did add considerably more complicated encoding tests. Please tell me whether they pass with your locale.

1aNam commented 7 years ago

The new tests all passed; more precisely, the comparisons evaluated positively.

Unfortunately, the Unicode values not covered by the default (Windows locale dependent) ANSI codepage got malformed during processing. For example, all hiragana characters were stored in the DOM as the default-codepage character "?". This is the normal behaviour of the Windows function WideCharToMultiByte() when converting into an ANSI codepage. The comparisons were positive because "?" matches "?"; the Unicode chars were already lost before the comparison.

After building Xerces with the --enable-transcoder-icu configure parameter (thanks for the hint!), the only resulting difference was that ICU's "Unicode -> ANSI" conversion method was called instead of WideCharToMultiByte() during DOM init. Namely, the method ucnv_fromUChars() was used. It replaced all Unicode characters not covered by the codepage with the char 0x1A. So still no luck, only 0x1A instead of "?".

However, after some research into Xerces' ICU usage, I found that the default ICU converter set up by Xerces uses the platform-specific default codepage as the conversion target. This is set in Xerces' method XMLLCPTranscoder* ICUTransService::makeNewLCPTranscoder(MemoryManager* manager) in the source file ICUTransService.cpp by calling the ICU method ucnv_open(NULL, &uerr). Note the NULL value of the first parameter: it selects the default system CP as the target for conversions. After setting this parameter to the value "utf-8" and rebuilding Xerces with this modified ICU converter, I got a successful conversion into UTF-8 when storing Unicode chars in the uscxml DOM.
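
For reference, a minimal standalone sketch of the difference using the plain ICU C API (illustrative only, not the actual Xerces code):

```cpp
#include <unicode/ucnv.h>
#include <string>

// Convert a UTF-16 buffer to bytes through a named ICU converter.
// Passing NULL as the converter name (as makeNewLCPTranscoder does)
// selects the system default codepage; "UTF-8" forces UTF-8 output.
static std::string convertWithIcu(const UChar* src, int32_t srcLen,
                                  const char* converterName) {
    UErrorCode err = U_ZERO_ERROR;
    UConverter* cnv = ucnv_open(converterName, &err); // e.g. "UTF-8" or NULL
    if (U_FAILURE(err))
        return std::string();
    char buf[256];
    int32_t len = ucnv_fromUChars(cnv, buf, sizeof(buf), src, srcLen, &err);
    ucnv_close(cnv);
    return U_SUCCESS(err) ? std::string(buf, len) : std::string();
}
```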

Could you debug the Unicode test in your Win VM and inspect the bytes used to store the hiragana chars? I doubt they are stored in UTF-8; they are probably malformed into default-CP chars before the comparison.

I'm not sure how to proceed now regarding universal uscxml Unicode handling that works correctly on all platforms. The ICU way works for me, but it involves a non-standard build of Xerces; moreover, verification of correct behaviour on other platforms would be necessary. My previous attempt to use the std::codecvt template on Windows works too, but there may be some unresolved issues, like compatibility with the C89 DM.
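
The std::codecvt approach mentioned above could look roughly like this; a sketch using the C++11 <codecvt> facilities (deprecated in C++17, but current at the time of this thread):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert UTF-16 data (Xerces' XMLCh units on Windows) to UTF-8
// without going through the platform's ANSI codepage.
static std::string utf16ToUtf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```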

sradomski commented 7 years ago

Maybe you could clarify the correct usage of ICU with the xerces-c maintainers? I can't imagine that you'd have to edit their source, as xerces-c is as old as XML and this particular implementation is used on many platforms and in many programming languages. So there ought to be more to it.

sradomski commented 7 years ago

I am going to close this. As far as I can see, there is no issue if you write your XML documents in one of the various encodings supported by Xerces-C by default.

alexzhornyak commented 3 years ago

Hi, the problem still exists and the issue should be reopened.

Steps to reproduce:

Conditions:

Windows any version, codepage 1251

Actions:

  1. Execute test-enc-UTF8.scxml

Results:

  1. We get Var14 from the data element:

```cpp
char* tmp = XERCESC_NS::XMLString::transcode(toTranscode);
_localForm = std::string(tmp);
```

     After the conversion we have std::string Var14 = "В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!". It is an ANSI-1251 string, because the transcode method transcodes the string to the native code page.

Later we perform:

```cpp
JSStringRef scriptJS = JSStringCreateWithUTF8CString(expr.c_str());
JSValueRef exception = NULL;
JSValueRef result = JSEvaluateScript(_ctx, scriptJS, NULL, NULL, 0, &exception);
```

And we get a hard exception here because the sizes differ: JSStringCreateWithUTF8CString expects a UTF-8 string, i.e. Var14 would have to be the UTF-8 encoding of 'В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!'.

Possible Solutions

  1. Change everything to Unicode std::wstring
  2. Modify X(const XMLCh* const toTranscode) to force conversion to a UTF-8 string (see the sketch below)
  3. Convert from the native code page in the datamodels
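
For option 2, a portable sketch could use Xerces' own TranscodeToStr utility to request UTF-8 explicitly instead of the native code page; a sketch assuming Xerces-C 3.x, with XERCESC_NS and _localForm as used in the sources quoted above:

```cpp
#include <xercesc/util/TransService.hpp>
#include <string>

// Sketch: transcode Xerces' UTF-16 straight to UTF-8 instead of
// the platform's native code page.
static std::string toUtf8(const XMLCh* const toTranscode) {
    XERCESC_NS::TranscodeToStr utf8(toTranscode, "UTF-8");
    return std::string(reinterpret_cast<const char*>(utf8.str()),
                       utf8.length());
}

// In X(const XMLCh* const toTranscode):
//     _localForm = toUtf8(toTranscode);
```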