leethomason / tinyxml2

TinyXML2 is a simple, small, efficient, C++ XML parser that can be easily integrated into other programs.
zlib License
5.04k stars 1.83k forks source link

Windows and Utf-8 #885

Open rpatters1 opened 2 years ago

rpatters1 commented 2 years ago

For the Windows compiler I noticed that fopen_s is used without an encoding specifier in LoadFile. According to the docs that means the file will be encoded ANSI rather than utf-8. I'm wondering if it should have the ccs=UTF-8 encoding specifier added explicitly.

In particular I am concerned about whether to supply the file path encoded ANSI or utf-8. This becomes a huge pain when dealing with file paths that contain non-ASCII characters, e.g. 'ƒ', and I'd like to get it right. It would be nice if tinyxml2 guaranteed that a utf-8-encoded file path would work on all platforms. (Or, alternatively, documented the correct encoding for file paths on each platform.)

ajtruckle commented 2 years ago

I use:

// Now try to create a FILE buffer (allows UNICODE filenames)
const auto eResult = _tfopen_s(&fStream, strFileXML, _T("rb"));
const XMLError eXML = rDocXML.LoadFile(fStream);

This approach allows unicode paths.

rpatters1 commented 2 years ago

If I understand it correctly, _tfopen_s allows WCHAR (i.e., utf-16) filenames without the need for explicitly specifying WCHAR. However, I don't believe that addresses how/whether to deal with utf-8 encoded file names.

In either case, I believe there are really two issues for tinyxml2.

  1. Should tinyxml2 be explicitly specifying utf-8 encoding on whichever flavor of fopen_s it uses? I ask because the doc pages say it always uses utf-8 encoding.
  2. Given the file contents are documented to be utf-8 always, should the file paths have the same requirement? Or in any case, should there be something in the docs about the encoding of file paths?

I appreciate you mentioning _tfopen_s. I may look into it for my project that's calling tinyxml2.

CookiePLMonster commented 2 years ago

Windows flavour of callfopen could trivially be modified to allow for UTF-8 paths, but it has a downside that might be a deal breaker for this feature - the only truly reliable, side-effect free way of converting a UTF-8 text to UTF-16 is MultiByteToWideChar but that requires including windows.h.

Not to mention the potential backwards compatibility issues... it's no telling how many projects using tinyxml2 currently assume LoadFile/SaveFile operate on the local codepage, and all those cases would quietly break when upgraded to a newer tinyxml2.

Honestly, I hit the very same issue not 3 days ago so I am on the same boat, but given those backwards compatibility concerns I think the most viable approach is to stick to LoadFile/SaveFile overloads that take a FILE* instead of the path. If you wrap it into a tiny class to use with tinyxml2, you could keep Load/Save as one liners just like now.