
Premake
https://premake.github.io/
BSD 3-Clause "New" or "Revised" License

Chinese characters in source file paths are not supported #629

Closed kinsung closed 7 years ago

kinsung commented 7 years ago

Write a file named premake5.lua and put some .cpp files in the same directory.
If the directory contains .cpp files whose names include Chinese characters, such as "test测试中文路径.cpp", the generated .vcxproj file cannot be opened by Visual Studio: the .vcxproj file contains a few characters that appear as gibberish.

Sample premake5.lua as below:

    solution "LibTest"
        configurations { "Debug" }
        location ("build")

    project "LibTest"
        targetname "LibTest"
        language "C++"
        kind "ConsoleApp"
        location ("build")

        files
        {
            "*.cpp", "*.h"
        }

        configuration "Debug"
            targetdir "bin"
            defines { "_DEBUG", "WIN32" }
fun4jimmy commented 7 years ago

This is because FindFirstFileA and FindNextFileA are used in os_matchstart and os_matchnext respectively; the ANSI versions do not return anything for files whose names contain Unicode characters.

I started on a fix and swapped them out for the Unicode versions, which correctly listed Unicode files, but this caused errors later on in path_normalize when calling isspace. That specific error only occurs in debug builds, as there is a debug-only check (_ASSERTE(c >= -1 && c <= 255);), but I think there would be other knock-on effects in other functions.

The changes also require the use of WideCharToMultiByte and MultiByteToWideChar, plus malloc/free calls for temporary strings, on every call to os_matchstart and os_matchname, which would make those functions slower.

A quick look suggests that some other path-related functions would require similar changes to fully support Unicode on Windows. For example, I think os_getcwd will currently not work if you are within a directory whose name contains Unicode characters.

I think it would make sense to fully support Unicode rather than partially support it in some functions, but I wanted to check that this is something people want before going any further.

I also wasn't sure what to do if a conversion from/to wchar_t fails:

int os_matchname(lua_State* L)
{
    MatchInfo* m = (MatchInfo*)lua_touserdata(L, 1);
    int widelength = (int)wcslen(m->entry.cFileName);

    /* First call: with a NULL buffer, returns the UTF-8 length needed. */
    int multibytelength = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, m->entry.cFileName, widelength, NULL, 0, NULL, NULL);
    if (multibytelength != 0)
    {
        char* filename = (char*)malloc(sizeof(char) * (multibytelength + 1));

        /* Second call: perform the actual conversion into the buffer. */
        int result = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, m->entry.cFileName, widelength, filename, multibytelength, NULL, NULL);
        if (result == multibytelength)
        {
            filename[multibytelength] = '\0';
            lua_pushstring(L, filename);
            free(filename);
            return 1;
        }
        free(filename);
    }
    lua_pushstring(L, "");
    return 1;
}

What would be the desired return values if either call to WideCharToMultiByte fails? Do we even need to check the result of the second call?

tvandijck commented 7 years ago

I think the problem is much bigger than just this one method. Lua has no concept of string encoding at all: strings in Lua are just byte arrays, and encoding is not considered anywhere.

So while it's true you could simply push UTF-8 strings into 'string', string compares and most other functions that operate on strings would break horribly when a multi-byte UTF-8 sequence is encountered. None of the standard Lua functions like io.open, print, printf, etc. would work correctly with UTF-8.

There is a good page on the issues and possible solutions around Unicode in Lua here: http://lua-users.org/wiki/LuaUnicode

fun4jimmy commented 7 years ago

Functions like strcmp don't break with UTF-8, do they? I've previously pushed UTF-8 filenames into strings at work and they have worked so far. We haven't done very thorough testing, though, so you could be correct about multi-byte sequences.

You're definitely correct about the built-in Lua io.open breaking; at the very least it would not open the correct file. If Unicode support also included a UTF-8 friendly override of io.open, that would work around the problem. print and printf would display some nonsense, but as long as people are aware of that when dealing with Unicode filenames, would that be acceptable?

Using the ANSI versions of all the Windows file functions also means that premake can't support extended-length paths. As msbuild/Visual Studio also falls foul of this when compiling, I guess supporting them is not a high priority for premake.

tvandijck commented 7 years ago

Functions like strcmp don't break with UTF8 do they?

From a purely byte-to-byte compare they indeed don't break. However, some characters can be written as more than one code point sequence (a precomposed é versus an e followed by a combining accent, for example), so two strings that render identically can still compare unequal byte for byte. A plain strcmp, or wcscmp after expanding to wchar_t, would fail in such cases; better would be to normalize both strings before comparing, or do the comparison with a Unicode-aware library. The ICU library, for example, supports this very well.

I don't know how hard it would be to fix this across the board for premake, but I encourage any effort for sure ;)

What if a call to WideCharToMultiByte fails?

Those calls only fail when given an invalid input sequence (for UTF-16 input, an unpaired surrogate, for example), which is rather hard to produce, but not impossible I guess. I would simply exit premake with an error, since it would almost certainly be a script error, and os_matchname wouldn't match anything anyway.

fun4jimmy commented 7 years ago

Ah, I didn't realise the same character could have different encodings. To start with, it would be easiest to just push the UTF-8 strings into Lua strings and not handle the failing cases; this would at least cover the case this bug mentions. UTF-8 string comparison could be added as a later feature. Would it be best to add a UTF-8 string as a new type in Lua, or to try to override the default string behaviour? I guess some of the UTF-8 Lua libraries mentioned on that LuaUnicode page might be the way to go.

Would adding ICU as a dependency to premake be acceptable, or should only the needed parts be implemented in premake? I know that ICU is quite a large library (I've compiled it as a Qt dependency previously) and I doubt all of it would be needed.

Also, I'm not sure how to unit test code that deals with the filesystem. Writing temporary files seems messy and also relies on user permissions. Python has a nice package called pyfakefs that mocks the filesystem, but that's not really possible in C++. The usual recommendation is to write a filesystem interface and reimplement it for tests. That wouldn't cover the built-in Lua filesystem functions, although I guess they would need reimplementing for Unicode support anyway. Any ideas?

tvandijck commented 7 years ago

For testing: to Lua these are just tables of function pointers, and we already mock some of them during testing. Not all of the FS-related ones, but we could, I guess.

As for adding UTF-8 support: ICU is much more than just UTF-8, and at 16MB of source it is going to be quite the dependency. I can't imagine it adding much value over finding or writing a UTF-8-only library. It would be awesome if we could simply introduce a "wstring" or "utf8string" type in Lua that handles these cases, rather than extending "string"; it would also separate UTF-8/wide string operations from normal string operations.

But all of this should really be done on the Lua side of things; I don't think it is necessarily a premake thing. I'd rather see this kind of change and support happen in github.com/lua/lua. I'm doubtful they'd take it, but regardless, I think that is the level where this change needs to happen. We can obviously embed it into premake if Lua doesn't want it, but I think it should be written that way.

tvandijck commented 7 years ago

http://www.flexiguided.de/publications.utf8proc.en.html

It's just 3 files to add to your source. One of them is 1.6MB, but I guess that is how Unicode tables are.

starkos commented 7 years ago

In that earlier link, they mention that since Lua 5.3 there's a built-in module called "utf8". Maybe a reason to bite the bullet and upgrade?

tvandijck commented 7 years ago

Ah yes, there is indeed a lutf8lib.c in my version of premake ;) We've been on 5.3 for a while.

Anyway, it has fairly limited support for UTF-8 strings; only five functions:

static const luaL_Reg funcs[] = {
  {"offset", byteoffset},
  {"codepoint", codepoint},
  {"char", utfchar},
  {"len", utflen},
  {"codes", iter_codes},
  /* placeholders */
  {"charpattern", NULL},
  {NULL, NULL}
};

but it's a start for sure.

fun4jimmy commented 7 years ago

Does that gain anything over pushing UTF-8 strings into normal strings? The utf8proc library or something similar would still be needed for UTF-8 string comparison, and all the IO/filesystem functions would still need overriding.

tvandijck commented 7 years ago

I'm closing this since we merged #728. if this is still an issue, please reopen.