Broken Unicode support in Win32 File Provider.

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Access a file under Windows which contains Unicode or ANSI characters >= 
0x80.

What is the expected output? What do you see instead?
The Windows file provider incorrectly converts ANSI strings from UTF-8 to 
Unicode and fails to access the file.

What version of the product are you using? On what operating system?
Trunk, rev. 468; Windows Vista

Please provide any additional information below.

I really don't want to sound mean, but the Unicode support for Win32 is wrong 
– by design. All usages of "Utf8ToFilename" will create wrong Unicode 
strings, because the input string simply is not UTF-8 encoded. The input 
strings – the command line arguments are ANSI encoded. Even if you set the 
project type to "Unicode" the "main" function will get invoked with ANSI 
strings. So, what "Utf8ToFilename" is doing is actually converting an ANSI 
string to Unicode – which is something that this function can not do, and 
should not even do. In case the input string contains ANSI characters >= 80 
(e.g. Umlaut A ("Ä")) the "Utf8ToFilename" function will print some error info 
about "invalid continuation character" and will create a wrong Unicode string. 
Thus such files can not get accessed.

I can clearly see that the  developer of the "Utf8ToFilename" invested a lot of 
work to solve all sort of things – but all that work is lost because the 
input data is already wrong. 

To implement Unicode support for Win32 without changing all file name arguments 
in various functions throughout the library to "wstring" you have 2 
possibilities.

1) Convert the ANSI strings from the command line (the "argv" strings) to 
UTF-8, pass them through the library (without changing any function prototypes) 
and convert them eventually from UTF-8 to Unicode in the Win32 file provider. 
That's technically possible, but unfortunately without any gain, because it 
would not lead to real/full Unicode support. Due to that the command line 
arguments were already converted by the C-RTL to ANSI (right before invoking 
'main'), some Unicode characters might be already lost if they can not get 
represented in the ANSI codepage. So, there is no gain to convert the resulting 
characters to Unicode again. Simple, because the relevant Unicode data was 
already discarded.
BTW: Most of the code in "Utf8ToFilename" is redundant with existing Windows 
functions which could have been used for implementing that class. (See "Windows 
Base Services" and "Shell Lightweight API" topics in MSDN.)

2) Use the ANSI strings from the command line (the "argv" strings) simply as 
they are, meaning as ANSI strings and pass them to either the Win32 ANSI API or 
even better to the ANSI-C C-RTL stream functions (e.g. "fopen"). This way, the 
library and command line tools would at least have working ANSI code page 
support, which is fine for almost all users anyway. But right now (with the 
current code in trunk) there is neither Unicode nor ANSI support, right now 
there is only ASCII-7 support. Additionally when using the "fopen" and related 
functions, it would also solve the problem with the missing file IO buffers in 
the Win32 file provider.

In other words, using the already existing POSIX File/FileSystem 
implementations with minimal changes works fine for Win32 and gives at least 
full ANSI support.

I also recommend to add "setlocale(LC_CTYPE, "")" to all "main" functions of 
the tools, this will let the C-RTL output correct encoded ANSI strings (e.g. 
filenames) to the console.

Original issue reported on code.google.com by Skaa...@gmail.com on 30 May 2011 at 7:05

GoogleCodeExporter commented 9 years ago

I see the problem slightly differently, but I guess I'm biased.  From what I 
can tell, the library works fine, or at least is conceptually fine and may have 
a bug here or there.  The command line tools on the other hand, don't work 
fine.  See http://code.google.com/p/mp4v2/issues/detail?id=98#c12 for where I 
left this.  If no one comments on whether my plan is a reasonable one, it'll 
take longer for me to dive in.  Marking this as a duplicate.

PS  Option 2 doesn't allow for long (> ~260 character) filename support.  
Option 1 can work fine if we ignore the argv passed to main and instead get the 
command line args from GetCommandLineW.

Original comment by dbyr...@gmail.com on 30 May 2011 at 7:13

Changed state: Duplicate

GoogleCodeExporter commented 9 years ago

I am aware about that #2 is limited to 260 characters and I am sure that users 
can live with that "limitation" as it applies to 99% of all Windows 
applications anyway. There is nothing to argue against about having 32K 
support, but when this support comes for only ASCII-7 strings, it's maybe not 
worth the effort. Anyway, I agree it's a matter of "point of view".

Please do not consider using "GetCommandLineW", it will lead to lot of work 
with quoted arguments because you will have to parse them by yourself then. If 
you want to go this path, then use at least "wmain(wchar_t *argv[])" – this 
is an alternate "main" function which is to be used by Unicode programs and 
which will give you all command line arguments as Unicode strings. It will also 
solve you all the hassle with parsing quoted, or even double quoted command 
line arguments.

Cheers.

Original comment by Skaa...@gmail.com on 30 May 2011 at 7:27

GoogleCodeExporter commented 9 years ago

I think CommandLineToArgvW resolves the quoting issues when dealing with 
GetCommandLineW.  I think it provides the same thing as wmain.  If not, I'll 
certainly look at wmain.

Original comment by dbyr...@gmail.com on 30 May 2011 at 7:47

GoogleCodeExporter commented 9 years ago

Just to clarify, at some point mp4v2 had _no_ utf support for any input 
arguments, in both the core library and command line tools.  Myself and another 
developer added UTF8 filename support to the core library, but we did not 
update the command line tools to support UTF encoding.  So I agree with dybyron 
on this one; the library works fine (or is at least conceptually fine), but the 
command line tools don't work well.

I think this is a perfectly good issue to track, but let's get a more specific 
bug: something along the lines of "Command Line tools do not support Unicode" 
so we're all on the same page.

Original comment by kid...@gmail.com on 30 May 2011 at 8:52

GoogleCodeExporter commented 9 years ago

Yes, I agree. If the Unicode (UTF-8) support was specified this way, then the 
concept of the library is fine, and it's my fault that I did not read up all 
the documentation and other issues and the replies about this subject.

I just wonder whether the functions in "Utf8ToFilename" and "Utf8DecodeChar" 
won't cause troubles. I can not proof that there is a flaw, nor will I try to 
do so. But from a general point of view, it's questionable to not use the 
Windows API functions for converting from/to Unicode. Even if the Windows API 
functions may have flaws or may be lacking support for some code points under 
certain version of Windows, they are though the functions which will get used 
by application code to create the UTF-8 encoded file names. So, to achieve the 
same "correctness" (or "flaws") for encoding and decoding and therefore the 
available support in the file system of the OS, I recommend to use the same 
function group for encoding and for decoding. This would mean, that if assuming 
that the application code will use "WideCharToMultiByte(CP_UTF8)" for encoding 
a filename, that the library is supposed to use "MultiByteToWideChar(CP_UTF8)". 
Another possibility would be that you expose a new "FilenameToUTF8" utility 
function which is then to be used by application code. This new 
"FilenameToUTF8" would be required to use exactly the same conversion 
functionality (code points, surrogates, etc.) as the "Utf8ToFilename". Either 
way, you could assume that no information gets lost/changed during the 
conversion.

If the Windows API gives a certain Unicode file name to the application, that 
Unicode file name may or may not contain 'flaws' which may be specific to 
certain version of Windows. But, if the application does not change the content 
of that Unicode string, the application can safely assume that it can pass that 
Unicode file name back to the OS and that it will get access to that file. Now, 
if that Unicode string gets encoded/decoded with third party functions, there 
is a chance that something gets changed in the representation of such an 
Unicode string which may lead to an 'invalid' string for certain Windows 
versions when it's eventually passed again to some Unicode file API.

Anyway, it might be more a theoretical issue. The existing code within the 
library will very most likely work for almost all Unicode file names, given 
that the library is used as it was thought from application code other than the 
provided command line tools.

Original comment by Skaa...@gmail.com on 31 May 2011 at 11:32

GoogleCodeExporter commented 9 years ago

Just to be clear, wmain() is a windows thing; that won't work on *nix or OSX:

http://stackoverflow.com/questions/2438297/can-we-use-wmain-functions-with-unix-
compilers-or-itll-work-only-on-windows

...so even if we went this route, we'd still have to #define out different 
stuff on Windows and nix, and stuff would be sort of messy in the command line 
tools.  See here for more info:

http://stackoverflow.com/questions/5408730/what-is-the-encoding-of-argv

Original comment by kid...@gmail.com on 24 Jun 2011 at 3:30

stephenyin / mp4v2

Broken Unicode support in Win32 File Provider. #103