UTF8 source not converted properly under Windows 10

frederickjeanguerin commented 8 years ago

Say we have the following source encoded in UTF-8:

pass
print("Café")

Then, under the Windows 10 command prompt, the tool running with obfuscation will generate the following UTF-8 file (T is a random variable name in this case):

pass
T=print
T("CafÃ©")

NB: The pass instruction on the first line is there only to alleviate another bug that happens on the first line.

TEMPORARY SOLUTION:

I was able to work around this encoding bug by using the windows-1252 encoding for the source file:

# coding: windows-1252
pass
print("Café")

Then the tool working in obfuscation mode (and all other modes I have tested) will get the correct UTF-8 output:

pass
k=print
k("Café")

NB. The bug does not go away by changing the code page to 65001 (utf8) for the DOS command prompt.
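A minimal sketch of why chcp may have no effect here, assuming the tool reads the source with a plain open() call and no explicit encoding: in that case Python takes the default from the locale (the ANSI code page on Windows), not from the console code page that chcp changes. The variable name below is only illustrative.

```python
import locale
import sys

# Encoding that open() falls back to when no encoding= argument is given.
# On many Western Windows installs this reports cp1252, whatever chcp says.
default_file_encoding = locale.getpreferredencoding(False)

print("open() default encoding:", default_file_encoding)
# The console encoding is a separate setting and does not affect file reads.
print("stdout encoding:", sys.stdout.encoding)
```

Running this before and after chcp 65001 should show whether the file default changes on a given machine.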

IMPRESSION:

It seems that the tool takes for granted that the encoding is Windows-1252 under DOS/Windows.
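The garbled output above can be reproduced without the tool at all. This is only a sketch of the suspected mechanism (UTF-8 bytes decoded as Windows-1252), not a trace of what the tool actually does internally:

```python
original = 'print("Café")'

# Bytes as they sit on disk when the source is saved as UTF-8.
utf8_bytes = original.encode("utf-8")

# What a reader sees if it decodes those bytes as Windows-1252:
# the two bytes of "é" (0xC3 0xA9) become the two characters "Ã" and "©".
misread = utf8_bytes.decode("cp1252")

assert misread == 'print("CafÃ©")'
print(ascii(misread))  # ASCII-safe view, so it prints on any console
```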

b3mb4m commented 8 years ago

Try code page 1250 (chcp 1250). This is probably not about Python, just classic Windows shit.

frederickjeanguerin commented 8 years ago

Thanks for the feedback. After trying, it seems that chcp 1250 is not making it work. It even triggers an exception, which is not the case with chcp 1252.

frederickjeanguerin commented 8 years ago

After a bit of searching, it appears that Python does indeed assume by default that all text files are 1252-encoded under Windows (which is a sad assumption, but that is another story). So it's not the minifier that makes that assumption. Strangely enough, however, the minifier seems to be blindly converting the 1252 text into UTF-8 somewhere in the process.

To arrive at that conclusion, I wrote the following simple Python script (readwrite.py), which reads a text file and writes its content to a new one:

import sys

# Copy a text file, using the platform default encoding for both reading and writing.
with open(sys.argv[1]) as file_in, open(sys.argv[2], "w") as file_out:
    file_out.write(file_in.read())

When I run that script under Windows on either a UTF-8 source or a 1252 source, it outputs the same file in the same format, which is exactly what should be expected. For the UTF-8 file, Python assumes the 1252 encoding, so each accented character is read as two separate characters, which are then written back as is, and the output is still perfectly valid UTF-8.

However, if I run the minifier as pyminifier -o out.py --nominify in_utf8.py, which simply copies the file, then the accented characters (in strings and/or comments) are not handled correctly (they get converted somewhere in the process).
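The difference between the two behaviours can be sketched on any platform by naming the encodings explicitly. copy_text is a hypothetical helper, and the mixed cp1252-in, UTF-8-out case is an assumption about what the minifier does, based on the analysis later in this thread:

```python
import os
import tempfile


def copy_text(src, dst, read_enc, write_enc):
    """Copy a text file, decoding with read_enc and encoding with write_enc."""
    # newline="" disables newline translation so bytes can be compared exactly.
    with open(src, encoding=read_enc, newline="") as file_in:
        with open(dst, "w", encoding=write_enc, newline="") as file_out:
            file_out.write(file_in.read())


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in_utf8.py")
    same = os.path.join(tmp, "same.py")
    mixed = os.path.join(tmp, "mixed.py")
    with open(src, "wb") as f:
        f.write('print("Café")\n'.encode("utf-8"))

    copy_text(src, same, "cp1252", "cp1252")   # like readwrite.py on Windows
    copy_text(src, mixed, "cp1252", "utf-8")   # assumed minifier behaviour

    with open(src, "rb") as f:
        src_bytes = f.read()
    with open(same, "rb") as f:
        same_bytes = f.read()
    with open(mixed, "rb") as f:
        mixed_bytes = f.read()

print("same encoding both ways keeps the bytes:", same_bytes == src_bytes)
print("mixed encodings keep the bytes:", mixed_bytes == src_bytes)
```

Note that the byte-preserving round trip is not guaranteed for every UTF-8 file: a few byte values (0x81, for instance) are undefined in Python's cp1252 codec, so some characters would raise a UnicodeDecodeError instead of passing through.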

Here is the utf-8 source file to be minified:

# coding: utf-8
pass
print("Café") # Café

Here is the (utf8) content of the output of pyminifier -o out.py --nominify in_utf8.py:

# coding: utf-8
pass
print("CafÃ©") # CafÃ©
# Created by pyminifier (https://github.com/liftoff/pyminifier)

And the 1252 view of the same output file:

# coding: utf-8
pass
print("CafÃƒÂ©") # CafÃƒÂ©
# Created by pyminifier (https://github.com/liftoff/pyminifier)

Conclusion: The original "é" has been transformed by pyminifier.
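Both views of the output file shown above are consistent with a single extra decode-as-1252, encode-as-UTF-8 step. The sketch below assumes that is what happened; it also shows that a file damaged this way can usually be repaired by reversing the step:

```python
text = "Café"

# Assumed content of out.py: the UTF-8 bytes of the source were decoded
# as cp1252 and the result was encoded as UTF-8 again.
written = text.encode("utf-8").decode("cp1252").encode("utf-8")

as_utf8 = written.decode("utf-8")      # the "utf8 content" view
as_cp1252 = written.decode("cp1252")   # the "1252 view" of the same bytes

assert as_utf8 == "CafÃ©"
assert as_cp1252 == "CafÃƒÂ©"

# Reversing the extra step recovers the original text.
repaired = as_utf8.encode("cp1252").decode("utf-8")
assert repaired == text
```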

rrfaria commented 4 years ago

same issue

kennethtang4 commented 3 years ago

Same issue here. Note that the following method is a quick fix. Personally, I think the correct solution would be to allow an encoding parameter to be passed, similar to the open() function.

TL;DR

Modify the code in "site-packages/pyminifier/pyminifier.py", line 393:

source = open(args[0]).read() -> source = open(args[0], encoding='utf-8').read()

----------------------------------------------------------------------------------------------------

In the file "site-packages/pyminifier/pyminifier.py", line 393, the source file is opened with the command:

source = open(args[0]).read()

which uses the system default encoding. Interestingly, on line 423 the output file is opened with the command:

f = open(options.outfile, 'w', encoding='utf-8')

which uses UTF-8 to write the file. The issue is caused simply by reading the file with the system default encoding (which is not 'utf-8') and writing it with 'utf-8'.

Though given that the repository was last updated 5 years ago, I doubt this will ever be fixed.
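A slightly more general variant of the quick fix could look like the sketch below. read_source is a hypothetical helper, not part of pyminifier: it accepts an optional encoding, and otherwise relies on the standard library's tokenize.open(), which honours a "# coding:" line and falls back to UTF-8 when there is none.

```python
import os
import tempfile
import tokenize


def read_source(path, encoding=None):
    """Read Python source text (hypothetical helper, not pyminifier API)."""
    if encoding is not None:
        # Caller knows better: use the requested encoding as given.
        with open(path, encoding=encoding) as f:
            return f.read()
    # Detect the encoding from a BOM or coding cookie, defaulting to UTF-8.
    with tokenize.open(path) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "in_utf8.py")
    cp1252_path = os.path.join(tmp, "in_1252.py")
    with open(utf8_path, "wb") as f:
        f.write('print("Café")\n'.encode("utf-8"))
    with open(cp1252_path, "wb") as f:
        f.write('# coding: windows-1252\nprint("Café")\n'.encode("cp1252"))

    from_utf8 = read_source(utf8_path)
    from_cp1252 = read_source(cp1252_path)

# Both files decode to the same accented text, regardless of the platform.
assert 'print("Café")' in from_utf8
assert 'print("Café")' in from_cp1252
```

With the text decoded correctly on input, writing it out with encoding='utf-8' as line 423 already does would no longer double-encode it.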

liftoff / pyminifier

UTF8 source not converted properly under Windows 10 #72