Closed. Eric-Arellano closed this issue 5 years ago.
> Instead, I think the solution requires us to modify `jar_create.py` and `jar_task.py` to set the 11th bit flag to signal UTF-8 mode.
That would only be a partial solution. There is a wide world of jars we don't create but do consume.
> There is a wide world of jars we don't create but do consume.
@jsirois for JARs we consume, we don't know yet if it's an issue that they are not properly setting the UTF-8 flag. Many of them may be setting it properly, many of them I would assume are not.
Until we discover it's a problem that upstream JARs are not properly setting the flag and that their failure to do so is negatively impacting the experience for Pants consumers, I vote we focus on fixing our own JARs we create and ignore that greater problem.
Regardless of upstream JARs, we should fix the problem that our own JARs do not have the proper bit flag set.
Ah, I read `3rdparty:cucumber-java` as foreign. I should have read deeper. Sounds good.
Happy to take a look at this one!
So I took a deeper look at this one.
1/ As far as I can tell, when Pants creates a jar, it does the correct thing and sets the utf8 flag for filenames (I verified this both by walking through the code and by inspecting a jar on my local machine with `zipdetails`). I can expand on this if needed.
2/ I also traced the behavior for the example you gave, @Eric-Arellano (`./pants3 classmap testprojects/src/java/org/pantsbuild/testproject/unicode/cucumber`), to produce the broken encoding output. Sadly, the bad jar is `cucumber-java-1.1.7.jar`, which is a 3rd party jar. And sadly, it does not set the utf8 flag.
$ zipdetails .pants.d/resolve/ivy/7345c2790d72/ivy/jars/info.cukes/cucumber-java/jars/cucumber-java-1.1.7.jar
[snip]
00E4B LOCAL HEADER #40 04034B50
00E4F Extract Zip Spec 0A '1.0'
00E50 Extract OS 00 'MS-DOS'
00E51 General Purpose Flag 0000
[Bits 1-2] 0 'Normal Compression'
00E53 Compression Method 0008 'Deflated'
00E55 Last Mod Time 44B36A2D 'Mon May 19 13:17:26 2014'
00E59 CRC 60ED1864
00E5D Compressed Length 00000145
00E61 Uncompressed Length 00000213
00E65 Filename Length 0023
00E67 Extra Length 0000
00E69 Filename 'cucumber/api/java/ar/<D8><A7><D8><B0><D8><A7><D9><8B>.class'
00E8C PAYLOAD
[snip]
Compare to the (annotated) `zipdetails` output for `jar-tool`:
239A9 General Purpose Flag 0800
[Bit 11] 1 'Language Encoding' <---------------- This is what we want
239AB Compression Method 0000 'Stored'
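The same check can be scripted instead of reading `zipdetails` output by eye. The following is a minimal sketch, not Pants code: `utf8_flag_report` and `build_sample_jar` are hypothetical helper names, and the `Given.class` entry is an illustrative placeholder. It relies only on the standard library's `zipfile.ZipInfo.flag_bits` attribute and tests bit 11 (`0x800`) for each entry.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag ("Language Encoding")


def utf8_flag_report(archive):
    """Map each entry name to True if its UTF-8 flag (bit 11) is set.

    `archive` may be a path or a seekable file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in zf.infolist()
        }


def build_sample_jar():
    """Build a tiny in-memory archive (hypothetical contents) for illustration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # ASCII-only name: Python's zipfile leaves bit 11 clear.
        zf.writestr("cucumber/api/java/en/Given.class", b"")
        # Non-ASCII name: Python's zipfile encodes it as UTF-8 and sets bit 11.
        zf.writestr("cucumber/api/java/ar/\u0627\u0630\u0627\u064b.class", b"")
    buf.seek(0)
    return buf
```

Calling `utf8_flag_report` on the path to a resolved 3rd party jar should show which entries, if any, carry the flag. Note that an entry with a pure ASCII name does not need the flag, so a `False` value is only a problem when the name contains non-ASCII bytes.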
Going to spend some time figuring out how we want to deal with this, and also why this was not an issue for Python 2.
Going to continue thinking about this one.
At this point, my thinking is that the "broken" behavior from python3 seems to be the more correct behavior. I'm not sure we want to start guessing if the encoding is correct or not if people are not creating jars properly.
What do you think @Eric-Arellano, @jsirois ?
Thank you @OniOni for finding this all. That's extremely helpful!
> At this point, my thinking is that the "broken" behavior from python3 seems to be the more correct behavior.
Agreed, Python 3 is complying with the Zip spec in assuming CP437 if the UTF-8 flag is missing. These 10 lines show how Py2 seems to handle encoding: https://github.com/python/cpython/blob/2.7/Lib/zipfile.py#L391. I never see a reference to CP437 when searching the page, unlike Py3.
--
> I'm not sure we want to start guessing if the encoding is correct or not if people are not creating jars properly.
I agree with you that we do not want to guess the encoding, as that would violate the principle of least surprise and could lead to unforeseen issues. What if they genuinely wanted CP437? It's worth noting the code does not error out: the name still decodes, but the result is not correct.
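To illustrate why nothing errors out, here is a small sketch of the decoding rule described above. `decode_name` is a hypothetical helper, not the actual `zipfile` internals; the raw bytes are copied from the `zipdetails` output for `cucumber-java-1.1.7.jar` earlier in this thread. CP437 assigns a character to every one of the 256 byte values, so decoding UTF-8 bytes as CP437 always succeeds and simply yields different characters.

```python
# Raw filename bytes as shown by zipdetails for cucumber-java-1.1.7.jar.
RAW_NAME = b"cucumber/api/java/ar/\xd8\xa7\xd8\xb0\xd8\xa7\xd9\x8b.class"

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag


def decode_name(raw: bytes, flag_bits: int) -> str:
    """Decode an entry name per the ZIP spec: UTF-8 if bit 11 is set, else CP437."""
    return raw.decode("utf-8" if flag_bits & UTF8_FLAG else "cp437")


# With the flag set, each 2-byte UTF-8 sequence becomes one Arabic character.
intended = decode_name(RAW_NAME, UTF8_FLAG)

# With the flag clear (as in the 3rd party jar), every byte becomes its own
# CP437 character, so the name is longer and unreadable, but no exception occurs.
mangled = decode_name(RAW_NAME, 0)
```

Here `intended` and `mangled` differ only in the eight non-ASCII bytes of the class name, which is consistent with the mangled `classmap` output being otherwise well-formed.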
--
I found just now that bumping Cucumber to the newest version (from 1.1.7 and 1.2.4 to 1.2.5) fixes the issue! @OniOni is going to clean up the diff and submit a PR to fix this failing test case.
My proposal is for him to fix this failing test case, to add a N.B. explaining this quirk and linking to this issue, and to not change how we encode things / let `ZipFile` act as it is correctly behaving now.
Addressed by https://github.com/pantsbuild/pants/pull/7134.
ZIP files can encode file names as either UTF-8 or CP437. To use UTF-8, bit 11 of the general purpose bit flag must be set, as defined in section 4.4.4 of the official ZIP spec: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
Python 3's `ZipFile` looks for this bit to determine the encoding to use (https://github.com/python/cpython/blob/master/Lib/zipfile.py#L1507), and defaults to CP437 if the bit is not set.
--
Because JAR files are an extension of ZIP files, this failure to set the flag bit is hitting us with Python 3.
Running `./pants3 classmap testprojects/src/java/org/pantsbuild/testproject/unicode/cucumber | grep 'cucumber.api.java.zh_cn'` will return unexpected results. While it succeeds, the names are mangled:
We expect the output that Py2 produces, since it handles this case properly:
--
In our code, this issue can be reproduced by applying this diff:
and running `./pants3 classmap testprojects/src/java/org/pantsbuild/testproject/unicode/cucumber | grep 'cucumber.api.java.zh_cn'`.
The issue comes from `context_util.open_zip()` and its call to `zipfile.ZipFile()`. Note that there is no flag we can pass to `ZipFile` to make this work. https://stackoverflow.com/questions/41019624/python-zipfile-module-cant-extract-filenames-with-chinese-characters suggests some fixes, including trying to monkey patch `zipfile` and rewriting the file names beforehand. None of these seem acceptable.
--
Instead, I think the solution requires us to modify `jar_create.py` and `jar_task.py` to set the 11th bit flag to signal UTF-8 mode. https://github.com/pantsbuild/pants/pull/4136 suggests that we would want this flag to be enabled 100% of the time, i.e. that we always want to use UTF-8 and never CP437.
I do not know how to go about setting this bit flag, so any hints appreciated.
--
Once this is fixed, we can remove `tests/python/pants_test/backend/jvm/tasks:classmap_integration` from `build-support/known_py3_pex_failures.txt`.