3rd-party JARs without utf-8 bit flag will not behave properly

Eric-Arellano commented 5 years ago

ZIP files can either use UTF-8 or CP437 encoding. To use UTF-8, the 11th bit flag must be specified as defined in section 4.4.4 of the official ZIP spec https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT.

Python 3's Zipfile looks for this bit to determine the encoding to use (https://github.com/python/cpython/blob/master/Lib/zipfile.py#L1507), and defaults to CP437 if the bit is not set.

--

Because JAR files are an extension of ZIP files, this failure to set the flag bit is hitting us with Python 3.

Running ./pants3 classmap testprojects/src/java/org/pantsbuild/testproject/unicode/cucumber | grep 'cucumber.api.java.zh_cn' will return unexpected results. While it succeeds, the names are manged:

$ ./pants3 classmap testprojects/src/java/org/pantsbuild/testproject/unicode/cucumber | grep 'cucumber.api.java.zh_cn'

cucumber.api.java.zh_cn.Σ╜åµÿ» 3rdparty:cucumber-java
cucumber.api.java.zh_cn.σüçσªé 3rdparty:cucumber-java
cucumber.api.java.zh_cn.σüçσ«Ü 3rdparty:cucumber-java
cucumber.api.java.zh_cn.σüçΦ«╛ 3rdparty:cucumber-java
cucumber.api.java.zh_cn.σÉîµù╢ 3rdparty:cucumber-java
cucumber.api.java.zh_cn.σ╣╢Σ╕ö 3rdparty:cucumber-java
cucumber.api.java.zh_cn.σ╜ô 3rdparty:cucumber-java
cucumber.api.java.zh_cn.ΦÇîΣ╕ö 3rdparty:cucumber-java
cucumber.api.java.zh_cn.ΘéúΣ╣ê 3rdparty:cucumber-java

We expect how Py2 properly handles the case:

$ ./pants classmap testprojects/src/java/org/pantsbuild/testproject/unicode/cucumber | grep 'cucumber.api.java.zh_cn'

cucumber.api.java.zh_cn.但是 3rdparty:cucumber-java
cucumber.api.java.zh_cn.假如 3rdparty:cucumber-java
cucumber.api.java.zh_cn.假定 3rdparty:cucumber-java
cucumber.api.java.zh_cn.假设 3rdparty:cucumber-java
cucumber.api.java.zh_cn.同时 3rdparty:cucumber-java
cucumber.api.java.zh_cn.并且 3rdparty:cucumber-java
cucumber.api.java.zh_cn.当 3rdparty:cucumber-java
cucumber.api.java.zh_cn.而且 3rdparty:cucumber-java
cucumber.api.java.zh_cn.那么 3rdparty:cucumber-java

--

In our code, this issue can be reproduced by applying this diff:

diff --git a/src/python/pants/backend/jvm/tasks/classpath_util.py b/src/python/pants/backend/jvm/tasks/classpath_util.py
index cf7d0a4dd..754e347f4 100644
--- a/src/python/pants/backend/jvm/tasks/classpath_util.py
+++ b/src/python/pants/backend/jvm/tasks/classpath_util.py
@@ -178,6 +178,8 @@ class ClasspathUtil(object):
         # Walk the jar namelist.
         with open_zip(entry, mode='r') as jar:
           for name in jar.namelist():
+            if 'cucumber/api/java/zh_cn/' in name and '.class' in name:
+              import pdb; pdb.set_trace()
             yield ensure_text(name)
       elif os.path.isdir(entry):
         # Walk the directory, including subdirs.

and running ./pants3 classmap testprojects/src/java/org/pantsbuild/testproject/unicode/cucumber | grep 'cucumber.api.java.zh_cn'.

The issue comes from context_util.open_zip(), and its call to zipfile.ZipFile(). Note that there is no flag we can pass to ZipFile to make this work. https://stackoverflow.com/questions/41019624/python-zipfile-module-cant-extract-filenames-with-chinese-characters suggests some fixes, including trying to monkey patch zipfile and rewriting the file names beforehand. None of these seem acceptable.

--

Instead, I think the solution requires us to modify jar_create.py and jar_task.py to set the 11th bit flag to signal UTF-8 mode.

https://github.com/pantsbuild/pants/pull/4136 suggests that we would want this flag to be enabled 100% of the time, i.e. that we always want to use UTF-8 and never CP437.

I do not know how to go about setting this bit flag, so any hints appreciated.

--

Once this is fixed, we can remove tests/python/pants_test/backend/jvm/tasks:classmap_integration from build-support/known_py3_pex_failures.txt.

jsirois commented 5 years ago

Instead, I think the solution requires us to modify jar_create.py and jar_task.py to set the 11th bit flag to signal UTF-8 mode.

That would only be a partial solution. There is a wide world of jars we don't create but do consume.

Eric-Arellano commented 5 years ago

There is a wide world of jars we don't create but do consume.

@jsirois for JARs we consume, we don't know yet if it's an issue that they are not properly setting the UTF-8 flag. Many of them may be setting it properly, many of them I would assume are not.

Until we discover it's a problem that upstream JARs are not properly setting the flag and that their failure to do so is negatively impacting the experience for Pants consumers, I vote we focus on fixing our own JARs we create and ignore that greater problem.

Regardless of upstream JARs, we should fix the problem that our own JARs do not have the proper bit flag set.

jsirois commented 5 years ago

Ah, I read 3rdparty:cucumber-java as foreign. I should have read deeper. Sounds good.

OniOni commented 5 years ago

Happy to take a look at this one!

OniOni commented 5 years ago

So I took a deeper look at this one.

1/ So as far as I can tell, when pants creates a jar, it does the correct thing and sets the utf8 flag for filenames (I verified this both by walking through the code and inspecting a jar on my local machine with zipdetails). I can expand on this if needed.

2/ I also traced the behavior for the example you gave @Eric-Arellano (/pants3 classmap testprojects/src/java/org/pantsbuild/testproject/unicode/cucumber to produce the broken encoding output. Sadly the bad jar is cucumber-java-1.1.7.jar and it is a 3rd party jar. And sadly, it does not set the utf8 flag.

$ zipdetails .pants.d/resolve/ivy/7345c2790d72/ivy/jars/info.cukes/cucumber-java/jars/cucumber-java-1.1.7.jar
[snip]
00E4B LOCAL HEADER #40      04034B50
00E4F Extract Zip Spec      0A '1.0'
00E50 Extract OS            00 'MS-DOS'
00E51 General Purpose Flag  0000
      [Bits 1-2]            0 'Normal Compression'
00E53 Compression Method    0008 'Deflated'
00E55 Last Mod Time         44B36A2D 'Mon May 19 13:17:26 2014'
00E59 CRC                   60ED1864
00E5D Compressed Length     00000145
00E61 Uncompressed Length   00000213
00E65 Filename Length       0023
00E67 Extra Length          0000
00E69 Filename              'cucumber/api/java/ar/<D8><A7><D8><B0><D8><A7><D9><8B>.class'
00E8C PAYLOAD
[snip]

Compare to (annotated) zipdetail output for jar-tool:

239A9 General Purpose Flag  0800
      [Bit 11]              1 'Language Encoding'  <---------------- This is what we want
239AB Compression Method    0000 'Stored'

Going to spend some time figuring out how we want to deal with this, also why this was not an issue for python 2.

OniOni commented 5 years ago

Going to continue thinking about this one.

At this point, my thinking is that the "broken" behavior from python3 seems to be the more correct behavior. I'm not sure we want to start guessing if the encoding is correct or not if people are not creating jars properly.

What do you think @Eric-Arellano, @jsirois ?

Eric-Arellano commented 5 years ago

Thank you @OniOni for finding this all. That's extremely helpful!

At this point, my thinking is that the "broken" behavior from python3 seems to be the more correct behavior.

Agreed, Python 3 is complying with the Zip spec in assuming CP437 if the UTF-8 flag is missing. These 10 lines show how Py2 seems to handle encoding: https://github.com/python/cpython/blob/2.7/Lib/zipfile.py#L391. I never see a reference to CP437 when searching the page, unlike Py3.

--

I'm not sure we want to start guessing if the encoding is correct or not if people are not creating jars properly.

I agree with you that we do not want to guess the encoding, as that would violate the principle of least surprise and could lead to unforeseen issues. What if they genuinely wanted CP437? It's worth noting the code does not error out, as it still decodes, just the results are not correct.

--

I found just now that bumping Cucumber to the newest version (from 1.1.7 and 1.2.4 to 1.2.5) fixes the issue! @OniOni is going to clean up the diff and submit a PR to fix this failing test case.

My proposal is for him to fix this failing test case, to add a N.B. explaining this quirk and linking to this issue, and to not change how we encode things / let Zipfile act as it is correctly behaving now.

Eric-Arellano commented 5 years ago

Addressed by https://github.com/pantsbuild/pants/pull/7134.

pantsbuild / pants

3rd-party JARs without utf-8 bit flag will not behave properly #7123