Closed VivekPanyam closed 2 years ago
Merging #533 (cde3f5c) into master (c4ed8c3) will increase coverage by
0.00%
. The diff coverage is90.47%
.
@@ Coverage Diff @@
## master #533 +/- ##
=======================================
Coverage 86.58% 86.58%
=======================================
Files 104 104
Lines 6940 6955 +15
=======================================
+ Hits 6009 6022 +13
- Misses 931 933 +2
Impacted Files | Coverage Δ | |
---|---|---|
...rce/neuropod/bindings/java/src/main/native/utils.h | 100.00% <ø> (ø) |
|
...thon_bridge/_neuropod_native_bootstrap/executor.py | 90.14% <50.00%> (-2.51%) |
:arrow_down: |
...rc/main/native/com_uber_neuropod_NeuropodTensor.cc | 72.26% <100.00%> (ø) |
|
...ce/neuropod/bindings/java/src/main/native/utils.cc | 72.30% <100.00%> (+6.26%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update c4ed8c3...cde3f5c. Read the comment docs.
Summary:
This PR fixes how the Java bindings deal with Unicode.
Java's Unicode handling methods in JNI (e.g.
GetStringUTFChars
,NewStringUTF
) deal with "modified UTF-8" encoding.According to most descriptions I found online, UTF-8 and modified UTF-8 are quite similar with one of the main differences being how they deal with embedded null bytes:
Unfortunately, this does not mean that UTF-8 and modified UTF-8 are interoperable in cases without embedded null bytes. In fact, the encodings can be quite different and have different lengths.
This causes problems when doing something like this:
because the JNI string methods in the Java bindings will encode strings in "modified UTF-8" and Numpy on the python side will try to decode them as UTF-8.
To fix this issue, I changed the Java bindings to explicitly do a UTF-8 conversion in C++.
Java internally represents strings in UTF-16 so the bindings now get that representation and explicitly convert to UTF-8 (and do the reverse when going from C++ to Java).
Test Plan: