Summary:

This PR fixes how the Java bindings deal with Unicode.

Java's Unicode handling methods in JNI (e.g. GetStringUTFChars, NewStringUTF) deal with "modified UTF-8" encoding.

According to most descriptions I found online, UTF-8 and modified UTF-8 are quite similar with one of the main differences being how they deal with embedded null bytes:

Unfortunately, this does not mean that UTF-8 and modified UTF-8 are interoperable in cases without embedded null bytes. In fact, the encodings can be quite different and have different lengths.

This causes problems when doing something like this:

Java -> Neuropod Java bindings -> C++ -> Neuropod Python Model

because the JNI string methods in the Java bindings will encode strings in "modified UTF-8" and Numpy on the python side will try to decode them as UTF-8.

To fix this issue, I changed the Java bindings to explicitly do a UTF-8 conversion in C++.

Java internally represents strings in UTF-16 so the bindings now get that representation and explicitly convert to UTF-8 (and do the reverse when going from C++ to Java).

Test Plan:

Changed the Java string model tests to include Unicode characters
Added a test to run a Python Neuropod model from Java (failed before implementing the fix, succeeded after the fix)

Codecov Report

Merging #533 (cde3f5c) into master (c4ed8c3) will increase coverage by 0.00%. The diff coverage is 90.47%.

@@           Coverage Diff           @@
##           master     #533   +/-   ##
=======================================
  Coverage   86.58%   86.58%           
=======================================
  Files         104      104           
  Lines        6940     6955   +15     
=======================================
+ Hits         6009     6022   +13     
- Misses        931      933    +2

Impacted Files	Coverage Δ
...rce/neuropod/bindings/java/src/main/native/utils.h	`100.00% <ø> (ø)`
...thon_bridge/_neuropod_native_bootstrap/executor.py	`90.14% <50.00%> (-2.51%)`	:arrow_down:
...rc/main/native/com_uber_neuropod_NeuropodTensor.cc	`72.26% <100.00%> (ø)`
...ce/neuropod/bindings/java/src/main/native/utils.cc	`72.30% <100.00%> (+6.26%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c4ed8c3...cde3f5c. Read the comment docs.

uber / neuropod

Fix Unicode handling in Java bindings #533

Summary:

Test Plan:

Codecov Report