uber / neuropod

A uniform interface to run deep learning models from multiple frameworks
https://neuropod.ai
Apache License 2.0
936 stars 77 forks source link

Fix Unicode handling in Java bindings #533

Closed VivekPanyam closed 2 years ago

VivekPanyam commented 2 years ago

Summary:

This PR fixes how the Java bindings deal with Unicode.

Java's Unicode handling methods in JNI (e.g. GetStringUTFChars, NewStringUTF) deal with "modified UTF-8" encoding.

According to most descriptions I found online, UTF-8 and modified UTF-8 are quite similar with one of the main differences being how they deal with embedded null bytes:

Unfortunately, this does not mean that UTF-8 and modified UTF-8 are interoperable in cases without embedded null bytes. In fact, the encodings can be quite different and have different lengths.

This causes problems when doing something like this:

Java -> Neuropod Java bindings -> C++ -> Neuropod Python Model

because the JNI string methods in the Java bindings will encode strings in "modified UTF-8" and Numpy on the python side will try to decode them as UTF-8.

To fix this issue, I changed the Java bindings to explicitly do a UTF-8 conversion in C++.

Java internally represents strings in UTF-16 so the bindings now get that representation and explicitly convert to UTF-8 (and do the reverse when going from C++ to Java).

Test Plan:

codecov[bot] commented 2 years ago

Codecov Report

Merging #533 (cde3f5c) into master (c4ed8c3) will increase coverage by 0.00%. The diff coverage is 90.47%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #533   +/-   ##
=======================================
  Coverage   86.58%   86.58%           
=======================================
  Files         104      104           
  Lines        6940     6955   +15     
=======================================
+ Hits         6009     6022   +13     
- Misses        931      933    +2     
Impacted Files Coverage Δ
...rce/neuropod/bindings/java/src/main/native/utils.h 100.00% <ø> (ø)
...thon_bridge/_neuropod_native_bootstrap/executor.py 90.14% <50.00%> (-2.51%) :arrow_down:
...rc/main/native/com_uber_neuropod_NeuropodTensor.cc 72.26% <100.00%> (ø)
...ce/neuropod/bindings/java/src/main/native/utils.cc 72.30% <100.00%> (+6.26%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c4ed8c3...cde3f5c. Read the comment docs.