openvenues / jpostal

Java/JNI bindings to libpostal for for fast international street address parsing/normalization
MIT License
106 stars 44 forks source link

Fix modified utf-8 issues #38

Open wboult opened 4 years ago

wboult commented 4 years ago

I recently bumped into JNI's usage of modified UTF-8, which was causing the parser / expander to error and hang when I passed in any address which contained \0, \u0000 or any 4 byte UTF-8 character.

I did a bit of reading up on the way to work around the issue (this blog was quite instructive: http://banachowski.com/deprogramming/2012/02/working-around-jni-utf-8-strings/), and had a crack at making a fix to avoid the usage of NewStringUTF and GetStringUTFChars and instead passing jbyteArray into and out of the JNI code.

I've added some new tests which were failing before the change, and are now passing for me. This was the first time going anywhere near C code for me so would be really appreciative if someone could check I'm not doing anything really stupid or dangerous e.g. not releasing memory (I can see this project has been dormant for a while so will understand if that's not possible!).

This should fix https://github.com/openvenues/jpostal/issues/36

And also hopefully is the more general solution you referenced in this other pull request: https://github.com/openvenues/jpostal/pull/22

Noteworthy things:

  1. If you pass in a \0 or \u0000 character in the middle of your address string the rest of the string will be truncated (I assume this is because C uses NUL terminated char arrays so it just stops at a NUL char). Removing these from the middle of an address felt like something that should be done as an upfront step by the user rather than in jpostal

  2. There are still usages of GetStringUTFChars and NewStringUTF remaining, but they are for strings which should shouldn't contain problematic characters, e.g. languages and other parser options. I've only changed how the address input string is handled

wboult commented 4 years ago

@albarrentine build passes on JDK 8 but not on JDK 7 due to an issue with protocol being used to download gradle wrapper, I tried upgrading to the latest version of gradle but it didn't help