openvenues / jpostal

Java/JNI bindings to libpostal for fast international street address parsing/normalization
MIT License

Add defensive checks for NUL bytes in input strings #22

Open romseygeek opened 7 years ago

romseygeek commented 7 years ago

If an input string contains a NUL byte, then the JNI string->char* conversion will get confused, and libpostal hangs in the native method call. This PR adds defensive checks to AddressParser and AddressExpander to prevent this.
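A defensive check of this kind could look roughly like the sketch below. The class and method names (`NulCheck`, `requireNoNul`) are hypothetical and are not taken from the PR or from jpostal's API, and throwing `IllegalArgumentException` is an assumption about how the input might be rejected.

```java
// Hypothetical helper illustrating a defensive NUL check; not jpostal's actual code.
final class NulCheck {
    private NulCheck() {}

    // Returns the input unchanged, or throws if it contains a NUL character (U+0000).
    static String requireNoNul(String input) {
        if (input == null) {
            throw new NullPointerException("input must not be null");
        }
        int idx = input.indexOf('\0');
        if (idx >= 0) {
            throw new IllegalArgumentException(
                "input contains a NUL character at index " + idx);
        }
        return input;
    }

    public static void main(String[] args) {
        System.out.println(requireNoNul("781 Franklin Ave"));
        try {
            requireNoNul("781\0Franklin Ave");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A caller would run the check on each string before handing it to the native method, so a bad input fails fast in Java instead of hanging in native code.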

albarrentine commented 7 years ago

Thanks for the PR, but this is actually a slightly more general problem. I wasn't aware of this, but apparently JNI uses a "modified UTF-8" (b/c Java needs to reinvent all of the wheels...) instead of the standard UTF-8 that libpostal requires. In addition to encoding the NUL character as the two-byte sequence 0xC0 0x80, it converts each 4-byte UTF-8 sequence into a surrogate pair encoded as two 3-byte sequences, and I'm not sure utf8proc, our decoder, would handle those correctly.
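The difference between the two encodings can be observed from plain Java, as in the sketch below. It relies on `DataOutputStream.writeUTF`, which writes a 2-byte length prefix followed by the string in Java's modified UTF-8. The class name `ModifiedUtf8Demo` and its helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: compares standard UTF-8 with Java's modified UTF-8.
final class ModifiedUtf8Demo {
    private ModifiedUtf8Demo() {}

    // Encodes s as modified UTF-8 by stripping the 2-byte length prefix
    // that DataOutputStream.writeUTF puts in front of the encoded bytes.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buf.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Encodes s as standard UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\0";               // U+0000
        String emoji = "\uD83D\uDE00";   // U+1F600, outside the Basic Multilingual Plane
        System.out.println("NUL standard:   " + standardUtf8(nul).length + " byte(s)");
        System.out.println("NUL modified:   " + modifiedUtf8(nul).length + " byte(s)");
        System.out.println("emoji standard: " + standardUtf8(emoji).length + " byte(s)");
        System.out.println("emoji modified: " + modifiedUtf8(emoji).length + " byte(s)");
    }
}
```

For ASCII text without NUL the two encodings produce identical bytes, which is presumably why the mismatch goes unnoticed for most addresses.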

So I think I'd prefer to convert the strings to real UTF-8 byte arrays and just NUL-terminate them at the C/JNI level.
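A minimal sketch of the Java half of that idea follows. The name `toNulTerminatedUtf8` is hypothetical, and for the sake of a self-contained example it appends the terminator on the Java side, whereas the comment above proposes doing that step at the C/JNI level. How the resulting `byte[]` would be passed to the native method is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; not jpostal's actual implementation.
final class Utf8Bytes {
    private Utf8Bytes() {}

    // Encodes s as standard UTF-8 and appends a single 0x00 terminator byte.
    static byte[] toNulTerminatedUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Arrays.copyOf pads the extra slot with zero, which serves as the terminator.
        return Arrays.copyOf(utf8, utf8.length + 1);
    }

    public static void main(String[] args) {
        byte[] bytes = toNulTerminatedUtf8("Stra\u00DFe 1");
        System.out.println(Arrays.toString(bytes));
    }
}
```

Note that a NUL character inside the input still encodes to a 0x00 byte in standard UTF-8, so native code that treats the buffer as a C string would see it truncated at that point. Passing the length explicitly or rejecting such input would still be needed.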