Open romseygeek opened 7 years ago
Thanks for the PR, but this is actually a slightly more general problem. I wasn't aware of this, but apparently JNI uses a "modified UTF-8" (b/c Java needs to reinvent all of the wheels...) instead of the standard UTF-8 that libpostal requires. In addition to encoding U+0000 as the two-byte sequence 0xC0 0x80 rather than a single NUL byte, it encodes each character that takes 4 bytes in standard UTF-8 as a surrogate pair of two 3-byte sequences, and I'm not sure utf8proc, our decoder, would handle that correctly.
So I think I'd prefer to convert the strings to real UTF-8 byte arrays and just NUL-terminate them at the C/JNI level.
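A rough sketch of the Java half of that idea, assuming the native side would then read the array with something like `GetByteArrayRegion` instead of `GetStringUTFChars`. The class and method names here (`Utf8Bytes`, `toNulTerminatedUtf8`) are made up for illustration and are not part of jpostal. The terminator is appended on the Java side only to keep the example self-contained; it could just as well be added in C.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not an existing jpostal class.
final class Utf8Bytes {
    private Utf8Bytes() {}

    /** Encodes s as standard (not modified) UTF-8 and appends one 0x00 byte. */
    static byte[] toNulTerminatedUtf8(String s) {
        // String.getBytes with StandardCharsets.UTF_8 emits real 4-byte
        // sequences for supplementary characters, unlike JNI's modified UTF-8.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Arrays.copyOf zero-fills the extra slot, which acts as the terminator.
        return Arrays.copyOf(utf8, utf8.length + 1);
    }
}
```

Note that a U+0000 inside the string still comes out as a single 0x00 byte in standard UTF-8, so C code that treats the buffer as a NUL-terminated string would see it truncated at that point. Rejecting or stripping such input beforehand would still be needed.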
If an input string contains a NUL character, the JNI string->char* conversion gets confused and libpostal hangs in the native method call. This PR adds defensive checks to AddressParser and AddressExpander to prevent this.
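The diff itself is not reproduced here, so the following is only a guess at the shape such a guard might take, not the code in this PR. `NulCheck` and `requireNoNul` are hypothetical names, and throwing `IllegalArgumentException` is one possible policy (stripping the character would be another).

```java
// Hypothetical guard, not the actual check added by this PR.
final class NulCheck {
    private NulCheck() {}

    /** Returns s unchanged, or throws if it is null or contains U+0000. */
    static String requireNoNul(String s) {
        if (s == null) {
            throw new NullPointerException("input must not be null");
        }
        int pos = s.indexOf('\0');
        if (pos >= 0) {
            // Fail fast in Java rather than letting the native call hang.
            throw new IllegalArgumentException(
                "input contains a NUL character at index " + pos);
        }
        return s;
    }
}
```

A caller would run the input through `NulCheck.requireNoNul(address)` before handing it to the native parse or expand method.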