Broken null-terminated multibyte string handling when JVM is non-US `defaultCharset`

grossws commented 7 years ago

When JVM is started with single-byte default charset string read from native lib are handled incorrectly which results to reading text section (in case of static strings in binary) untill next double NUL-byte is met.

Bug found during trying to show issues with implicit use of defaultEncoding in String#getBytes() in kalium to make PR with forcing use of str.getBytes(StandardCharset.UTF_8) or something similar. And it broke almost all tests instead of one because char *sodium_version_string() return value was misinterpreted (1.0.11chunk_size > (size_t) 0Ustream.nonce != (uint64_t) 0U/dev/randomret == 0/dev/urandomrandombytes/salsa20/randombytes_salsa20_random.c instead of just 1.0.11). And this string built from byte[]/char[] somewhere in jnr-ffi does contain NUL char at position 6.

-> % cat /usr/lib/libsodium.so | xxd | rg 0005d220 -A10
23843:0005d220: 6d61 6c6c 6f63 0031 2e30 2e31 3100 6368  malloc.1.0.11.ch
23844-0005d230: 756e 6b5f 7369 7a65 203e 2028 7369 7a65  unk_size > (size
23845-0005d240: 5f74 2920 3055 0073 7472 6561 6d2e 6e6f  _t) 0U.stream.no
23846-0005d250: 6e63 6520 213d 2028 7569 6e74 3634 5f74  nce != (uint64_t
23847-0005d260: 2920 3055 002f 6465 762f 7261 6e64 6f6d  ) 0U./dev/random
23848-0005d270: 0072 6574 203d 3d20 3000 2f64 6576 2f75  .ret == 0./dev/u
23849-0005d280: 7261 6e64 6f6d 0000 7261 6e64 6f6d 6279  random..randomby
23850-0005d290: 7465 732f 7361 6c73 6132 302f 7261 6e64  tes/salsa20/rand
23851-0005d2a0: 6f6d 6279 7465 735f 7361 6c73 6132 305f  ombytes_salsa20_
23852-0005d2b0: 7261 6e64 6f6d 2e63 0000 0000 0000 0000  random.c........

As could be see from hexdump resulting java string ends where binary contains NUL-byte twice first time in stream.

Env:

linux 4.9.6-1-ARCH, openjdk 8u121
Windows, MSVS2015 on AppVeyor, see build https://ci.appveyor.com/project/grossws/kalium/build/3

To reproduce get code from #107 and run mvn clean test -DargLine=-Dfile.encoding=windows-1251

grossws commented 7 years ago

As I found bug with incorrect handling of null-terminated multibyte strings (NTMBS) was introduced in quite old commit.

That commit introduced two things:

use 4 nulls when passing string-like arguments to ffi calls (which is absolutely safe),
search for multibyte nulls in stream when decoding ffi call result in StringResultConverter#fromNative depending on defaultCharset (aka java file.encoding parameter or external encoding).

Second part is fragile for both NTMBS and null-terminated widechar strings (NTWS) which can lead to several issues.

Read NTMBS past ending null byte (this issue topic).
Read wide string (wchar_t *) partially on UTF-8 system. E. g. on Linux wide string L"test" returned from C would result 74 00 00 00 for first wchar_t in .rodata section on little-endian CPU and only t will be read by jnr-ffi because terminatorWidth is 1 for UTF-8.

Since UTF-8 is common across Linux and MacOSX (and for lot of other systems, I guess) for OpenJDK and OracleJDK that commit accidentaly works fine. And it works for explicitly mentioned US-ASCII and ISO 8859-1 single-byte encodins because terminatorWidth is also set to 1 for them.

Also, passing string-like to wchar_t * arguments seems to be broken too. If I pass java string "ab" to funtion void wstring_dump(const wchar_t *) first wchar_t is 61 62 00 00 intead on 61 00 00 00.

So, it's totally broken for wchar_t * strings which support was declared goal of mentioned commit and is broken for almost all encodings when used with char *.

I think, proper fix requires to split handing of char * (NTMBS) and wchar_t * (NTWS) strings, add annotation to allow user say that NTWS is passed/returned and revert NTMBS handling code to use single null char as NTMBS. Passing strings with 4 nulls is safe and should be continued since it prevents C side from reading past end of string if char * was passed instead of wchar_t *.

headius commented 7 years ago

Hello there!

This is a good find. I'm not surprised to find some problems dealing with non-UTF-8 encodings when passing strings in and out of jnr-ffi. I also agree with your assessment; there need to be a way to specify how strings are expected to be encoded when coming from a library.

Do you perhaps have some idea of the change to make? You've managed to find the problem and diagnose it pretty well...perhaps you can put together a PR that starts making things better?

grossws commented 7 years ago

@headius, I'm not yet familar with jnr-ffi codebase and lack of spare time now to fix this issue. IIRC, jna had separation for string/wstring types to avoid such issues.

I'll have this issue in mind if I have several days to dig into but can't promise anything. Also I have no systems with Windows running and this issue doesn't affect me, so it has quite low priority in my list. If someone steps up to fix it, I'll be happy with it.

m4dc4p commented 6 years ago

This bug bit me today. I don't know how to do a general fix, but for this specific instance you can use the @Encoding annotation to specify that the method returns a US-ASCII encoded string. See https://github.com/abstractj/kalium/issues/87 for details.

jnr / jnr-ffi

Broken null-terminated multibyte string handling when JVM is non-US `defaultCharset` #108