ralfbiedert / interoptopus

The polyglot bindings generator for your library (C#, C, Python, …) 🐙

AsciiPointer is UTF8Pointer? #50

Open seaeagle1 opened 2 years ago

seaeagle1 commented 2 years ago

To my pleasant surprise, I discovered that the AsciiPointer pattern simply wraps Rust's CString, which handles UTF-8 rather than just ASCII. This is not an issue when transferring strings into Rust, since ASCII is a subset of UTF-8, so Rust will happily handle both. However, it might surprise users when receiving strings from Rust, if the Rust code accidentally outputs non-ASCII Unicode characters (there's currently no check on this).
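To illustrate (a minimal std-only sketch, not interoptopus code): CString's only validation is rejecting interior NUL bytes, so non-ASCII UTF-8 passes straight through.

```rust
use std::ffi::CString;

fn main() {
    // CString::new only rejects interior NUL bytes; it validates neither
    // ASCII nor UTF-8, so non-ASCII data passes through unchecked.
    let ascii = CString::new("hello").unwrap();   // plain ASCII: fine
    let utf8 = CString::new("héllo 🐙").unwrap(); // non-ASCII UTF-8: also fine
    assert!(ascii.as_bytes().is_ascii());
    assert!(!utf8.as_bytes().is_ascii());

    // The only rejected input is one with an embedded 0x0:
    assert!(CString::new("he\0llo").is_err());
}
```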

I see two possible solutions: either add a check in the Rust implementation of AsciiPointer to make sure the string actually is ASCII, or, my preference, just support Unicode. For C both are null-terminated byte strings anyway, so it only needs a clear warning sign for developers. For C# I've made a few small changes to the backend to explicitly marshal strings as UTF-8, which seems to work quite nicely (see https://github.com/seaeagle1/interoptopus/tree/utf8 ). I don't know much Python, but I'm sure it can also handle Unicode strings.

ralfbiedert commented 2 years ago

tl;dr - I agree we should have an additional UTF8Pointer, but I think AsciiPointer should stay ASCII.

it might surprise users when receiving strings from Rust, if the Rust code accidentally outputs non-ASCII Unicode characters (there's currently no check on this).

Maybe I misunderstand, but don't we already have "reasonable checks" for that? Like, you can only create an AsciiPointer from a &CStr or &[u8]; to squeeze a &str in you'd need s.as_bytes(). But then again, it's so easy to get a byte slice from anywhere that there are technically "infinitely many" failure modes once you start accepting byte slices for stringy things. None of them are UB, they just give weird results in C# & co.
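A std-only sketch of what "infinitely many failure modes, none of them UB" looks like in practice:

```rust
use std::ffi::CStr;

fn main() {
    // A &str squeezed in via as_bytes() still carries valid UTF-8...
    let via_str: &[u8] = "héllo\0".as_bytes();

    // ...but any byte soup can take the same route; this is not valid
    // UTF-8 under any decoding:
    let garbage: &[u8] = &[0xC3, 0x28, 0x00];

    // Both satisfy the CStr-style contract (NUL-terminated, no interior
    // NUL). Neither is UB; the second just turns into mojibake in C# & co.
    assert!(CStr::from_bytes_with_nul(via_str).is_ok());
    assert!(CStr::from_bytes_with_nul(garbage).is_ok());
}
```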

I see two possible solutions: either add a check in the Rust implementation of AsciiPointer to make sure the string actually is ASCII, or, my preference, just support Unicode.

How about we introduce a new type UTF8Pointer instead? That way you can express more clearly what you really want on both sides. The current AsciiPointer would stay as-is (again, it has to accept a &[u8] in any case, which is hard to protect), but a UTF8Pointer might only offer a ::from_str(&str) constructor.
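A hypothetical sketch of what such a type might look like; the field layout, names, error type, and NUL-termination requirement are all illustrative, not the actual interoptopus API:

```rust
use std::ffi::c_char;
use std::marker::PhantomData;

// Hypothetical sketch only, not the actual interoptopus API.
#[repr(C)]
pub struct UTF8Pointer<'a> {
    ptr: *const c_char,
    _phantom: PhantomData<&'a str>,
}

impl<'a> UTF8Pointer<'a> {
    // Unlike an AsciiPointer built from arbitrary &[u8], this can only be
    // constructed from a &str, so the payload is guaranteed valid UTF-8.
    // The caller still has to provide the trailing NUL for the C side.
    pub fn from_str(s: &'a str) -> Result<Self, &'static str> {
        if !s.ends_with('\0') {
            return Err("string must end with a NUL terminator");
        }
        if s[..s.len() - 1].contains('\0') {
            return Err("string must not contain interior NUL bytes");
        }
        Ok(Self {
            ptr: s.as_ptr() as *const c_char,
            _phantom: PhantomData,
        })
    }
}
```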

I don't know much Python, but I'm sure it can also handle Unicode strings.

Yes, although it would be good to check how Python's internal string representation and any conversions play along with Python's object "lifetimes" when calling FFI.

This is not a blocker for C#, but a possible footgun when implementing the Python codegen and convenience wrappers. For example, is there a utf8"test" string literal representation in Python, and if not, would this Python pseudocode ffi_passthrough("test".to_utf8()).to_string().last_char() segfault?
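The Python behavior would need testing; for comparison, the same class of lifetime footgun exists in plain Rust, where the temporary owner of the bytes is dropped before the pointer crosses the FFI boundary:

```rust
use std::ffi::{c_char, CString};

fn main() {
    // The CString temporary is dropped at the end of this statement,
    // so `dangling` points at freed memory before any FFI call could
    // use it (rustc even has a lint for this pattern).
    let dangling: *const c_char = CString::new("test").unwrap().as_ptr();

    // The fix: keep the owner alive for the duration of the call.
    let owner = CString::new("test").unwrap();
    let valid: *const c_char = owner.as_ptr(); // fine while `owner` lives
    let _ = (dangling, valid);
}
```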

seaeagle1 commented 2 years ago

tl;dr - I agree we should have an additional UTF8Pointer, but I think AsciiPointer should stay ASCII.

That's another option, but it seems like a lot of duplication for relatively little gain, since you can use a UTF8Pointer perfectly fine for ASCII as well, as long as you write the additional ASCII-range check yourself.

Maybe I misunderstand, but don't we already have "reasonable checks" for that? Like, you can only create an AsciiPointer from a &CStr or &[u8]; to squeeze a &str in you'd need s.as_bytes(). But then again, it's so easy to get a byte slice from anywhere that there are technically "infinitely many" failure modes once you start accepting byte slices for stringy things. None of them are UB, they just give weird results in C# & co.

I couldn't find one - there's one commented out in the from_bytes() section, but right now the only active check is whether it's a null-terminated string (which is also the only check that CString::new() does, and the one that's indeed required for safety). That's sort of my point: the whole Rust and C interface already fully supports the extended UTF-8 range as-is. It's just that when strings reach C# (and Python?) they're being limited to ASCII. The string types in both also support Unicode by default - so why not use that? If your application then only supports ASCII, it's easy to write a string.is_ascii()-like check somewhere.
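One possible shape of such a check, as a std-only sketch on the receiving side (function name and error handling are illustrative):

```rust
use std::ffi::{c_char, CStr};

// After receiving a C string over FFI, validate UTF-8 first, then
// (only if the application needs it) ASCII.
unsafe fn receive_ascii<'a>(ptr: *const c_char) -> Result<&'a str, &'static str> {
    let s = unsafe { CStr::from_ptr(ptr) }
        .to_str()
        .map_err(|_| "not valid UTF-8")?;
    if !s.is_ascii() {
        return Err("not ASCII");
    }
    Ok(s)
}
```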

I don't know much Python, but I'm sure it can also handle Unicode strings.

Yes, although it would be good to check how Python's internal string representation and any conversions play along with Python's object "lifetimes" when calling FFI.

This is not a blocker for C#, but a possible footgun when implementing the Python codegen and convenience wrappers. For example, is there a utf8"test" string literal representation in Python, and if not, would this Python pseudocode ffi_passthrough("test".to_utf8()).to_string().last_char() segfault?

Well, here my Python knowledge is lacking. It would be interesting to test what happens if you feed UTF-8 characters through the Python bindings... from what I can find, the default string encoding in recent Python versions is UTF-8, so glancing at the current code it might just work as-is.

ralfbiedert commented 2 years ago

That's another option, but it seems like a lot of duplication for relatively little gain, since you can use a UTF8Pointer perfectly fine for ASCII as well, as long as you write the additional ASCII-range check yourself.

I agree on the Rust side, but I'm concerned about the FFI side.

If we only had UTF8Pointer, that means every backend and every language would have to support it. Since UTF-8 can contain 0x0, these backends couldn't "type-safely" receive strings from Rust anymore.

In other words, if you wrote a Rusty fn f() -> UTF8Pointer, then in a hypothetical X-Lang without native UTF-8 support you couldn't easily generate f_wrapped() -> String anymore, as the implicit contract is now that Rust either has to filter its strings, or 'X-Lang' might receive strings cut short (at an inner 0x0).
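A std-only illustration of the "cut short" case: a Rust string may legally contain an inner 0x0, but a NUL-terminated representation can't carry it across.

```rust
use std::ffi::CString;

fn main() {
    // Perfectly valid Rust/UTF-8 string containing an inner 0x0:
    let s = "ab\0cd";
    assert_eq!(s.len(), 5);

    // A NUL-terminated representation can't round-trip it: CString
    // rejects the interior NUL outright...
    assert!(CString::new(s).is_err());

    // ...and a C-side reader stopping at the first 0x0 would see only
    // "ab": exactly the "cut short" case described above.
}
```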

I couldn't find one - there's one commented out in the from_bytes() section, but right now the only active check is whether it's a null-terminated string (which is also the only check that CString::new() does, and the one that's indeed required for safety).

Yeah, I was originally hoping to filter for 7-bit ASCII, but it's usually the roundtrip scenario that makes this hard (like in that commented-out check, when C# calls in).

If your application then only supports ASCII, it's easy to write a string.is_ascii()-like check somewhere.

I'm not entirely sure where that call would go. I'd agree that string.is_ascii() is about as wonky as being able to (having to) accept random &[u8] slices, but if these two string flavors differ even slightly, why not use two types (compare Rust's 10+ stringy types ...).

FWIW, I'm generally sympathetic to the "too many similar types" / code-duplication argument, but in this case I think it's inherent complexity that's better handled explicitly.

FeldrinH commented 1 year ago

Maybe I'm misunderstanding something, but it seems to me that AsciiPointer is in practice just a 0x0-terminated u8 slice, not strictly ASCII, UTF-8, or any other encoding. If that's the case, then calling it AsciiPointer seems misleading and will probably lead users to make dangerous assumptions.

ralfbiedert commented 1 year ago

just a 0x0-terminated u8 slice, not strictly ASCII, UTF-8, or any other encoding

Yes, technically it's a &CStr equivalent. It's called AsciiPointer because it was mostly used on incoming calls, to signal that callers should only provide ASCII data.

You are right that from an outgoing perspective a Rust user might provide non-ASCII data which third-party languages might see, and after reviewing this thread I think there was some confusion on my side due to the title ('AsciiPointer is UTF8Pointer'). There are probably two ways forward: enforce ASCII on construction, or rename the type to reflect that it's really just a C string pointer.

If renamed, we might want to add more stringy types.