savi-lang / savi

A fast language for programmers who are passionate about their craft.
BSD 3-Clause "New" or "Revised" License
155 stars 12 forks source link

`String` literals should not be allowed to contain NUL #335

Closed mneumann closed 1 year ago

mneumann commented 1 year ago

Something like "foo\0bar" should be a compile-time error, shouldn't it?

Having NUL (0x0) allowed in a UTF-8 String can cause problems when interacting with FFI. In my case (Postgres), I'd had to validate any string prior of writing it to the stream just to check if there is a NUL character inside.

jemc commented 1 year ago

Nominally, Strings are meant to be UTF8-encoded data (though this is not enforced at compile time or runtime, and malformed data is supposed to be dealt with lazily on extraction, though admittedly these mechanisms aren't fully fleshed out yet).

A null/zero byte is valid UTF8, so I don't think it should be disallowed from inclusion in Strings, even at the conceptual level.

Strings in Savi do not even necessarily have a null terminator at the end of the buffer they point to (for example, a slice taken using trim in the middle of another string won't have a null terminator, because if it did it, would have to create a copy or break other strings that are referring to other parts of the same buffer).

The general advice is that you should avoid FFI functions that make assumptions about a null byte being present at the end or being absent elsewhere, preferring functions that use explicit length-aware string buffers.

I know that's not always possible, but in those cases where such functions are not available, you'll need to resort to suboptimal solutions like scanning for null bytes as you've mentioned - these use cases are not the primary ones that strings in Savi are optimized for.

jemc commented 1 year ago

For more practical advice on the Postgres-specific example:

I haven't worked with the Postgres protocol myself yet, but it sounds like the Postgres backend is fastidious about returning an error when a UTF8 text item contains a null byte, so for a general purpose Postgres library written in Savi, what I'd suggest is just to surface Postgres' own validation error up to the library user when this case is encountered. Then leave it to the user to decide how to deal with this Postgres-specific limitation - they can sanitize their own strings ahead of time, or just log the error if/when it arises on their inputs in practice. The general purpose client library shouldn't need to scan ahead to sanitize the data if it's already going to be validated on the server side - it just needs to gracefully surface the server side error.

Hopefully this approach is possible. I don't know the details of your situation, so I can't be sure.

mneumann commented 1 year ago

@jemc you are right, NUL should be allowed.

Postgres protocol encodes certain Strings NUL terminated. This means, any String containing a NUL cannot be properly encoded.

The question is, should we pro-actively check Strings and report an error on the client side, without sending it to Postgres, or shoud we trust the Postgres server to notice during packet decoding that it got some invalid data and drop the connection. I'd leave that to Postgres right now...