EvanCarroll closed this issue 1 month ago
I agree, it's undefined behavior to work with Rust strings that aren't UTF-8. I just don't understand why the assumption here is that the remote service won't send UTF-8, when the assumption elsewhere is that it must send you UTF-8 or the whole library will "most likely not behave properly". If the `client_encoding` is utf-8, that's all PostgreSQL can send back: you're explicitly telling it to do the encoding for you.
> client_encoding (string) — Sets the client-side encoding (character set). The default is to use the database encoding. The character sets supported by the PostgreSQL server are described in Section 24.3.1.
Currently you'll just generate errors that you can handle (there is no logic to handle other `client_encoding`s). But we know these errors can never be hit, because we explicitly set the `client_encoding` when we connect.
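For illustration, pinning the encoding at connect time looks like this in a libpq-style keyword/value connection string (a sketch only; rust-postgres sets `client_encoding` in its startup message itself, so users never write this by hand, and the host/user/dbname values here are placeholders):

```text
# The server rejects the startup, or reports an error, if it cannot
# honor the requested client_encoding.
host=localhost user=postgres dbname=mydb options='-c client_encoding=UTF8'
```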
As for benchmarking, there is a benchmark of the overhead of utf8 checking here. Depending on the input, you'll be limited on x86-64 (AMD Zen 2) to between 321 MB/sec and 32.5 GB/sec. This benchmark comes from the SIMD-accelerated validation library simdutf8, which compares its own performance (22x better on some Cyrillic workloads) against std; on that benchmark simdutf8 is labeled "basic-v0.1.2".
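The overhead being discussed can be measured directly with std alone. This is a rough micro-benchmark sketch (not the benchmark linked above) comparing the checked and unchecked constructors on ASCII input, which is std's fast path; a proper harness would use criterion and vary the input:

```rust
use std::time::Instant;

fn main() {
    // Build ~64 MiB of ASCII text by cycling a short phrase.
    let data: Vec<u8> = b"the quick brown fox jumps over the lazy dog "
        .iter()
        .cycle()
        .take(64 * 1024 * 1024)
        .copied()
        .collect();

    // Checked: validates every byte before handing back a &str.
    let start = Instant::now();
    let checked = std::str::from_utf8(&data).expect("input is valid UTF-8");
    let checked_time = start.elapsed();

    // Unchecked: no validation at all. Sound here only because we
    // constructed the bytes ourselves and know they are ASCII; this is
    // the guarantee the issue proposes taking from client_encoding.
    let start = Instant::now();
    let unchecked = unsafe { std::str::from_utf8_unchecked(&data) };
    let unchecked_time = start.elapsed();

    assert_eq!(checked, unchecked);
    println!("checked:   {:?}", checked_time);
    println!("unchecked: {:?}", unchecked_time);
}
```

Note that wall-clock numbers from a single pass like this are noisy; the point is only that the unchecked path does no per-byte work at all.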
All that suffices to say: I don't get the safety benefit here. If you're telling PostgreSQL to only send UTF-8, it can't send back anything else (AFAIK), so I don't see the benefit.
> UTF-8 or the whole library is undefined behavior "most likely not behave properly".
That is not what undefined behavior is.
> As for benchmarking, there is a benchmark of the overhead of utf8 checking here. Depending on the input you'll be limited on an x86-64 (AMD Zen2) to between 321 MB/sec - 32.5 GB/sec
The speed of UTF-8 validation in a vacuum doesn't really matter; it only matters in the context of the overall system. If all of that validation was magically removed, how much faster would the average process using a Postgres client be? Would that difference even be measurable?
I'm not exactly sure what you want at this point. You want me to do a benchmark showing that the micro-benchmarks provided above can be realized through the database too? Sounds easy enough.
PostgreSQL's maximum field size is 1 GB. I'll generate 20 rows in a one-column table where each column has 1 GB of text. Then I'll show that retrieving from the table with `from_utf8` is substantially slower than with `from_utf8_unchecked`? Is this what you're looking for?
Seems obvious to me. The optimization is there precisely for libraries like this one.
To be clear, it would not be responsible for this library to ever use from_utf8_unchecked on data received from a remote service, whether or not the client has asked nicely to be given data in that format.
If you are in reality trying to retrieve 1 GB string fields from the database (?????) and your benchmarking shows that UTF-8 validation is an unacceptable cost (or even noticeable when factoring in the time taken by Postgres to retrieve and serialize those fields, the time they take to traverse the network, etc.), you can always write your own `FromSql` implementation that uses the unchecked constructor.
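Such an opt-in override could look roughly like the sketch below. Only the std-only core conversion is shown so the example stands alone; the `UncheckedText` wrapper type is hypothetical, and the commented-out trait impl sketches how it would plug into the `postgres-types` crate's `FromSql` trait:

```rust
// Hypothetical wrapper type whose decoding skips UTF-8 validation.
struct UncheckedText(String);

impl UncheckedText {
    /// Decode raw wire bytes without validating UTF-8.
    ///
    /// Safety contract: the caller asserts the server really did send
    /// UTF-8 (e.g. because client_encoding was pinned at connect time).
    /// Debug builds still validate, to catch a misbehaving server.
    fn from_wire(raw: &[u8]) -> UncheckedText {
        debug_assert!(std::str::from_utf8(raw).is_ok());
        let s = unsafe { String::from_utf8_unchecked(raw.to_vec()) };
        UncheckedText(s)
    }
}

// With the postgres-types crate, the FromSql impl would be roughly:
//
//   impl<'a> FromSql<'a> for UncheckedText {
//       fn from_sql(_ty: &Type, raw: &'a [u8])
//           -> Result<Self, Box<dyn Error + Sync + Send>> {
//           Ok(UncheckedText::from_wire(raw))
//       }
//       fn accepts(ty: &Type) -> bool {
//           ty == &Type::TEXT
//       }
//   }

fn main() {
    let row_bytes = b"caf\xc3\xa9"; // "café" encoded as UTF-8
    let text = UncheckedText::from_wire(row_bytes);
    assert_eq!(text.0, "café");
    println!("{}", text.0);
}
```

This keeps the unchecked conversion in application code, behind a type the application author explicitly opted into, rather than in the library's default `String` decoding.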
According to the code: if we only support a `client_encoding` of utf8, why do we deserialize strings using `std::str::from_utf8` rather than `from_utf8_unchecked`? It just seems pointless to forgo the benefit of knowing the server is sending UTF-8-encoded strings if we're still checking them.