sfackler / rust-postgres

Native PostgreSQL driver for the Rust programming language
Apache License 2.0
3.39k stars 429 forks source link

Why pay the price of `std::str::from_utf8`? #1148

Closed EvanCarroll closed 1 month ago

EvanCarroll commented 1 month ago

According to the code

This library assumes that the client_encoding backend parameter has been set to UTF8. It will most likely not behave properly if that is not the case.

If we only support the client_backend using utf8, why do we deserialize strings using std::str::from_utf8 rather than from_utf8_unchecked. It just seems pointless to forgo the benefit of knowing the server is sending utf8 encoded strings if we're still checking them.

sfackler commented 1 month ago
  1. It is undefined behavior to work with Rust strings that aren't UTF-8. Assuming that a remote service will send UTF-8 to you is not sufficient to use that unsafe constructor.
  2. What is the overhead of this validation, and how have you measured it?
EvanCarroll commented 1 month ago

I agree, it's undefined behavior to work with Rust strings that aren't UTF-8. I just don't understand why it's an assumption here that the remote service won't send UTF-8 when the assumption elsewhere is that must send you UTF-8 or the whole library is undefined behavior "most likely not behave properly". If the client_encoding is utf-8, that's all PostgreSQL can send back.. You're telling it to do the encoding for you, explicitly.

client_encoding (string) # Sets the client-side encoding (character set). The default is to use the database encoding. The character sets supported by the PostgreSQL server are described in Section 24.3.1.

Currently you'll just generate errors that you can handle (there is no logic to handle other client_encodings). But we know these errors can never be hit because we explicitly set the client_encoding when we connect.

As for benchmarking, there is a benchmark of the overhead of utf8 checking here. Depending on the input you'll be limited on an x86-64 (AMD Zen2) to between 321 MB/sec - 32.5 GB/sec. This benchmark is from the simd-accelerated validation library simdutf8, which is comparing their performance which is 22x better on some cyrillic workloads (on that benchmark simdutf8 is labeled as "basic-v0.1.2").

All that is suffice to say, I don't get the safety benefit here. If you're tell PostgreSQL to only send UTF-8 it can't send back anything else (afaik), I don't get the benefit.

sfackler commented 1 month ago

UTF-8 or the whole library is undefined behavior "most likely not behave properly".

That is not what undefined behavior is.

sfackler commented 1 month ago

As for benchmarking, there is a benchmark of the overhead of utf8 checking here. Depending on the input you'll be limited on an x86-64 (AMD Zen2) to between 321 MB/sec - 32.5 GB/sec

The speed of UTF-8 validation in a vacuum doesn't really matter; it only matters in the context of the overall system. If all of that validation was magically removed, how much faster would the average process using a Postgres client be? Would that difference even be measurable?

EvanCarroll commented 1 month ago

I'm not exactly sure what you want at this point, you want me to do a benchmark that shows that the micro benchmarks provided above can be realize thorough the database too? Sounds easy enough.

PostgreSQL maximum field size is 1GB. I'll generate 20 rows on a 1 column table where each column has 1 GB of text. Then I'll show that retrieving from the table with from_utf8 is substantially slower than from_utf8_unchecked? Is this what you're looking for?

Seems obvious to me. The optimization is there precisely for libraries like this one.

sfackler commented 1 month ago

To be clear, it would not be responsible for this library to ever use from_utf8_unchecked on data received from a remote service, whether or not the the client has asked nicely to be given data in that format.

If you are in reality trying to retrieve 1GB string fields from the database (?????) and your benchmarking shows that UTF-8 validation is an unacceptable cost (or even noticeable when factoring in the time taken by Postgres to retrieve and serialize those fields, the time it takes them to traverse the network, etc), you can always make your own FromSql implementation that uses the unchecked constructor.