readysettech / readyset

Readyset is a MySQL and Postgres wire-compatible caching layer that sits in front of existing databases to speed up queries and horizontally scale read throughput. Under the hood, ReadySet caches the results of cached select statements and incrementally updates these results over time as the underlying data changes.
https://readyset.io
Other
4.48k stars 124 forks source link

Inconsistent bytes/bit handling for MySQL #373

Open frannoriega opened 1 year ago

frannoriega commented 1 year ago

Description

When using ReadySet with MySQL, we are not being consistent on how we decide to interpret bytes/bit values.

Consider this example:

mysql> CREATE TABLE t (x bit(3));
Query OK, 0 rows affected (0.02 sec)

mysql> INSERT INTO t VALUES (7);
Query OK, 1 row affected (0.03 sec)

mysql> select * from t where x = b'111';
Empty set (0.00 sec)

mysql> select * from t where x = 7;
Empty set (0.00 sec)

The expected result should be (for both select statements):

+------------+
| x          |
+------------+
| 0x07       |
+------------+
1 rows in set (0.00 sec)

For further context, I've seen that:

  1. During snapshotting, we transform mysql_common::value::Value enums into DfValue. These enums are very broad, and we are guessing the types at this stage. The Value::Bytes type gets converted into DfValue::Text (if the bytes can be parsed as a UTF-8 string) or DfValue::ByteArray (readyset-data/src/lib.rs#1664). This is how the values are then stored.
  2. When issuing the queries, we get either DfValue::Int (7) or DfValue::BitVector (b'111'). We attempt to coerce both to DfType::Bit, but even when successful, it still doesn't match the format we use to store.

Commit

79d66f7356ccd29660f696d27f8f74f15e4f6735

Change in user-visible behavior

Any query that predicates upon a Bit column in MySQL is highly likely to return incorrect results.

Requires documentation change

We don't fully support Bit in MySQL

frannoriega commented 1 year ago

I saw further inconsistencies on the way we are using BitVec internally to handle bit values.

When we create Literal::BitVector(Vec<u8>), we get a BitVec that comes from parsing the 0s and 1s from the bit representation (ie, b'111') and then we turn it to bytes by calling to_bytes

Literal::BitVector(bits.to_bytes())

That's a problem, because in order to complete the 8 bits that make a byte, BitVec adds trailing zeroes. So 111 becomes 11100000, instead of 00000111 as expected.

Another issue that I've seen, is that we don't have equality for these values when stored in PersistentState. By this I mean that if we insert 111, then we have to query with 111 in order to get results; using 0111 (or any other value that still semantically represents 111) would throw no results.

gvsg-rs commented 4 months ago

cc: @readyset-railway

Who was also trying to come up with an explanation for this.