scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0
13.65k stars 1.3k forks source link

Row slicing may give wrong answers for compact tables with non-full clustering keys #1446

Open tgrabiec opened 8 years ago

tgrabiec commented 8 years ago

Reproducer:

cqlsh:ks1> create keyspace ks1 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh:ks1> create table test (pk int, ck1 int, ck2 int, v int, primary key (pk, ck1, ck2)) with compact storage;
cqlsh:ks1> insert into ks1.test (pk, ck1, v) values (1, 1, 0);
cqlsh:ks1> insert into ks1.test (pk, ck1, v) values (1, 2, 0);
cqlsh:ks1> insert into ks1.test (pk, ck1, v) values (1, 3, 0);
cqlsh:ks1> insert into ks1.test (pk, ck1, ck2, v) values (1, 1, 1, 0);
cqlsh:ks1> insert into ks1.test (pk, ck1, ck2, v) values (1, 2, 1, 0);
cqlsh:ks1> insert into ks1.test (pk, ck1, ck2, v) values (1, 3, 1, 0);
cqlsh:ks1> select * from ks1.test where pk = 1 and ck1 = 1 and ck2 >= 1;

 pk | ck1 | ck2  | v
----+-----+------+---
  1 |   1 | null | 0
  1 |   1 |    1 | 0

(2 rows)

Expected result:


 pk | ck1 | ck2  | v
----+-----+------+---
  1 |   1 |    1 | 0

(1 rows)

The problem is that prefix equality comparator, used for slicing, doesn't partition keys properly if there are prefixes among the partitioned keys. In the example above where we have keys (1) and (1,1) and a slice with inclusive starting bound of (1, 1), doing lower_bound will return both keys because both are equal to (1,1) in prefix equality.

We could fix the problem by using heterogenous comparator, which would treat prefixes in the range differently than prefix in the partitioned sequence. The latter would be treated as if they were full keys with missing components comparing less than anything.

gleb-cloudius commented 8 years ago

Did you mean "we have keys (1) and (1,1)" (not 1,2)?

tgrabiec commented 8 years ago

@gleb-cloudius Yes. Corrected, thanks.

duarten commented 8 years ago

Out of curiosity, why are non-compact tables not affected?

pdziepak commented 8 years ago

Partial clustering keys are allowed only in compact tables. For non-compact ones all clustering keys need to have all components non-null and in such case our current comparators work correctly.

duarten commented 8 years ago

Thanks! Is is this just a CQL thing or is there a limitation in the storage format I'm missing?

tgrabiec commented 8 years ago

There is a similar issue affecting application of range tombstones created from thrift where the end bound is a prefix with eoc=0. bound_view doesn't distinguish between a prefix which covers all keys prefixed by it from the prefix which covers only the key equal to itself (thrift composite's eoc determines that).

Example:

$ sudo sstable2json /var/lib/scylla/data/ks1/c2-e4e72ef06df511e6b310000000000000/ks1-c2-ka-4-Data.db 
[
{"key": "1",
 "cells": [["","00000000",1470761440040513],
           ["a","asd",2470761440040513,"t",1470764842],
           ["asd:","00000000",1470761451368658],
           ["asd:asd","00000000",1470761449416613]]}
]

Expected result:

cqlsh> select * from ks1.c2;

 id | c1  | c2   | v
----+-----+------+---
  1 |     | null | 0
  1 | asd |      | 0
  1 | asd |  asd | 0

(3 rows)

Actual result:

cqlsh> select * from ks1.c2;

 id | c1 | c2   | v
----+----+------+---
  1 |    | null | 0

(1 rows)

C* 3.x has similar issue for which I opened https://issues.apache.org/jira/browse/CASSANDRA-12423

tgrabiec commented 8 years ago

Maybe the easiest thing to do is to convert prefix keys to full keys with empty trailing components, and make clustering_key_prefix != clustering_key as it used to be. We would avoid the confusion which stems from clustering_key_prefix mapping to different concepts (a key and a key interval)

duarten commented 8 years ago

That sounds good. An alternative, I guess, would be to introduce a different bound_kind for this purpose.

tgrabiec commented 8 years ago

Different bound_kind would also work, and is in fact less intrusive.

slivne commented 7 years ago

@tgrabiec / @haaawk - can we close this - there are many patches referencing this issue

haaawk commented 7 years ago

I need to double check but I think it might not be fully done yet.

tgrabiec commented 7 years ago

There as some remaining problems. There's this for example:

        // FIXME: Currently position_range cannot represent a range end bound which
        // includes just the prefix key or a range start which excludes just a prefix key.
        // In both cases we should return lexicographical_relation::before_all_strictly_prefixed here.
        // Refs #1446.
slivne commented 4 years ago

mc sstables provide:

reducing the priority of this issue

tgrabiec commented 4 years ago

@slivne compact storage is not just about saving space in sstables, it is also a thrift-compatibility mode which affects the shape of data as seen by the thrift API. You'll get different results for the same state of the database without compact storage.

bhalevy commented 2 years ago

@tgrabiec if the only thing left here is relevant only to thrift, let's add the thrift_api label

tgrabiec commented 2 years ago

@bhalevy It's not only relevant for thrift api, you can reach this via CQL if you use with compact storage on your table.