
Allow row cache to store incomplete partitions #960

Closed tgrabiec closed 7 years ago

tgrabiec commented 8 years ago

Right now the cache can hold a partition either in its entirety ("complete") or not at all. This is problematic for workloads that create large partitions.

To deal with this, we should allow the cache to hold incomplete partitions. The unit of eviction could be a CQL row.

haaawk commented 8 years ago

Should we make cache_entry contain a single CQL row, so that we can use the existing eviction mechanism which, if I'm not mistaken, works at the cache_entry level?

If there are no clustering columns, then a cache_entry will contain the whole partition.

avikivity commented 8 years ago

That's my understanding. But we need to ensure that an entry knows whether it is contiguous with the next entry, so that range queries continue to work.
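A minimal sketch of how such a flag could be used during a range read, with simplified stand-in types (int keys, std::string rows, std::map as the entry set) rather than the real cache structures:

#include <map>
#include <optional>
#include <string>
#include <vector>

// Simplified stand-ins: the real cache keys entries by clustering position and
// stores mutation fragments, not int keys and string rows.
struct cache_entry {
    std::string row;
    // True when there are no uncached rows between this entry and the next
    // one in the cache, i.e. the two entries are contiguous in the data.
    bool continuous_with_next = false;
};

using row_cache = std::map<int, cache_entry>;

// Returns the cached rows for [start, end] if the cache can prove it holds all
// of them; std::nullopt means the reader must fall back to sstables.
// (For brevity this ignores rows that might precede the first cached entry --
// a gap the thread discusses further below.)
std::optional<std::vector<std::string>>
read_range(const row_cache& cache, int start, int end) {
    std::vector<std::string> result;
    for (auto it = cache.lower_bound(start); it != cache.end() && it->first <= end; ++it) {
        result.push_back(it->second.row);
        if (it->first < end && !it->second.continuous_with_next) {
            return std::nullopt; // a gap the cache knows nothing about
        }
    }
    return result;
}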

haaawk commented 8 years ago

I'm not sure what the current implementation of the set of cache entries is, but with a balanced-tree implementation moving to the next entry would be fast.

tgrabiec commented 8 years ago

@haaawk The contiguity bit is about determining whether there are any uncached entries (in sstables) between the current row and the next one when reading from the cache. If so, the reader will have to read them from sstables. We have a similar problem at the partition level, for partition range queries.

haaawk commented 8 years ago

I see. Thanks for the clarification :).

haaawk commented 8 years ago

It seems to me that we can tell whether a partition is missing from the cache thanks to the partition key indexes of sstables. That makes me think that maybe we should do CQL row indexes first, before changing the cache to work at the CQL row level. With CQL row indexes we would be able to decide whether we're missing a CQL row in the cache and need to go to disk, the same way we do for partitions.

What do you think?

pdziepak commented 8 years ago

The way the cache handles range queries is suboptimal (see #1179). The row cache should know on its own whether it is missing any data. So for range queries the best solution seems to be having a flag determining whether the partition immediately following this one is also in the cache.

For clustering rows we could just make the partitions in the cache store something like the following:

union clustering_row_or_missing_info {
    struct missing_info {
        clustering_key_prefix start;
        clustering_key_prefix end;
    };
    missing_info missing;  // a range of clustering keys the cache knows nothing about
    clustering_row row;    // a cached row
};

missing_info entries would keep track of the ranges that the row cache knows nothing about. Any read would need to fall back to the sstable reader if and only if it encounters a missing_info entry instead of a clustering_row. In this case the row cache should also guarantee that all tombstones relevant to the clustering rows it holds are also present.
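A rough, self-contained illustration of how a read could walk such entries and bail out to the sstable reader on a missing_info entry; it uses std::variant in place of the sketched union, and int64_t/std::string as simplified stand-ins for the real clustering key and row types:

#include <cstdint>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Simplified stand-ins for clustering_key_prefix and clustering_row.
using clustering_key_prefix = int64_t;
struct clustering_row { clustering_key_prefix key; std::string value; };

// A range of clustering keys the cache knows nothing about.
struct missing_info { clustering_key_prefix start, end; };

using entry = std::variant<clustering_row, missing_info>;

// Walks the cached entries of one partition; returns std::nullopt as soon as a
// missing_info entry is hit, signalling a fallback to the sstable reader.
std::optional<std::vector<clustering_row>>
read_from_cache(const std::vector<entry>& partition_entries) {
    std::vector<clustering_row> rows;
    for (const auto& e : partition_entries) {
        if (std::holds_alternative<missing_info>(e)) {
            return std::nullopt; // unknown range: must go to sstables
        }
        rows.push_back(std::get<clustering_row>(e));
    }
    return rows;
}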

belliottsmith commented 8 years ago

You only know from the in-memory caches/structures that you do not need to go to disk, not that you do. You must assume that you have to go to disk unless there is explicit information to the contrary: a known empty range from a pointer in the cache, a bloom filter, or the min/max range of the sstables.

The sstable partition indexes are a disk access. In fact, that index is already also a row index, just not a very good one.

Bloom filters are typically what you use to avoid going to disk, but they are a pretty bad idea for rows since they do not support ranges: you would have to know all possible values in a range and try each one, any of which may return a false positive and tell you to access the disk anyway. So they would be an expensive waste of time.

Knowing the contiguous ranges you have in your cache really is the only logical approach to this problem, however you slice it.

tgrabiec commented 8 years ago

I think having a contiguity flag would be simpler and require less memory than a union type with missing_info. The start and end keys of a missing range are already there in the cache, in the current and the next cache entry; a missing_info entry would duplicate that information. Also, having polymorphic entries in the cache would require extra memory and branching.

haaawk commented 8 years ago

Hmm. So you want to keep it as a sorted list and do binary search on it? That will make cache updates O(number of entries).

I was thinking about some sort of interval tree, which would keep both lookups and updates at O(log(number of entries)), but I don't have anything concrete yet.

pdziepak commented 8 years ago

Predecessors and successors of elements in a tree form a sorted list, except that you still get O(log N) operations.
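For illustration, finding a neighbour in a balanced tree such as std::map is an O(log N) operation:

#include <cassert>
#include <iterator>
#include <map>

int main() {
    // Toy entry set keyed by (simplified) clustering position.
    std::map<int, const char*> entries = {{1, "a"}, {1000, "b"}, {1100, "c"}};

    // Successor of key 1: first element with a greater key, O(log N).
    assert(entries.upper_bound(1)->first == 1000);

    // Predecessor of key 1100: step back from lower_bound, also O(log N).
    assert(std::prev(entries.lower_bound(1100))->first == 1000);
    return 0;
}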

haaawk commented 8 years ago

Right. That should work too :).

haaawk commented 8 years ago

One comment regarding the contiguity bit, though: it's not quite enough. Let's say we have data with the following partition keys: 1, 1000, 1100, 2000, and then we run the following range queries: [1, 5], [10, 1500], [2, 999].

  1. After the first query we only have a single partition, with key 1, in the cache, with its contiguity bit set to false. We totally lose the info that there's nothing in [4, 5].
  2. After the second query we have 3 partitions in the cache, with keys 1, 1000 and 1100, and only 1000 will have its contiguity bit set. We again lose the information that there's nothing in [10, 999] and [1101, 1500].
  3. For the third query we need to check the whole range [2, 999], even though we should be able to reduce it to [6, 9].

I think having some placeholders saying that range [a, b] has no rows would be useful.

pdziepak commented 8 years ago

I think you meant clustering keys.

The size of the range doesn't matter; the fact that there is a range the cache knows nothing about is what's important, since it forces a fallback to sstables. So the only problem in this scenario is that 1 doesn't get its "immediate successor in the cache" bit set. That would happen with my original missing_info proposal as well (unless the ranges the user asks for are adjacent), but only the first time.

Still, in your scenario repeated queries that start somewhere in the (1, 10) range will require IO all the time (well, the index cache may help, but that's not my point), and I don't think the cache can easily get enough information from sstables to set the continuity bits and merge the ranges it knows about.

pdziepak commented 8 years ago

On 13 May 2016 at 12:09, Tomasz Grabiec notifications@github.com wrote:

> I think having a contiguity flag would be simpler and require less memory than a union type with missing_info. The start and end keys of a missing range are already there in the cache, in the current and the next cache entry; a missing_info entry would duplicate that information. Also, having polymorphic entries in the cache would require extra memory and branching.

Well, a continuity bit requires branching for each read clustering row as well. Yes, there will be increased memory usage, but as Piotr pointed out, storing missing_info gives the cache more useful information, so the start and end of a missing range won't be mere duplicates.

haaawk commented 8 years ago

I meant a partition key. For simplicity let's talk about the full partition cache for a moment, since the mechanism is similar.

It's not only about the size of the range: if we modify the example so that the second query is [6, 1500], then the third query should be fully cached.

Rather than missing_info, I would rather see a single clustering_key_prefix in each entry saying how far past this entry we have checked. That would be enough to tell whether we know everything about the range between the current entry and the next one. That would save us one clustering_key_prefix, because we really don't need the 'end' from missing_info.
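A rough sketch of an entry with such a "checked up to" bound, using an integer as a simplified stand-in for clustering_key_prefix (the names here are illustrative only):

#include <cstdint>
#include <string>

using clustering_key_prefix = int64_t; // simplified stand-in

struct cache_entry {
    clustering_key_prefix key;
    std::string row;
    // Upper bound of the range starting at `key` that has already been read
    // from sstables: everything in [key, checked_up_to] is known to the cache.
    clustering_key_prefix checked_up_to;
};

// The gap between an entry and its successor holds no unknown data iff the
// checked range reaches at least as far as the successor's key.
bool gap_fully_known(const cache_entry& cur, const cache_entry& next) {
    return cur.checked_up_to >= next.key;
}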

Why would repeated queries starting in (1, 10) always go to disk? Once a query is executed for the first time and returns no results, we store info in the cache saying range [a, b] is empty. Next time we get this query we don't go to disk, since we know there's nothing there.

belliottsmith commented 8 years ago

Alternatively, dummy "missing" rows with a contiguity bit would provide the lowest memory overhead in the case of mostly-present row accesses (which you'd probably expect to be most common). These can also easily be removed (like missing_info, but with less complexity) when a new overlapping query is performed.

I think it would be sensible to consider how all of these approaches play with eviction. It probably is not a good idea to keep eviction rooted to the partition key once this change goes in, else a few partitions that are appended to, but for which only the most recent rows are accessed, could completely dominate the cache.

haaawk commented 8 years ago

How about the following solution:

We need information in the cache about which ranges we have already checked, and we need to store the data in those ranges, if any. We could have 3 tree-based sets: range begins (partition_key), range ends (partition_key), and data (cache_entry).

In this case we can store either the ranges we know about or the ranges we don't know about. As @belliottsmith observed, this may be more efficient.
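A small sketch of the range-tracking side of this idea, collapsing the begin/end sets into a single map from range start to range end (types simplified, and merging of overlapping ranges left out):

#include <algorithm>
#include <cstdint>
#include <map>

using partition_key = int64_t; // simplified stand-in

// Ranges already read from sstables, keyed by range start -> range end.
using known_ranges = std::map<partition_key, partition_key>;

// True if [start, end] lies entirely inside one already-checked range, i.e.
// the cache can answer without touching sstables.
bool covered(const known_ranges& ranges, partition_key start, partition_key end) {
    auto it = ranges.upper_bound(start); // first range beginning after start
    if (it == ranges.begin()) {
        return false;                    // no range begins at or before start
    }
    --it;                                // range with the greatest begin <= start
    return it->second >= end;
}

// Record that [start, end] has been checked. A real implementation would also
// merge this with adjacent and overlapping ranges.
void mark_checked(known_ranges& ranges, partition_key start, partition_key end) {
    auto [it, inserted] = ranges.try_emplace(start, end);
    if (!inserted) {
        it->second = std::max(it->second, end);
    }
}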

@belliottsmith With this change we will do eviction at the CQL row level.

belliottsmith commented 8 years ago

And so that raises the question of how eviction treats these missing records.

Also, whether you really want to track eviction for individual rows or for ranges of rows seems pertinent to this whole discussion, since the per-row memory cost would be significant for small rows.

pdziepak commented 8 years ago

On 13 May 2016 at 13:06, Piotr Jastrzębski notifications@github.com wrote:

> I meant a partition key. For simplicity let's talk about the full partition cache for a moment, since the mechanism is similar.

In which case I think you mean token ;) But you're right, it's similar enough.

> It's not only about the size of the range: if we modify the example so that the second query is [6, 1500], then the third query should be fully cached.

That's what I said: the size of the range itself doesn't matter, what matters is whether the cache needs to consult sstables or not.

> Rather than missing_info, I would rather see a single clustering_key_prefix in each entry saying how far past this entry we have checked. That would be enough to tell whether we know everything about the range between the current entry and the next one. That would save us one clustering_key_prefix, because we really don't need the 'end' from missing_info.

If you keep only half the data then I'm afraid you improve only half of the possible cases. Suppose we have a partition with only a single clustering row, 5. I ask for the [2, 100] range and that row gets into the cache with end = 100. So the cache knows all there is to know about the [5, 100] range. Then I do that query again, and the cache still needs to go to the sstables to ask about the [2, 5) range.

> Why would repeated queries starting in (1, 10) always go to disk? Once a query is executed for the first time and returns no results, we store info in the cache saying range [a, b] is empty. Next time we get this query we don't go to disk, since we know there's nothing there.

That was an example for the continuity bit solution.

haaawk commented 8 years ago

> In which case I think you mean token ;) But you're right, it's similar enough.

Token, partition key... details :)

> If you keep only half the data then I'm afraid you improve only half of the possible cases. Suppose we have a partition with only a single clustering row, 5. I ask for the [2, 100] range and that row gets into the cache with end = 100. So the cache knows all there is to know about the [5, 100] range. Then I do that query again, and the cache still needs to go to the sstables to ask about the [2, 5) range.

I see. I previously misunderstood what you meant by 'missing'. I thought you meant missing from the cache, but it seems you mean missing as a range which has no rows, and the cache knows about it.

> That was an example for the continuity bit solution.

I agree then :)

pdziepak commented 8 years ago

No, missing_info is a range that the cache knows nothing about. But it also gives us information about which ranges the cache knows to have no rows. With the missing_info solution that example would look like this:

  1. A query for [2, 100] causes clustering row 5 and missing_info (-inf, 2) and (100, inf) to be inserted into the cache.
  2. All subsequent [2, 100] queries are served entirely from the cache, since readers do not come across any missing_info entries.

haaawk commented 8 years ago

OK, that's what my first understanding was.

tgrabiec commented 8 years ago

With this design, eviction of a row entry would have to replace it with a missing_info entry.

One serious complication this introduces is that after some time missing_info entries would accumulate, and thus should be subject to eviction as well. We would have to "evict" them by merging adjacent ones. The question is how to do that efficiently and effectively. If missing_info entries are separated by row entries, we would have to evict those row entries, possibly quite a few of them, in order to evict a missing_info entry.

Another issue to be aware of is that with this approach eviction would need to allocate. This may hurt performance after the cache fills up. It may also threaten forward progress of eviction if we have large keys.

Here's an alternative approach, which would also handle the case of repeated range queries over sparse data that was brought up in this thread. Instead of having missing_info entries, we have a continuity flag on each entry and a new kind of entry with just a key and no data. Such a sentinel entry would be used to represent range continuity information for bounds which don't have entries in the cache. The advantage is fast and alloc-free eviction of all kinds of entries. When evicting an entry, we clear the continuity flag of the predecessor.
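A minimal sketch of that eviction step, with simplified types: an empty optional stands in for a key-only sentinel entry, and std::map stands in for the real intrusive structures:

#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <string>

using clustering_key = int64_t; // simplified stand-in

struct cache_entry {
    // nullopt marks a sentinel entry carrying only a key and continuity
    // information, no row data.
    std::optional<std::string> row;
    // True when the interval between this entry and its successor contains no
    // rows that are absent from the cache.
    bool continuous = false;
};

using partition_entries = std::map<clustering_key, cache_entry>;

// Evicting an entry: erase it and clear the predecessor's continuity flag,
// since the range up to the successor is no longer fully represented.
// No allocation is needed, which keeps eviction safe under memory pressure.
void evict(partition_entries& entries, partition_entries::iterator it) {
    if (it != entries.begin()) {
        std::prev(it)->second.continuous = false;
    }
    entries.erase(it);
}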

belliottsmith commented 8 years ago

> Here's an alternative approach

...

> Alternatively, dummy "missing" rows with a contiguity bit would provide the lowest memory overhead

This approach is also probably the simplest to maintain; if a query overlaps such a dummy row it can easily remove it in the process of execution, helping prevent accumulation.

haaawk commented 8 years ago

It sounds like a good idea, but I see one problem with this approach: there is no way to mark in the cache that a range (-inf, a) has no rows. Maybe this is not a big deal and we can just ignore it, but it can cause an IO on every query whose range starts at -inf.

haaawk commented 8 years ago

Well, we could potentially add a cache entry with a key of -inf and its continuous flag set to true.

avikivity commented 8 years ago

I'll add that entries that have keys but not values have their uses -- they can be used to track cache misses "beyond the LRU" and so improve the replacement algorithm. For partitions, we can store the sstable index information there, and so avoid having to read the index.

Our LRU would look like

[<--- entries with keys and values --->][<---- entries with keys and index info --->][<---- entries with just keys --->]

The first section would be the shortest (but occupy the most memory, due to the values). When we evict from it, we drop the value and push the entry to the next section; from there it is evicted to the last section, which can be the longest but consumes the least amount of memory. Having the extra-long LRU allows us to make better eviction decisions, because we have more information about the access patterns.
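A toy sketch of such a segmented LRU, where eviction from a hotter section demotes the entry (dropping its heaviest part) rather than forgetting the key; the types and fields are simplified placeholders:

#include <cstdint>
#include <list>
#include <optional>
#include <string>
#include <utility>

// Simplified placeholders: the real cache holds partition data and sstable
// index entries rather than strings.
struct entry {
    int64_t key;
    std::optional<std::string> value;      // dropped on the first demotion
    std::optional<std::string> index_info; // dropped on the second demotion
};

struct segmented_lru {
    std::list<entry> with_values;     // shortest section, most memory per entry
    std::list<entry> with_index_info; // middle section
    std::list<entry> keys_only;       // longest section, cheapest per entry

    // Evicting from the first section demotes the entry to the second one,
    // so the cache retains history about the key's access pattern.
    void evict_from_first() {
        if (with_values.empty()) return;
        entry e = std::move(with_values.back());
        with_values.pop_back();
        e.value.reset();
        with_index_info.push_front(std::move(e));
    }

    // Evicting from the second section keeps just the key in the third one.
    void evict_from_second() {
        if (with_index_info.empty()) return;
        entry e = std::move(with_index_info.back());
        with_index_info.pop_back();
        e.index_info.reset();
        keys_only.push_front(std::move(e));
    }
};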

gleb-cloudius commented 8 years ago

How big is the index info, that it justifies a third stage?

avikivity commented 8 years ago

It is O(sstables known to have that partition). Maybe we don't need the third stage.

ben-manes commented 8 years ago

@avikivity

Maintaining ghost entries is a potentially expensive way to track history. You can get a similar effect by using a frequency sketch instead. See TinyLFU (paper) for an example of this approach.
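For context, a frequency sketch here means essentially a count-min sketch with small saturating counters and periodic aging; a rough illustration (not the exact 4-bit scheme from the TinyLFU paper):

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>

// Estimates how often a key was accessed using a few hashed counter rows,
// with no per-key ghost entries kept.
class frequency_sketch {
    static constexpr std::size_t rows = 4;
    static constexpr std::size_t width = 1024;
    std::array<std::array<uint8_t, width>, rows> counters{};

    static std::size_t index(uint64_t key, std::size_t row) {
        // std::hash mixed with a per-row constant; just a placeholder hash.
        return std::hash<uint64_t>{}(key ^ (row * 0x9e3779b97f4a7c15ull)) % width;
    }

public:
    void record(uint64_t key) {
        for (std::size_t r = 0; r < rows; ++r) {
            auto& c = counters[r][index(key, r)];
            if (c < 255) ++c; // saturating counter
        }
    }

    // The minimum over all rows bounds the overestimation error.
    uint8_t estimate(uint64_t key) const {
        uint8_t m = 255;
        for (std::size_t r = 0; r < rows; ++r) {
            m = std::min(m, counters[r][index(key, r)]);
        }
        return m;
    }

    // Periodic aging: halve every counter so stale history fades away.
    void age() {
        for (auto& row : counters) {
            for (auto& c : row) {
                c >>= 1;
            }
        }
    }
};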

belliottsmith commented 8 years ago

@ben-manes I'm a huge proponent of probabilistic approaches, and of TinyLFU specifically, as I'm sure you know. In fact I mentioned it to Avi several months ago in the context of the partition cache. However, here it really is not suitable at all. Currently I know of no probabilistic approach with reasonable behaviour for inequality searches, which this structure must support. It would be of literally zero utility, much like the bloom filter suggestion earlier.

Whether or not the loss of range queries would be a cost worth bearing for the partition cache is a more complex question. I think that the answer would likely be "yes" - but it would depend on workload and other factors. There are also other concerns; as the application gets faster and memory-bound queries dominate, the poor cache locality of the probabilistic operations likely becomes more of an issue. Though it is something I have not had as much time to consider as I would like.

ben-manes commented 8 years ago

Sorry, I was merely skimming the repo and saw the comment on ghost entries. I really didn't read enough about the issue to understand its relevance. Apologies for the distraction.

tgrabiec commented 7 years ago

Fixed by 9b21a9bfb61b1ab4fa4b03488a397d85fc343a36