scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

OOM during large repair #376

Closed: nyh closed this issue 9 years ago

nyh commented 9 years ago

In my tests involving repair of a large database, I always eventually reach an OOM (i.e., an std::bad_alloc) error:

ERROR [shard 0] stream_session - stream_transfer_task: Fail to send outgoing_file_message to 127.0.0.2:0: std::bad_alloc (std::bad_alloc)

An example scenario is as follows: I have two single-CPU nodes running on two IP addresses on a local machine, each given (with --cpuset) one CPU and 1 GB of memory. I add 1 million partitions to both (RF=2) using cassandra-stress:

cassandra-stress write n=1000000 -schema "replication(strategy=org.apache.cassandra.locator.SimpleStrategy,factor=2)"

Finally I start a repair:

curl -X POST "http://127.0.0.2:10000/storage_service/repair_async/keyspace1"

After a while I get the above exception, which eventually leads to the stream's failure, and finally the entire repair's failure.

It seems that using more memory only postpones the OOM, so the question is why it happens: do we really need this much memory to stream mutations (and if so, why?), or are we allocating large buffers and running into a fragmentation problem?

nyh commented 9 years ago

My investigation turned up a lot of mysterious stuff, and a workaround to make repair succeed:

First of all, it appears that the failing allocation isn't a particularly big one, just some code trying to copy a small mutation. It looks like we really are out of memory.

Second, the backtrace of the allocation failure led me to discover that when the streamer (part of the repair) reads the sstables, it caches everything it reads in the row cache! This happens because column_family::make_reader() (which is what gets called under the hood) has the code

    if (_config.enable_cache) {
        readers.emplace_back(_cache.make_reader(range));
    } else {
        readers.emplace_back(make_sstable_reader(range));
    }

And _cache.make_reader() appears to cache every single row it reads. @tgrabiec, is this deliberate? It seems wrong to me to overwrite the entire cache with all the data read during a repair (and also during compaction, etc.).
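To illustrate what this means in practice, here's a toy model (plain C++, not our actual reader code) of a read-through reader that populates the cache with every row it returns; any full-range scan, like the one repair does, ends up copying the whole dataset into the cache as it goes:

    // Toy model (not the actual ScyllaDB code) of a read-through reader:
    // every row produced by the scan is also inserted into the cache, so a
    // full-range scan like repair copies the whole dataset into the cache.
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    using row = std::pair<int, std::string>;             // (key, value)

    struct populating_reader {
        std::map<int, std::string>& cache;                // shared "row cache"
        const std::map<int, std::string>& sstables;       // toy stand-in for on-disk data
        std::map<int, std::string>::const_iterator pos;

        populating_reader(std::map<int, std::string>& c,
                          const std::map<int, std::string>& s)
            : cache(c), sstables(s), pos(sstables.begin()) {}

        std::optional<row> next() {
            if (pos == sstables.end()) {
                return std::nullopt;
            }
            row r = *pos++;
            cache.insert(r);          // populate on every read, even one-off scans
            return r;
        }
    };

With that behaviour, streaming all partitions for repair is effectively an attempt to load the entire table into the cache.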

Finally, even if we do cache everything we read, why do we run out of memory? The cache needs to evict entries once memory gets full, and it seems this eviction is not working correctly. @tgrabiec, any idea how I can check it, or who our expert on eviction is?

nyh commented 9 years ago

Oh, I forgot to say that with this patch, my first large repair test (1 million partitions, with 1 GB of memory for Scylla) passes. It's probably not the right patch, but it's a workaround:

diff --git a/database.cc b/database.cc
index fead4ef..aca958b 100644
--- a/database.cc
+++ b/database.cc
@@ -264,11 +264,11 @@ column_family::make_reader(const query::partition_range& range) const {
         readers.emplace_back(mt->make_reader(range));
     }

-    if (_config.enable_cache) {
-        readers.emplace_back(_cache.make_reader(range));
-    } else {
+//    if (_config.enable_cache) {
+//        readers.emplace_back(_cache.make_reader(range));
+//    } else {
         readers.emplace_back(make_sstable_reader(range));
-    }
+//    }

     return make_combined_reader(std::move(readers));
 }
tgrabiec commented 9 years ago

2015-09-20 18:16 GMT+02:00 nyh notifications@github.com:

My investigation turned up a lot of mysterious stuff, and a workaround to make repair succeed:

First of all, it appears that the failing allocation isn't a particularly big one, just some code trying to copy a small mutation. It looks like we really are out of memory.

Second, the backtrace of the allocation failure led me to discover that when the streamer (part of the repair) reads the sstables, it caches everything it reads in the row cache! This happens because column_family::make_reader() (which is what gets called under the hood) has the code

if (_config.enable_cache) {
    readers.emplace_back(_cache.make_reader(range));
} else {
    readers.emplace_back(make_sstable_reader(range));
}

And _cache.make_reader() appears to cache every single row it reads. @tgrabiec https://github.com/tgrabiec, is this deliberate? It seems wrong to me to overwrite the entire cache with all the data read during a repair (and also during compaction, etc.).

make_reader() is not only used by repair, but by all queries. The fact that cache misses populate the cache is deliberate; we want the same query to hit the cache the next time. We should special-case this for repair so that the cache won't get populated.
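One possible shape for that special-casing, just as a sketch (the bypass_cache parameter is hypothetical, nothing like it exists in the tree yet), would be to let the repair/streaming path ask for an uncached reader:

    // Sketch only: bypass_cache is a hypothetical parameter, not actual code.
    // Repair/streaming would pass bypass_cache = true so that a full-range scan
    // reads straight from the sstables and does not displace the row cache.
    mutation_reader
    column_family::make_reader(const query::partition_range& range, bool bypass_cache) const {
        std::vector<mutation_reader> readers;
        // ... memtable readers are added here exactly as in the current code ...
        if (_config.enable_cache && !bypass_cache) {
            readers.emplace_back(_cache.make_reader(range));   // read-through, populates the cache
        } else {
            readers.emplace_back(make_sstable_reader(range));  // no cache population
        }
        return make_combined_reader(std::move(readers));
    }

Regular reads would keep the read-through behaviour, while repair (and possibly compaction) would read straight from the sstables.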

Finally, even if we do cache everything we read, why do we run out of memory? The cache needs to evict entries once memory gets full, and it seems this eviction is not working correctly. @tgrabiec https://github.com/tgrabiec, any idea how I can check it, or who our expert on eviction is?

The problem is that row_cache::populate() locks its region so that the iterator into the partitions map remains valid. Because of that, the cache is not evictable while we insert the entry. I think this can be fixed trivially; I will send a patch.

nyh commented 9 years ago

On Sun, Sep 20, 2015 at 7:45 PM, Tomasz Grabiec notifications@github.com wrote:

The problem is that row_cache::populate() locks its region so that the iterator into the partitions map remains valid. Because of that, the cache is not evictable while we insert the entry. I think this can be fixed trivially; I will send a patch.

I didn't understand what you said here, so a patch would indeed be helpful :-)

To clarify what I'm up against, in the repair scenario we have:

  1. 256 (!) local_readers being opened in parallel, each opening a bunch of memtable readers for a different range, reading the partitions in that range and sending them to the remote machine.
  2. In parallel to that, the remote machine is sending us partitions as quickly as it can, so we keep writing to memtables and flushing them to disk.

All that parallel activity surely requires memory, but I would think that 1 GB of memory is enough - and it is enough if caching is commented out.

So I don't think the problem is that a single partition can't be evicted while it's being inserted - doesn't inserting each partition take only a tiny amount of time? It looks more like nothing gets evicted at all. (So far, though, this is just a guess; I'll need to do more tests to be really sure what happens.)

Nadav Har'El nyh@cloudius-systems.com

tgrabiec commented 9 years ago

Cache has a tendency to eat up all available memory. It is evicted on demand, but this happens only at certain points in time, namely during large allocation requests. Small allocations, which are served from small-object pools, won't usually trigger this. Large allocations happen, for example, when an LSA region needs a new segment, e.g. when the row cache is populated. If, for a certain period, large allocations happen only inside row_cache::update(), then eviction will not be able to make forward progress, because the cache's LSA region is locked inside row_cache::update(); while it's locked, data can't be evicted from it. If you get std::bad_alloc from within row_cache::update(), that would prove this theory.
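To make the failure mode concrete, here is a toy model (plain C++, not the real LSA code) of a budgeted region with on-demand eviction: when a reclaim lock is held for the whole insert, eviction can't make progress and the insert path hits bad_alloc, while the same workload completes if eviction is allowed.

    // Toy model (not the real LSA implementation) of on-demand eviction.
    // A fixed budget is shared by cached entries; eviction runs only when an
    // allocation would exceed the budget. If a "reclaim lock" is held for the
    // whole insert, eviction cannot make progress and we hit bad_alloc.
    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <new>

    struct region {
        std::size_t budget = 1000;           // total memory budget
        std::size_t used = 0;                // currently allocated
        std::deque<std::size_t> cache;       // sizes of cached entries, oldest first
        bool reclaim_locked = false;         // true -> eviction forbidden

        void evict_on_demand(std::size_t need) {
            while (used + need > budget && !cache.empty() && !reclaim_locked) {
                used -= cache.front();       // evict the oldest cached entry
                cache.pop_front();
            }
        }

        void allocate(std::size_t n) {
            evict_on_demand(n);
            if (used + n > budget) {
                throw std::bad_alloc();      // reclaim could not make progress
            }
            used += n;
        }

        // Insert one entry into the cache, optionally holding the reclaim lock
        // for the duration of the insert (the pre-patch behaviour).
        void populate(std::size_t entry_size, bool hold_lock) {
            reclaim_locked = hold_lock;
            allocate(entry_size);
            cache.push_back(entry_size);
            reclaim_locked = false;
        }
    };

    int main() {
        for (bool hold_lock : {true, false}) {
            region r;
            try {
                for (int i = 0; i < 100; ++i) {
                    r.populate(50, hold_lock);   // repair keeps inserting rows
                }
                std::cout << "hold_lock=" << hold_lock
                          << ": all inserts done, cached=" << r.cache.size() << "\n";
            } catch (const std::bad_alloc&) {
                std::cout << "hold_lock=" << hold_lock
                          << ": bad_alloc after " << r.cache.size() << " inserts\n";
            }
        }
        return 0;
    }

In the toy run, hold_lock=true fails with bad_alloc as soon as the budget fills up, while hold_lock=false keeps evicting the oldest entries and finishes; the allocating_section in the patch below is meant to give the real populate path the second behaviour.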

Can you try the following patch?

diff --git a/row_cache.hh b/row_cache.hh
index dedffd4..25a66c5 100644
--- a/row_cache.hh
+++ b/row_cache.hh
@@ -162,6 +162,7 @@ private:
     partitions_type _partitions; // Cached partitions are complete.
     mutation_source _underlying;
     logalloc::allocating_section _update_section;
+    logalloc::allocating_section _populate_section;
     logalloc::allocating_section _read_section;
     mutation_reader make_scanning_reader(const query::partition_range&);
     void on_hit();
diff --git a/row_cache.cc b/row_cache.cc
index 1edce94..9b4c896 100644
--- a/row_cache.cc
+++ b/row_cache.cc
@@ -220,7 +220,7 @@ row_cache::~row_cache() {

 void row_cache::populate(const mutation& m) {
     with_allocator(_tracker.allocator(), [this, &m] {
-        logalloc::reclaim_lock _(_tracker.region());
+        _populate_section(_tracker.region(), [&] {
         auto i = _partitions.lower_bound(m.decorated_key(), cache_entry::compare(_schema));
         if (i == _partitions.end() || !i->key().equal(*_schema, m.decorated_key())) {
             cache_entry* entry = current_allocator().construct<cache_entry>(m.decorated_key(), m.partition());
@@ -231,6 +231,7 @@ void row_cache::populate(const mutation& m) {
             // We cache whole partitions right now, so if cache already has this partition,
             // it must be complete, so do nothing.
         }
+        });
     });
 }

nyh commented 9 years ago

On Sun, Sep 20, 2015 at 7:45 PM, Tomasz Grabiec notifications@github.com wrote:

And _cache.make_reader() appears to cache every single row it reads. @tgrabiec https://github.com/tgrabiec, is this deliberate? It seems wrong to me to overwrite the entire cache with all the data read during a repair (and also during compaction, etc.).

make_reader() is not only used by repair, but by all queries. The fact that cache misses populate the cache is deliberate; we want the same query to hit the cache the next time. We should special-case this for repair so that the cache won't get populated.

I've opened a separate issue, #382, to follow up on this. The problem is that column_family::make_reader(), which used to iterate directly over the sstables without any layer of caching, now has caching, and all the layers built on top of it, such as storage_proxy::make_local_reader(), "inherit" this caching whether or not that was intended. So we'll need to rethink this (although I guess it's not urgent now).

nyh commented 9 years ago

@tgrabiec, your patch appears to work: while previously I couldn't repair 1 million partitions with 1 GB of memory per node, now I can. So while I can't say I understand what your patch does, I think we want it in, and we should mark it as fixing this issue.