mongodb-partners / mongo-rocks

MongoDB storage integration layer for the Rocks storage engine

High cpu usage #3

Closed tanguylebarzic closed 9 years ago

tanguylebarzic commented 9 years ago

Hi,

Forgive me if this is not the correct place to post this. I'm trying mongo-rocks to see how it performs, and have set up a server with rocksdb as the storage engine to act as a secondary of a replica set. The initial sync took a while but finished successfully.

However, now that the server is in the 'SECONDARY' state, it doesn't manage to stay in sync: the delay (the difference between the optimeDate and the lastHeartbeat date) keeps growing. The CPU usage stays close to 100%, while the other slaves in the replica set (using mongodb 2.6.4, with the 'old' storage engine) stay up-to-date with a CPU usage of around 13%. While I thought it could be due to compression (trading CPU cycles for storage), the difference just seems too big.

All instances are m3.large (2 cores) on AWS. The compression algorithm is snappy. The biggest collections are capped collections (I don't know if that could have an impact). Happy to provide any additional information you may need!

igorcanadi commented 9 years ago

Hey @tanguylebarzic! Thanks for trying out mongo-rocks! The behavior you observe is likely due to capped collections -- we don't handle them well (except the oplog capped collection -- we have special optimizations in place for that one).

tanguylebarzic commented 9 years ago

Thanks for the quick response @igorcanadi! Do you think TTL collections would suffer from the same issue? Thanks!

igorcanadi commented 9 years ago

TTL should be much better!

tanguylebarzic commented 9 years ago

Will try that then, thanks!

igorcanadi commented 9 years ago

@tanguylebarzic If you cherry-pick this commit into your build, you'll see a big improvement: https://github.com/mongodb-partners/mongo/commit/4375146b5904e5949455659e504dfe058405da9b. The problem is that mongo's higher-level code requires the document count in capped collections to be strictly enforced (i.e. if you specify 100 documents, we're not allowed to have 101). This commit will fail the unit tests, but will perform better because we no longer have to delete one document for every insert we do.
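To make the trade-off concrete, here is a toy model (my illustration, not the actual mongo-rocks code): strict enforcement pays for a cleanup pass on every insert once the cap is reached, while a relaxed cap can overshoot and trim in batches.

```cpp
#include <cstddef>
#include <deque>
#include <iostream>
#include <string>

// Toy model of the trade-off described above (not the actual mongo-rocks
// code). Strict mode trims the collection back to the cap on every insert,
// so every insert past the cap triggers a cleanup pass; relaxed mode lets
// the collection overshoot by ~10% and trims in batches, so far fewer
// cleanup passes (each of which is an expensive front seek in an LSM store)
// are needed.
struct CappedCollection {
    CappedCollection(std::size_t max, bool strictMode)
        : maxDocs(max), strict(strictMode) {}

    void insert(std::string doc) {
        docs.push_back(std::move(doc));
        std::size_t threshold = strict ? maxDocs : maxDocs + maxDocs / 10;
        if (docs.size() > threshold) {
            ++cleanupPasses;  // one seek to the front of the collection
            while (docs.size() > maxDocs) docs.pop_front();
        }
    }

    std::size_t maxDocs;
    bool strict;
    std::deque<std::string> docs;
    std::size_t cleanupPasses = 0;
};

int main() {
    CappedCollection strictColl(100, true);
    CappedCollection relaxedColl(100, false);
    for (int i = 0; i < 1000; ++i) {
        strictColl.insert("doc");
        relaxedColl.insert("doc");
    }
    std::cout << "strict cleanup passes:  " << strictColl.cleanupPasses << '\n'
              << "relaxed cleanup passes: " << relaxedColl.cleanupPasses << '\n';
}
```

With 1000 inserts into a 100-document cap, the strict variant performs roughly 900 cleanup passes while the relaxed one performs under 100, which is where the speedup comes from.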

tanguylebarzic commented 9 years ago

@igorcanadi Thanks for the advice! I just tested this. It seems to be a bit better, but not by much (CPU is still at 100%, and it still falls out of sync - a bit more slowly, maybe). I'll try using TTLs at some point (not an easy change) and see if it behaves better.

igorcanadi commented 9 years ago

Curious -- can you share the CPU profile when CPU is at 100%?

tanguylebarzic commented 9 years ago

Sure! You mean running with --use-cpu-profiler, letting it run for a while, and capturing the profile? If so, I'll try it once the instance has caught up again - probably tomorrow (it was too stale).

igorcanadi commented 9 years ago

If you have a newer kernel I would recommend using perf events: https://perf.wiki.kernel.org/index.php/Tutorial. It's pretty neat.
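For example, a typical capture looks like this (standard perf usage; the sampling frequency and duration here are just illustrative choices):

```
# system-wide sampling at 99 Hz with call graphs, for 60 seconds
sudo perf record -F 99 -a -g -- sleep 60
# browse the collected profile interactively
sudo perf report
```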

igorcanadi commented 9 years ago

Or you can just run sudo perf top to see where the CPU cost is highest.

tanguylebarzic commented 9 years ago

Using perf top as suggested, here's the output (after letting it run for a few minutes):

Samples: 576K of event 'cpu-clock', Event count (approx.): 30190718950
 17,37%  mongod               [.] rocksdb::BlockIter::ParseNextKey()
 14,31%  libc-2.19.so         [.] 0x0000000000160081
 11,14%  mongod               [.] rocksdb::InternalKeyComparator::Compare(rocksdb::Slice const&, rocksdb::Slice const&) const
  8,77%  mongod               [.] void std::__push_heap<__gnu_cxx::__normal_iterator<rocksdb::IteratorWrapper**, std::vector<rocksdb::IteratorWrapper*, std::allocator<rocksdb::IteratorWrapper*> > >, long, rocksdb::IteratorWrapper*, rocksd
  7,37%  mongod               [.] rocksdb::(anonymous namespace)::BytewiseComparatorImpl::Compare(rocksdb::Slice const&, rocksdb::Slice const&) const
  5,25%  mongod               [.] void std::__adjust_heap<__gnu_cxx::__normal_iterator<rocksdb::IteratorWrapper**, std::vector<rocksdb::IteratorWrapper*, std::allocator<rocksdb::IteratorWrapper*> > >, long, rocksdb::IteratorWrapper*, rock
  4,87%  mongod               [.] rocksdb::MergingIterator::Next()
  4,68%  mongod               [.] rocksdb::DBIter::FindNextUserEntryInternal(bool)
  3,81%  mongod               [.] rocksdb::(anonymous namespace)::TwoLevelIterator::Next()
  3,21%  mongod               [.] rocksdb::(anonymous namespace)::ShardedLRUCache::Lookup(rocksdb::Slice const&)
  2,59%  mongod               [.] rocksdb::Block::NewIterator(rocksdb::Comparator const*, rocksdb::BlockIter*, bool)
  1,53%  mongod               [.] rocksdb::(anonymous namespace)::TwoLevelIterator::SkipEmptyDataBlocksForward()
  0,91%  mongod               [.] rocksdb::(anonymous namespace)::TwoLevelIterator::Valid() const
  0,88%  libpthread-2.19.so   [.] pthread_mutex_unlock
  0,67%  mongod               [.] rocksdb::(anonymous namespace)::TwoLevelIterator::key() const
  0,64%  mongod               [.] rocksdb::BlockIter::Valid() const
  0,60%  mongod               [.] rocksdb::MergingIterator::Valid() const
  0,51%  libpthread-2.19.so   [.] pthread_mutex_lock
  0,50%  mongod               [.] rocksdb::GetVarint32PtrFallback(char const*, char const*, unsigned int*)
  0,46%  mongod               [.] rocksdb::BlockIter::Next()
  0,41%  mongod               [.] rocksdb::MergingIterator::key() const
  0,40%  mongod               [.] rocksdb::(anonymous namespace)::ShardedLRUCache::Release(rocksdb::Cache::Handle*)
  0,40%  mongod               [.] rocksdb::BlockIter::key() const
  0,39%  mongod               [.] memcmp@plt
  0,36%  mongod               [.] rocksdb::MemTableIterator::key() const
  0,33%  mongod               [.] rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)
  0,29%  mongod               [.] operator new(unsigned long)
  0,27%  mongod               [.] rocksdb::(anonymous namespace)::SkipListRep::Iterator::key() const
  0,25%  perf                 [.] 0x0000000000058579
  0,23%  mongod               [.] operator delete(void*)
  0,22%  mongod               [.] rocksdb::MemTable::KeyComparator::operator()(char const*, char const*) const
  0,21%  mongod               [.] rocksdb::SkipList<char const*, rocksdb::MemTableRep::KeyComparator const&>::FindGreaterOrEqual(char const* const&, rocksdb::SkipList<char const*, rocksdb::MemTableRep::KeyComparator const&>::Node**) const
  0,21%  libstdc++.so.6.0.19  [.] std::string::_M_mutate(unsigned long, unsigned long, unsigned long)
  0,21%  mongod               [.] rocksdb::Hash(char const*, unsigned long, unsigned int)
  0,16%  mongod               [.] rocksdb::(anonymous namespace)::TwoLevelIterator::InitDataBlock()
  0,12%  mongod               [.] memcpy@plt
  0,11%  mongod               [.] rocksdb::BlockIter::BinarySeek(rocksdb::Slice const&, unsigned int, unsigned int, unsigned int*)
  0,10%  mongod               [.] rocksdb::GetVarint64Ptr(char const*, char const*, unsigned long*)
  0,10%  mongod               [.] rocksdb::ReadRecordFromWriteBatch(rocksdb::Slice*, char*, unsigned int*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*)
  0,10%  mongod               [.] rocksdb::BlockIter::SeekToFirst()
  0,09%  mongod               [.] rocksdb::BlockIter::status() const
  0,08%  mongod               [.] rocksdb::(anonymous namespace)::GetCacheKey(char const*, unsigned long, rocksdb::BlockHandle const&, char*)
  0,08%  mongod               [.] rocksdb::(anonymous namespace)::TwoLevelIterator::SetSecondLevelIterator(rocksdb::Iterator*)
  0,07%  mongod               [.] rocksdb::WriteBatchEntryComparator::operator()(rocksdb::WriteBatchIndexEntry const*, rocksdb::WriteBatchIndexEntry const*) const
  0,07%  mongod               [.] rocksdb::MemTableIterator::Next()
  0,07%  mongod               [.] rocksdb::WriteBatchEntryComparator::CompareKey(unsigned int, rocksdb::Slice const&, rocksdb::Slice const&) const
  0,07%  mongod               [.] rocksdb::BlockBasedTable::GetDataBlockFromCache(rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Cache*, rocksdb::Cache*, rocksdb::Statistics*, rocksdb::ReadOptions const&, rocksdb::BlockBasedTable:
  0,07%  mongod               [.] rocksdb::BlockHandle::DecodeFrom(rocksdb::Slice*)
  0,07%  mongod               [.] rocksdb::(anonymous namespace)::SkipListRep::Iterator::Next()
  0,07%  libstdc++.so.6.0.19  [.] std::string::assign(char const*, unsigned long)
  0,07%  mongod               [.] mongo::BSONElement::size() const
  0,06%  mongod               [.] rocksdb::(anonymous namespace)::GetEntryFromCache(rocksdb::Cache*, rocksdb::Slice const&, rocksdb::Tickers, rocksdb::Tickers, rocksdb::Statistics*)
  0,06%  libstdc++.so.6.0.19  [.] std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long)
  0,06%  mongod               [.] rocksdb::ReadableWriteBatch::GetEntryFromDataOffset(unsigned long, rocksdb::WriteType*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*) const
  0,06%  libc-2.19.so         [.] strlen
  0,05%  mongod               [.] rocksdb::BlockIter::~BlockIter()
  0,05%  mongod               [.] rocksdb::Iterator::~Iterator()
  0,05%  [kernel]             [k] _raw_spin_unlock_irqrestore
  0,05%  mongod               [.] rocksdb::BlockBasedTable::BlockEntryIteratorState::NewSecondaryIterator(rocksdb::Slice const&)
  0,04%  mongod               [.] MurmurHash3_x64_128(void const*, int, unsigned int, void*)
  0,04%  mongod               [.] std::_Hashtable<mongo::StringData, std::pair<mongo::StringData const, std::pair<unsigned long, std::_List_iterator<std::pair<std::string, unsigned long> > > >, std::allocator<std::pair<mongo::StringData c
  0,04%  mongod               [.] mongo::BSONObj::getField(mongo::StringData const&) const
  0,04%  [kernel]             [k] finish_task_switch

Don't hesitate to ask if you want more!

igorcanadi commented 9 years ago

Yup, exactly what I expected, thanks. So what happens here is that we need to loop through all of the tombstones at the beginning of the capped collection. To clean up the beginning of the FIFO capped collection, we need to:

  1. seek to the first element
  2. delete a couple of elements starting from the first element

Because RocksDB is built on an LSM architecture, deletes are actually just tombstones. Thus, the first operation needs to loop through a lot of tombstones to get to the first element, and the problem only grows as we delete more data and accumulate more tombstones. Cassandra has the same problem: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
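As a minimal sketch of that cleanup pattern against the RocksDB C++ API (the database path and batch size here are made up for illustration; this is not the mongo-rocks code):

```cpp
#include <cassert>
#include <memory>

#include "rocksdb/db.h"

// Trim the front of a FIFO "capped collection" stored in RocksDB.
// SeekToFirst() has to skip over every tombstone left by earlier deletes
// before it reaches the first live key, so this step keeps getting slower
// as deletions accumulate -- and each Delete() below adds another tombstone.
int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::DB* raw = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/capped-demo", &raw);
    assert(s.ok());
    std::unique_ptr<rocksdb::DB> db(raw);

    std::unique_ptr<rocksdb::Iterator> it(
        db->NewIterator(rocksdb::ReadOptions()));
    it->SeekToFirst();  // step 1: pays the tombstone-scanning cost
    for (int i = 0; i < 10 && it->Valid(); ++i, it->Next()) {
        // step 2: delete a couple of elements starting from the first
        db->Delete(rocksdb::WriteOptions(), it->key());
    }
    return 0;
}
```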

I actually have an idea how to optimize this. Let me try it out and get back to you.

igorcanadi commented 9 years ago

Hey @tanguylebarzic, can you try the new mongorocks? I just pushed a patch that will speed up insertions into capped collections a lot: https://github.com/mongodb-partners/mongo/commit/49cfd2091746ad9ebbdd64430ee5c54b40bfb94e

https://github.com/mongodb-partners/mongo/commits/v3.0-mongorocks

Let me know if it helped.

tanguylebarzic commented 9 years ago

Hi @igorcanadi, I'm actually having a hard time building mongo with rocksdb now. What I did:

src/mongo/db/modules/rocks/src/rocks_init.cpp:33:38: fatal error: mongo/db/service_context.h: No such file or directory
 #include "mongo/db/service_context.h"

Indeed, there is no such file in the v3.0-mongorocks branch. There is one in the master branch of https://github.com/mongodb/mongo, but I can't build with that either (SCons error: no such option: --rocksdb, which was kinda expected).

I'll try to sort it out today or tomorrow.

igorcanadi commented 9 years ago

No need to build or clone https://github.com/mongodb-partners/mongo-rocks -- that separate repo is the model we're moving towards for our 3.2 release. The v3.0 release is fully contained in the v3.0-mongorocks branch. More detailed instructions here: http://blog.parse.com/announcements/mongodb-rocksdb-parse/

tanguylebarzic commented 9 years ago

Hi @igorcanadi,

This was what I'd done the first time, but for some reason I tried to install it differently the second time. Anyway, it's sorted out now. As for the CPU usage with the latest code: it helps a lot! The CPU usage pattern is a bit different from the other (normal mongo) slaves - a bit less CPU usage generally, with big spikes every six minutes or so - but it's definitely a lot better now, and the slave stays up-to-date without issues. I can continue evaluating rocksdb as a storage engine :) Thanks a lot for your work and comments!

igorcanadi commented 9 years ago

Great! The spikes are probably compactions. :) Let us know about the results!