Closed tanguylebarzic closed 9 years ago
Hey @tanguylebarzic! Thanks for trying out mongo-rocks! The behavior you observe is likely due to capped collections -- we don't handle them well (except the oplog capped collection, for which we have special optimizations in place).
Thanks for the quick response @igorcanadi! Do you think TTL collections would suffer from the same issue? Thanks!
TTL should be much better!
Will try that then, thanks!
@tanguylebarzic If you cherry-pick this commit: https://github.com/mongodb-partners/mongo/commit/4375146b5904e5949455659e504dfe058405da9b to your build, you'll see a big improvement. The problem is that mongo's higher level code requires the number of documents in capped collections to be strict (i.e. if you specify 100 documents, we're not allowed to have 101). This commit will fail the unit tests, but will work better because we don't have to delete one document for every insert we do.
@igorcanadi Thanks for the advice! I just tested this. It seems a bit better, but not by much (CPU is still at 100%, and the secondary still falls out of sync -- maybe a bit more slowly). I'll try switching to TTL collections at some point (not an easy change) and see if that behaves better.
Curious -- can you share the CPU profile when CPU is at 100%?
Sure! You mean running with --use-cpu-profiler, letting it run for a while, and capturing the profile? If so, I'll try it once the instance has caught up again -- probably tomorrow (it was too stale).
If you have a newer kernel I would recommend using perf events: https://perf.wiki.kernel.org/index.php/Tutorial. It's pretty neat.
Or you can just run sudo perf top to see the highest-cost CPU consumers.
Using perf top as suggested, here's the output (letting it run for a few minutes):
Samples: 576K of event 'cpu-clock', Event count (approx.): 30190718950
17,37% mongod [.] rocksdb::BlockIter::ParseNextKey()
14,31% libc-2.19.so [.] 0x0000000000160081
11,14% mongod [.] rocksdb::InternalKeyComparator::Compare(rocksdb::Slice const&, rocksdb::Slice const&) const
8,77% mongod [.] void std::__push_heap<__gnu_cxx::__normal_iterator<rocksdb::IteratorWrapper**, std::vector<rocksdb::IteratorWrapper*, std::allocator<rocksdb::IteratorWrapper*> > >, long, rocksdb::IteratorWrapper*, rocksd
7,37% mongod [.] rocksdb::(anonymous namespace)::BytewiseComparatorImpl::Compare(rocksdb::Slice const&, rocksdb::Slice const&) const
5,25% mongod [.] void std::__adjust_heap<__gnu_cxx::__normal_iterator<rocksdb::IteratorWrapper**, std::vector<rocksdb::IteratorWrapper*, std::allocator<rocksdb::IteratorWrapper*> > >, long, rocksdb::IteratorWrapper*, rock
4,87% mongod [.] rocksdb::MergingIterator::Next()
4,68% mongod [.] rocksdb::DBIter::FindNextUserEntryInternal(bool)
3,81% mongod [.] rocksdb::(anonymous namespace)::TwoLevelIterator::Next()
3,21% mongod [.] rocksdb::(anonymous namespace)::ShardedLRUCache::Lookup(rocksdb::Slice const&)
2,59% mongod [.] rocksdb::Block::NewIterator(rocksdb::Comparator const*, rocksdb::BlockIter*, bool)
1,53% mongod [.] rocksdb::(anonymous namespace)::TwoLevelIterator::SkipEmptyDataBlocksForward()
0,91% mongod [.] rocksdb::(anonymous namespace)::TwoLevelIterator::Valid() const
0,88% libpthread-2.19.so [.] pthread_mutex_unlock
0,67% mongod [.] rocksdb::(anonymous namespace)::TwoLevelIterator::key() const
0,64% mongod [.] rocksdb::BlockIter::Valid() const
0,60% mongod [.] rocksdb::MergingIterator::Valid() const
0,51% libpthread-2.19.so [.] pthread_mutex_lock
0,50% mongod [.] rocksdb::GetVarint32PtrFallback(char const*, char const*, unsigned int*)
0,46% mongod [.] rocksdb::BlockIter::Next()
0,41% mongod [.] rocksdb::MergingIterator::key() const
0,40% mongod [.] rocksdb::(anonymous namespace)::ShardedLRUCache::Release(rocksdb::Cache::Handle*)
0,40% mongod [.] rocksdb::BlockIter::key() const
0,39% mongod [.] memcmp@plt
0,36% mongod [.] rocksdb::MemTableIterator::key() const
0,33% mongod [.] rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)
0,29% mongod [.] operator new(unsigned long)
0,27% mongod [.] rocksdb::(anonymous namespace)::SkipListRep::Iterator::key() const
0,25% perf [.] 0x0000000000058579
0,23% mongod [.] operator delete(void*)
0,22% mongod [.] rocksdb::MemTable::KeyComparator::operator()(char const*, char const*) const
0,21% mongod [.] rocksdb::SkipList<char const*, rocksdb::MemTableRep::KeyComparator const&>::FindGreaterOrEqual(char const* const&, rocksdb::SkipList<char const*, rocksdb::MemTableRep::KeyComparator const&>::Node**) const
0,21% libstdc++.so.6.0.19 [.] std::string::_M_mutate(unsigned long, unsigned long, unsigned long)
0,21% mongod [.] rocksdb::Hash(char const*, unsigned long, unsigned int)
0,16% mongod [.] rocksdb::(anonymous namespace)::TwoLevelIterator::InitDataBlock()
0,12% mongod [.] memcpy@plt
0,11% mongod [.] rocksdb::BlockIter::BinarySeek(rocksdb::Slice const&, unsigned int, unsigned int, unsigned int*)
0,10% mongod [.] rocksdb::GetVarint64Ptr(char const*, char const*, unsigned long*)
0,10% mongod [.] rocksdb::ReadRecordFromWriteBatch(rocksdb::Slice*, char*, unsigned int*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*)
0,10% mongod [.] rocksdb::BlockIter::SeekToFirst()
0,09% mongod [.] rocksdb::BlockIter::status() const
0,08% mongod [.] rocksdb::(anonymous namespace)::GetCacheKey(char const*, unsigned long, rocksdb::BlockHandle const&, char*)
0,08% mongod [.] rocksdb::(anonymous namespace)::TwoLevelIterator::SetSecondLevelIterator(rocksdb::Iterator*)
0,07% mongod [.] rocksdb::WriteBatchEntryComparator::operator()(rocksdb::WriteBatchIndexEntry const*, rocksdb::WriteBatchIndexEntry const*) const
0,07% mongod [.] rocksdb::MemTableIterator::Next()
0,07% mongod [.] rocksdb::WriteBatchEntryComparator::CompareKey(unsigned int, rocksdb::Slice const&, rocksdb::Slice const&) const
0,07% mongod [.] rocksdb::BlockBasedTable::GetDataBlockFromCache(rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Cache*, rocksdb::Cache*, rocksdb::Statistics*, rocksdb::ReadOptions const&, rocksdb::BlockBasedTable:
0,07% mongod [.] rocksdb::BlockHandle::DecodeFrom(rocksdb::Slice*)
0,07% mongod [.] rocksdb::(anonymous namespace)::SkipListRep::Iterator::Next()
0,07% libstdc++.so.6.0.19 [.] std::string::assign(char const*, unsigned long)
0,07% mongod [.] mongo::BSONElement::size() const
0,06% mongod [.] rocksdb::(anonymous namespace)::GetEntryFromCache(rocksdb::Cache*, rocksdb::Slice const&, rocksdb::Tickers, rocksdb::Tickers, rocksdb::Statistics*)
0,06% libstdc++.so.6.0.19 [.] std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long)
0,06% mongod [.] rocksdb::ReadableWriteBatch::GetEntryFromDataOffset(unsigned long, rocksdb::WriteType*, rocksdb::Slice*, rocksdb::Slice*, rocksdb::Slice*) const
0,06% libc-2.19.so [.] strlen
0,05% mongod [.] rocksdb::BlockIter::~BlockIter()
0,05% mongod [.] rocksdb::Iterator::~Iterator()
0,05% [kernel] [k] _raw_spin_unlock_irqrestore
0,05% mongod [.] rocksdb::BlockBasedTable::BlockEntryIteratorState::NewSecondaryIterator(rocksdb::Slice const&)
0,04% mongod [.] MurmurHash3_x64_128(void const*, int, unsigned int, void*)
0,04% mongod [.] std::_Hashtable<mongo::StringData, std::pair<mongo::StringData const, std::pair<unsigned long, std::_List_iterator<std::pair<std::string, unsigned long> > > >, std::allocator<std::pair<mongo::StringData c
0,04% mongod [.] mongo::BSONObj::getField(mongo::StringData const&) const
0,04% [kernel] [k] finish_task_switch
Don't hesitate if you want more!
Yup, exactly what I expected, thanks. What happens here is that we need to loop through all of the tombstones at the beginning of the capped collection: to trim the front of the FIFO capped collection, we first have to seek to the start of its key range and scan forward until we find the oldest live document.
Because RocksDB is built on an LSM architecture, deletes are really just tombstone markers, so that scan has to step over a lot of tombstones before it reaches the first live element. The problem grows as we delete more and more data and accumulate more and more tombstones. Cassandra has the same problem: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
I actually have an idea how to optimize this. Let me try it out and get back to you.
Hey @tanguylebarzic, can you try the new mongorocks? I just pushed a patch that will speed up insertions into capped collections a lot: https://github.com/mongodb-partners/mongo/commit/49cfd2091746ad9ebbdd64430ee5c54b40bfb94e
https://github.com/mongodb-partners/mongo/commits/v3.0-mongorocks
Let me know if it helped.
Hi @igorcanadi, I'm actually having a hard time building mongo with rocksdb now; the build fails with:
src/mongo/db/modules/rocks/src/rocks_init.cpp:33:38: fatal error: mongo/db/service_context.h: No such file or directory
#include "mongo/db/service_context.h"
Indeed, there is no such file in the v3.0-mongorocks branch. It does exist in the master branch of https://github.com/mongodb/mongo, but I can't build with that either (SCons error: no such option: --rocksdb, which was kind of expected).
I'll try to sort it out today or tomorrow.
No need to build or clone https://github.com/mongodb-partners/mongo-rocks -- that's the model we're moving towards for our 3.2 release. The v3.0 release is all contained under the v3.0-mongorocks branch. More detailed instructions here: http://blog.parse.com/announcements/mongodb-rocksdb-parse/
Hi @igorcanadi,
That's what I'd done the first time, but for some reason I tried to install it differently the second time. Anyway, it's sorted out now. As for the CPU usage with the latest code: it helps a lot! The CPU usage pattern is a bit different from the other (normal mongo) slaves -- a bit less CPU usage overall, with big spikes every 6 minutes or so -- but it's definitely much better now, and the slave stays up to date without issues. I can continue evaluating RocksDB as a storage engine :) Thanks a lot for your work and comments!
Great! The spikes are probably compactions. :) Let us know about the results!
Hi,
Forgive me if this is not the correct place to post this. I'm trying mongo-rocks to see how it performs, and have set up a server with rocksdb as the storage engine to be a secondary of a replicaset. The initial sync took a while but finished successfully. However, now that the server is in the 'SECONDARY' state, it doesn't manage to stay in sync - the delay (difference between the optimeDate and the lastHeartbeat date) keeps growing. The CPU usage is always at close to 100%, while the other slaves in the replica (using mongodb 2.6.4, with the 'old' storage engine) stay up-to-date, with a CPU usage at around 13%. While I thought it could be due to compression (trading cpu cycles for storage), the difference just seems too big. All instances are on m3.large (2 cores) on AWS. The compression algorithm is snappy. The biggest collections are capped collections (don't know if it could have an impact). Happy to provide any additional information you may need!