intel / CacheLib

Pluggable in-process caching engine to build and scale high performance services
https://www.cachelib.org
Apache License 2.0

Stress workers get stuck forever #64

Closed · mita closed this issue 1 year ago

mita commented 1 year ago

When I run cachebench with my test config, stress workers get stuck forever.

test.json.txt
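
For reference, a cachebench config is a JSON file along these lines. This is only an illustrative sketch (my actual settings are in the attached test.json.txt; the field names are standard cachebench knobs, the values here are made up), run as ./cachebench --json_test_config test.json:

{
  "cache_config": {
    "cacheSizeMB": 8192
  },
  "test_config": {
    "numOps": 100000000,
    "numThreads": 36,
    "numKeys": 50000000,
    "getRatio": 0.8,
    "setRatio": 0.2
  }
}

numOps is a per-thread count in cachebench, so numOps x numThreads is what shows up as the "Total ... ops to be run" figure below.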

Total 3600.00M ops to be run
00:50:51       0.00M ops completed. Hit Ratio  -0.84%
00:51:51     703.34M ops completed. Hit Ratio  82.34%
00:52:51    1344.09M ops completed. Hit Ratio  85.85%
00:53:51    1923.65M ops completed. Hit Ratio  85.83%
00:54:51    2498.17M ops completed. Hit Ratio  85.83%
00:55:51    3033.92M ops completed. Hit Ratio  85.83%
00:56:51    3034.06M ops completed. Hit Ratio  85.75%
00:57:51    3034.06M ops completed. Hit Ratio   0.00%
00:58:51    3034.06M ops completed. Hit Ratio   0.00%

Backtrace of a stuck stress worker:

(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fd61bce4b8e in folly::detail::(anonymous namespace)::nativeFutexWaitImpl (waitMask=4294967295, absSteadyTime=0x0, absSystemTime=0x0, expected=4294967293, addr=0x7fd04801ffd8)
    at /home/mita/CacheLib-intel/cachelib/external/folly/folly/detail/Futex.cpp:126
#2  folly::detail::futexWaitImpl (futex=futex@entry=0x7fd04801ffd8, expected=expected@entry=4294967293, absSystemTime=absSystemTime@entry=0x0, absSteadyTime=absSteadyTime@entry=0x0, waitMask=waitMask@entry=4294967295)
    at /home/mita/CacheLib-intel/cachelib/external/folly/folly/detail/Futex.cpp:254
#3  0x00007fd61bde2896 in folly::detail::futexWait<std::atomic<unsigned int> > (waitMask=4294967295, expected=4294967293, futex=0x7fd04801ffd8) at /home/mita/CacheLib-intel/cachelib/external/folly/folly/detail/Futex-inl.h:94
#4  folly::detail::MemoryIdler::futexWait<std::atomic<unsigned int>, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (expected=4294967293, waitMask=4294967295, stackToRetain=1024, timeoutVariationFrac=0.5,
    idleTimeout=..., fut=...) at /home/mita/CacheLib-intel/cachelib/external/folly/folly/detail/MemoryIdler.h:126
#5  folly::fibers::Baton::waitThread (this=0x7fd04801ffd8) at /home/mita/CacheLib-intel/cachelib/external/folly/folly/fibers/Baton.cpp:70
#6  0x00007fd61bde2b40 in folly::fibers::Baton::wait<folly::fibers::Baton::wait()::<lambda()> > (mainContextFunc=..., this=0x7fd04801ffd8) at /home/mita/CacheLib-intel/cachelib/external/folly/folly/fibers/Baton.h:46
#7  folly::fibers::Baton::wait (this=0x7fd04801ffd8) at /home/mita/CacheLib-intel/cachelib/external/folly/folly/fibers/Baton.cpp:46
#8  0x0000562ea64a1b95 in facebook::cachelib::detail::ReadHandleImpl<facebook::cachelib::CacheItem<facebook::cachelib::LruCacheTrait> >::ItemWaitContext::wait (this=<optimized out>) at /usr/include/c++/9/bits/atomic_base.h:734
#9  0x0000562ea6602938 in facebook::cachelib::detail::ReadHandleImpl<facebook::cachelib::CacheItem<facebook::cachelib::LruCacheTrait> >::ItemWaitContext::get (this=0x7fd04801ffd0)
    at /home/mita/CacheLib-intel/cachelib/../cachelib/allocator/Handle.h:260
#10 facebook::cachelib::detail::ReadHandleImpl<facebook::cachelib::CacheItem<facebook::cachelib::LruCacheTrait> >::getInternal (this=0x7fd06b7f4c50) at /home/mita/CacheLib-intel/cachelib/../cachelib/allocator/Handle.h:251
#11 facebook::cachelib::detail::ReadHandleImpl<facebook::cachelib::CacheItem<facebook::cachelib::LruCacheTrait> >::get (this=0x7fd06b7f4c50) at /home/mita/CacheLib-intel/cachelib/../cachelib/allocator/Handle.h:166
#12 facebook::cachelib::detail::ReadHandleImpl<facebook::cachelib::CacheItem<facebook::cachelib::LruCacheTrait> >::operator bool (this=0x7fd06b7f4c50) at /home/mita/CacheLib-intel/cachelib/../cachelib/allocator/Handle.h:156
#13 facebook::cachelib::CacheAllocator<facebook::cachelib::LruCacheTrait>::findFastInternal (mode=facebook::cachelib::AccessMode::kRead, key=..., this=<optimized out>)
    at /home/mita/CacheLib-intel/cachelib/../cachelib/allocator/CacheAllocator-inl.h:2132
#14 facebook::cachelib::CacheAllocator<facebook::cachelib::LruCacheTrait>::findImpl (mode=facebook::cachelib::AccessMode::kRead, key=..., this=<optimized out>)
    at /home/mita/CacheLib-intel/cachelib/../cachelib/allocator/CacheAllocator-inl.h:2184
#15 facebook::cachelib::CacheAllocator<facebook::cachelib::LruCacheTrait>::find (this=0x562ea7ce2350, key=...) at /home/mita/CacheLib-intel/cachelib/../cachelib/allocator/CacheAllocator-inl.h:2242
#16 0x0000562ea64aa425 in facebook::cachelib::cachebench::Cache<facebook::cachelib::CacheAllocator<facebook::cachelib::LruCacheTrait> >::find(facebook::cachelib::KAllocation::Key)::{lambda()#1}::operator()() const (
    this=0x562ea7cdb600) at /usr/include/c++/9/bits/unique_ptr.h:360
#17 0x0000562ea64c2e8d in facebook::cachelib::cachebench::Cache<facebook::cachelib::CacheAllocator<facebook::cachelib::LruCacheTrait> >::find (this=<optimized out>, key=...)
    at /home/mita/CacheLib-intel/cachelib/../cachelib/allocator/Handle.h:475
#18 0x0000562ea64c4d9a in facebook::cachelib::cachebench::CacheStressor<facebook::cachelib::CacheAllocator<facebook::cachelib::LruCacheTrait> >::stressByDiscreteDistribution (this=<optimized out>, stats=...)
    at /usr/include/c++/9/bits/basic_string.h:2316
#19 0x00007fd61b90cde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007fd61ba20609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#21 0x00007fd61b5f7133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
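
The worker is parked in folly::fibers::Baton::wait() inside the read handle's ItemWaitContext (frames #5 through #9): find() returned a handle for an item that is still in flight, and the reader blocks until whichever thread owns the item posts the baton. A minimal sketch of that handshake pattern (not CacheLib's actual code, just the folly primitive involved; the owner thread body is a stand-in):

#include <folly/fibers/Baton.h>
#include <thread>

int main() {
  folly::fibers::Baton baton;

  // Stand-in for the thread that owns the in-flight item (e.g. one
  // moving it between tiers). Every exit path must reach post(); any
  // path that skips it leaves the waiter blocked forever, which is
  // exactly the state the backtrace shows.
  std::thread owner([&] {
    // ... finish moving/filling the item here ...
    baton.post(); // wake the waiting reader
  });

  baton.wait(); // corresponds to frame #7 above: blocks until post()
  owner.join();
  return 0;
}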

FWIW, I can't reproduce this problem without commit 68e66394 ("Create token before marking item as exclusive"), so that change appears to be what exposes the race.

byrnedj commented 1 year ago

Hi, thanks for reporting this. And thanks for posting the config!

We spent the last few weeks chasing this race condition down. It should now be fixed by f4e30a7f3c878bc119577a11d2ca69c72d41a925 and 08bf0b4e8510eb1646ccad057e6209b937b6fe89, which have been merged into our develop branch.

Let us know if you are still hitting this deadlock. Also, if you don't mind sharing: are you experimenting with tiered memory? If so, what does your environment look like?

mita commented 1 year ago

Thank you. The problem no longer occurs. We are planning to evaluate CacheLib with a tiered memory system, but we have not set up an evaluation machine yet.