The current sqld code misuses the tokio runtime in a way that can lead to a complete deadlock of the server.
One issue, which was triggered fairly frequently, relies on the following facts about sqld internals:
1. Many async request handlers in sqld check that the metadata contains the namespace while holding the sync lock metadata.inner.configs (see the MetaStore::exists method).
2. The metastore remove operation takes the metadata.inner.configs lock first and then tries to take the inner.conn lock.
3. The background checkpoint operation, which runs on a blocking thread, holds the inner.conn lock for the duration of the checkpoint.
So, if (2) takes the metadata.inner.configs lock while (3) is in progress, every request task that hits (1) blocks on the parking_lot mutex until (3) releases the inner.conn lock and (2) finishes. Because a parking_lot mutex blocks the whole thread rather than yielding to the runtime, this can easily occupy all tokio worker threads and lead to a complete server deadlock.
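The chain of waits described above can be reproduced in a small sketch. This is not the sqld code: the struct layout, the function names, and the namespace "ns1" are hypothetical, and std::sync::Mutex stands in for parking_lot::Mutex so the example needs no external crates. A try_lock probe is used in place of a real request handler so the sketch reports the blocked state instead of hanging.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the metastore state; field names follow the PR text.
struct MetaStoreInner {
    configs: Mutex<Vec<String>>,
    conn: Mutex<()>,
}

// Probe for (1): reports whether a handler calling `exists` would block right now.
fn exists_would_block(inner: &MetaStoreInner) -> bool {
    inner.configs.try_lock().is_err()
}

// Returns (blocked while the checkpoint runs, blocked after it finishes).
fn simulate_old_order() -> (bool, bool) {
    let inner = Arc::new(MetaStoreInner {
        configs: Mutex::new(vec!["ns1".to_string()]),
        conn: Mutex::new(()),
    });

    // (3) The checkpoint holds the conn lock for its whole duration.
    let checkpoint_guard = inner.conn.lock().unwrap();

    // (2) Remove with the old order: configs first, then conn.
    let (tx, rx) = mpsc::channel();
    let remover = {
        let inner = Arc::clone(&inner);
        thread::spawn(move || {
            let mut configs = inner.configs.lock().unwrap();
            tx.send(()).unwrap(); // signal: configs is now held
            let _conn = inner.conn.lock().unwrap(); // waits for the checkpoint
            configs.retain(|n| n != "ns1");
        })
    };

    // Wait until the remover holds configs and is about to wait on conn.
    rx.recv().unwrap();
    let blocked_during = exists_would_block(&inner);

    // The checkpoint finishes; the remover can now complete and release configs.
    drop(checkpoint_guard);
    remover.join().unwrap();
    let blocked_after = exists_would_block(&inner);

    (blocked_during, blocked_after)
}
```

In the real server each blocked handler would occupy a tokio worker thread for as long as the checkpoint runs, which is what exhausts the runtime.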
This PR mitigates this specific scenario by introducing two fixes:
1. MetaStoreInner.configs and MetaStoreInner.conn now use tokio::sync::Mutex instead of parking_lot::Mutex. This prevents the scenario above: async tasks now wait on an async lock, so the runtime can move them off the worker threads and schedule other work there.
2. The metastore remove operation now takes the inner.conn lock first and then the inner.configs lock. This prevents the remove operation from holding the inner.configs lock for too long while it waits for the inner.conn lock. Because the configs lock is used in both async and sync contexts and is also required for quick checks in the metastore, it is better not to hold it for long.
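The effect of the second fix (the new lock order) can be sketched as follows. As before, this is an illustration under assumptions rather than the actual sqld implementation: the types, function names, and the namespace "ns1" are hypothetical, and std::sync::Mutex is used as a stand-in so the example has no dependencies. The switch to tokio::sync::Mutex from the first fix is not shown here.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the metastore state; field names follow the PR text.
struct MetaStoreInner {
    configs: Mutex<Vec<String>>,
    conn: Mutex<()>,
}

// Quick check: returns None if the configs lock is currently held by someone else.
fn try_exists(inner: &MetaStoreInner, ns: &str) -> Option<bool> {
    inner
        .configs
        .try_lock()
        .ok()
        .map(|configs| configs.iter().any(|n| n == ns))
}

// New order: wait for conn first, and only then take configs briefly.
fn remove(inner: &MetaStoreInner, ns: &str) {
    let _conn = inner.conn.lock().unwrap();
    let mut configs = inner.configs.lock().unwrap();
    configs.retain(|n| n != ns);
}

// Returns the result of the quick check during and after the checkpoint.
fn simulate_new_order() -> (Option<bool>, Option<bool>) {
    let inner = Arc::new(MetaStoreInner {
        configs: Mutex::new(vec!["ns1".to_string()]),
        conn: Mutex::new(()),
    });

    // The checkpoint holds the conn lock.
    let checkpoint_guard = inner.conn.lock().unwrap();

    // The remove operation starts and waits on conn without touching configs.
    let (tx, rx) = mpsc::channel();
    let remover = {
        let inner = Arc::clone(&inner);
        thread::spawn(move || {
            tx.send(()).unwrap(); // signal: remove is starting
            remove(&inner, "ns1");
        })
    };
    rx.recv().unwrap();

    // configs is free while remove waits, so the quick check is not blocked.
    let during = try_exists(&inner, "ns1");

    // The checkpoint finishes; remove proceeds and deletes the namespace.
    drop(checkpoint_guard);
    remover.join().unwrap();
    let after = try_exists(&inner, "ns1");

    (during, after)
}
```

With this order the configs lock is held only for the short removal itself, never across the wait for the checkpoint.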