Currently the `WriteBufferManager` is a passive entity that can be queried for the state of the memory quota, leaving it to the caller to act on the answer. In the case of flushes, this means that the `WriteBufferManager` exposes a `ShouldFlush()` method that writers call during the write flow, triggering flushes for the DB the writer is writing into. The algorithm for choosing what to flush in that database is rudimentary: it either does an atomic flush (of all CFs) if needed, or chooses the CF that holds the oldest data. This means that in a multi-DB scenario, only the active DB will be the one flushing, even if most of the memtable memory is held by inactive databases. In a multi-CF scenario without atomic flush, the current algorithm may not help much, because the CF that holds the oldest data isn't necessarily using much memory.
We can fix this behavior by making the `WriteBufferManager` aware of the databases that depend on it for quota management and letting it proactively initiate flushes as needed. This requires that each database register itself for information extraction and flush requests on DB open, and unregister on close. The information extraction is needed so that the `WriteBufferManager` can choose the most suitable DB to send a flush request to (the one that will release the most memory), and the flush triggering is what allows that memory to actually be released.
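A minimal sketch of what that registration interface could look like (all names here are hypothetical, not existing RocksDB API):

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <utility>

// Hypothetical extension of the WriteBufferManager (sketch only; none of
// these members exist in the current API, and the names are made up).
class WriteBufferManager {
 public:
  // Callbacks a DB registers so the WBM can query its memory usage and
  // request a flush from it.
  struct RegisteredDB {
    std::function<size_t()> mutable_mem_usage;    // active memtables
    std::function<size_t()> immutable_mem_usage;  // sealed memtables awaiting flush
    std::function<void()> request_flush;          // ask the DB to initiate a flush
  };

  // Called on DB open.
  void RegisterDB(void* db_handle, RegisteredDB callbacks) {
    std::lock_guard<std::mutex> lock(dbs_mutex_);
    dbs_[db_handle] = std::move(callbacks);
  }

  // Called on DB close.
  void DeregisterDB(void* db_handle) {
    std::lock_guard<std::mutex> lock(dbs_mutex_);
    dbs_.erase(db_handle);
  }

 protected:
  std::mutex dbs_mutex_;
  std::map<void*, RegisteredDB> dbs_;
};
```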
The triggering of flushes should happen once total memory usage exceeds a certain threshold (the specifics can be determined later; currently we see two options: either trigger flushes once we start delaying writes (#114), or do this periodically, e.g. on every 25% increase in used memory). The `WriteBufferManager` should then iterate over the registered databases, query them for the amount of memory used (broken down into immutable and mutable memory), and choose the database with the most potential for memory reduction. (In a multi-CF database this may not be ideal: with many small CFs we will not release as much memory as we're hoping to, so we may need to break that information down further by CF in order to let the WBM make a better choice.)
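A rough sketch of that selection pass, assuming the hypothetical registration interface above and a trigger threshold expressed as a fraction of the quota (`memory_usage()` and `buffer_size()` are existing WBM accessors; everything else is illustrative):

```cpp
// Sketch: once usage crosses a threshold, pick the registered DB with the
// most memory to reclaim and send it a flush request. Builds on the
// hypothetical RegisterDB() sketch above.
void WriteBufferManager::MaybeInitiateFlushes() {
  constexpr double kFlushTriggerRatio = 0.9;  // placeholder; see the #114 discussion
  if (memory_usage() < buffer_size() * kFlushTriggerRatio) {
    return;
  }
  std::lock_guard<std::mutex> lock(dbs_mutex_);
  void* best_db = nullptr;
  size_t best_reclaimable = 0;
  for (auto& [db, cb] : dbs_) {
    // Mutable + immutable memory is an upper bound on what flushing can free.
    const size_t reclaimable =
        cb.mutable_mem_usage() + cb.immutable_mem_usage();
    if (reclaimable > best_reclaimable) {
      best_reclaimable = reclaimable;
      best_db = db;
    }
  }
  if (best_db != nullptr) {
    dbs_[best_db].request_flush();
  }
}
```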
On the database side, when a flush request is received, in the case of atomic flush the behaviour will be exactly as it is today with respect to choosing the CFs to flush. However, if we're not doing an atomic flush, the logic should be changed to choose the CF with the most memory to free (there's no point in choosing the oldest one, since that criterion is only relevant when the WAL size limit is reached and is already applied there). Additionally, depending on complexity, we may want to skip switching active memtables on the chosen CFs and just request to flush the immutable ones, if flushing those would free enough memory (if we go that route, we should probably change the logic on the WBM side to choose the DB with the most potential for freeing immutable memory first, before we go hunting down mutable memory as well).
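The non-atomic-flush selection could look roughly like this (a sketch; the memtable accessors follow existing RocksDB internals, but the function itself is hypothetical):

```cpp
// Sketch: instead of the CF holding the oldest data, pick the CF whose
// memtables hold the most memory. Hypothetical helper on DBImpl.
ColumnFamilyData* DBImpl::PickCFWithMostMemoryToFree() {
  mutex_.AssertHeld();
  ColumnFamilyData* best_cfd = nullptr;
  size_t best_usage = 0;
  for (ColumnFamilyData* cfd : *versions_->GetColumnFamilySet()) {
    if (cfd->IsDropped()) {
      continue;
    }
    // Active memtable plus immutable memtables not yet picked for flush.
    const size_t usage =
        cfd->mem()->ApproximateMemoryUsage() +
        cfd->imm()->ApproximateUnflushedMemTablesMemoryUsage();
    if (usage > best_usage) {
      best_usage = usage;
      best_cfd = cfd;
    }
  }
  return best_cfd;
}
```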
Note that adding data to memtables (and by extension, calling into the WBM to reserve memory) isn't done under the DB mutex, so we'll need to lock it when triggering flushes (the current code that queries the WBM and triggers flushes runs as part of `DBImpl::PreprocessWrite()` while the mutex is already held).
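A sketch of what the DB-side handler for a WBM-initiated request might look like, taking the mutex itself (the entry point name is hypothetical; `FlushMemTable()`, `FlushOptions` and `FlushReason` exist internally, but their exact shape differs between RocksDB versions):

```cpp
// Sketch: the flush-request callback a DB would register with the WBM.
// Unlike the write-path ShouldFlush() check, this runs outside the write
// flow, so any DB-mutex-protected work has to take the mutex itself.
void DBImpl::HandleWBMFlushRequest() {
  ColumnFamilyData* cfd = nullptr;
  {
    InstrumentedMutexLock l(&mutex_);   // CF selection needs the DB mutex
    cfd = PickCFWithMostMemoryToFree();  // sketch above
    // A real implementation would also need to Ref() the CF here so it
    // cannot be destroyed once the mutex is released.
  }
  if (cfd == nullptr) {
    return;
  }
  // FlushMemTable() re-acquires the mutex internally, switches the active
  // memtable and schedules the flush, so it is called without the mutex held.
  FlushOptions fo;
  fo.wait = false;
  fo.allow_write_stall = true;
  FlushMemTable(cfd, fo, FlushReason::kWriteBufferManager)
      .PermitUncheckedError();
}
```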
Additionally, we may want to expose in the information query interface the distinction between memory that has already been marked for flush (and therefore cannot be freed by triggering another flush) and immutable memory that hasn't been marked yet (#113), and take that into account when choosing a DB. We may also forgo choosing a DB altogether if, e.g., more than 50% of the total memory is already marked for flush, since adding another flush job wouldn't necessarily help in that case.
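For example, the per-DB report could be broken down along these lines (a hypothetical struct and helper, not existing API):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical breakdown a DB could report to the WBM. Only memory that is
// not yet marked for flush can be freed by issuing a new flush request.
struct DBMemoryReport {
  size_t mutable_bytes = 0;             // active memtables
  size_t immutable_unmarked_bytes = 0;  // sealed, not yet picked for flush
  size_t immutable_marked_bytes = 0;    // already picked / being flushed

  size_t Reclaimable() const {
    return mutable_bytes + immutable_unmarked_bytes;
  }
};

// Sketch of the "don't bother" heuristic: skip initiating a flush if more
// than half the total tracked memory is already marked for flush.
inline bool MostMemoryAlreadyMarked(const std::vector<DBMemoryReport>& reports,
                                    size_t total_usage) {
  size_t marked = 0;
  for (const auto& r : reports) {
    marked += r.immutable_marked_bytes;
  }
  return total_usage > 0 && marked * 2 > total_usage;
}
```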
Lastly, we may need a notification mechanism that lets the WBM know when a flush request has actually been processed (picked from the flush request queue, with memtables picked), so that we can mark databases for which a flush has already been requested and not request another one until that request has been processed. Under memory pressure we would then simply choose another database, instead of sending another request based on the intermediate state between requesting the flush and it being processed.
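One possible shape for that bookkeeping, again with hypothetical names: the WBM marks a DB when it sends it a flush request and only clears the mark when the DB reports that the request was actually picked up:

```cpp
#include <mutex>
#include <set>

// Hypothetical addition to the WBM sketch above: a DB with an outstanding
// flush request is skipped by the selection pass until the request has been
// picked from the flush queue and its memtables have been picked.
class FlushRequestTracker {
 public:
  // Returns false if a request for this DB is already pending, in which case
  // the WBM should move on to the next candidate DB.
  bool TryMarkFlushRequested(void* db) {
    std::lock_guard<std::mutex> lock(mu_);
    return pending_.insert(db).second;
  }

  // Called back by the DB once the background flush has picked memtables for
  // the earlier request, making the DB a candidate again.
  void OnFlushRequestProcessed(void* db) {
    std::lock_guard<std::mutex> lock(mu_);
    pending_.erase(db);
  }

 private:
  std::mutex mu_;
  std::set<void*> pending_;
};
```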