This PR provides a number of changes to improve handling of pressure within the kv_index_tictactree, especially in how it can signal back-pressure to the calling application to reduce workloads.
In volume tests on an 8-node cluster with >2TB per node, if tictacaae is enabled for the first time in parallel mode while the cluster is under load and subject to deliberate node failures, then vnode crashes may be seen. In particular these relate to memory leaks caused by overlapping queries, or to timeouts on sync calls within the kv_index_tictactree process hierarchy.
There are a number of changes merged in here to alleviate this:
A fixed limit on the number of fetch-clocks queries that may be queued concurrently, to prevent memory exhaustion due to snapshots made for queued queries (for which the calling process will inevitably have timed out before the query is run).
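The bounded-queue behaviour above can be sketched as follows. This is an illustrative Python sketch of the general pattern, not the Erlang implementation; the names `QueryQueue` and `MAX_QUEUED` are assumptions, not identifiers from the codebase.

```python
from collections import deque

MAX_QUEUED = 16  # hypothetical cap on queued fetch-clocks queries


class QueryQueue:
    """Bounded query queue: reject new work rather than take a snapshot
    for a query whose caller will have timed out before it runs."""

    def __init__(self, limit=MAX_QUEUED):
        self.limit = limit
        self.pending = deque()

    def submit(self, query):
        if len(self.pending) >= self.limit:
            return "busy"  # signal back-pressure to the caller instead of queueing
        self.pending.append(query)  # no snapshot taken until the query is run
        return "queued"

    def run_next(self):
        return self.pending.popleft() if self.pending else None
```

The key design point is that the snapshot is only made when a query is dequeued to run, so a rejected or abandoned query never holds backend resources.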
Make the aae_keystore respect the "pause" response from the leveled backend when it has a work backlog, and eventually feed back that pause to callers if necessary (e.g. through a slowed aae_ping).
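The pause-propagation idea can be illustrated with a minimal sketch, assuming a backend put function that returns "pause" when it has a work backlog. The class and method names here are hypothetical; only the "pause" response convention comes from the description above.

```python
import time


class KeyStore:
    """Sketch: when the backend reports a backlog via a 'pause' reply,
    the store pauses its own puts and slows its ping replies, feeding
    the back-pressure upward so callers reduce their workload."""

    def __init__(self, backend_put, pause_s=0.1):
        self.backend_put = backend_put
        self.pause_s = pause_s
        self.paused = False

    def put(self, key, value):
        reply = self.backend_put(key, value)
        self.paused = (reply == "pause")  # remember the backlog signal
        if self.paused:
            time.sleep(self.pause_s)  # absorb some of the pause locally

    def ping(self):
        if self.paused:
            time.sleep(self.pause_s)  # slowed ping signals pressure to callers
        return "pong"
```

A slowed ping is a deliberately coarse signal: callers that gate work on ping latency will back off without needing a new protocol message.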
Complete the load of the change queue built up during a keystore rebuild in a yielding loop, whilst still in the loading state, rather than sync-waiting while the whole change queue is forced into the backend, which risks timeouts and the growth of large bookie memories.
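The yielding-loop approach can be sketched as below, using a Python generator to stand in for the process yielding control between batches; the function name and batch size are illustrative assumptions.

```python
def load_change_queue(change_queue, backend_put, batch_size=32):
    """Drain the rebuild change queue in small batches, yielding
    control between batches, instead of one long synchronous push
    of the whole queue into the backend."""
    while change_queue:
        batch, change_queue = change_queue[:batch_size], change_queue[batch_size:]
        for key, value in batch:
            backend_put(key, value)
        yield len(change_queue)  # yield point: remaining work count
```

Because the loop yields after each batch, other requests can be served while loading continues, and no single call has to wait on the full queue length.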
Allow the transition from rebuilding the store to rebuilding the trees to wait until the loading state has cleared.
Allow tree rebuild requests to be queued prior to the snapshot being taken, so as not to retain long-lived snapshots for tree rebuilds, which may otherwise hit snapshot timeout limits.
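The deferred-snapshot queueing above can be sketched as follows; `RebuildScheduler` and `take_snapshot` are assumed names for illustration only.

```python
class RebuildScheduler:
    """Sketch: queue rebuild requests without taking a snapshot, and
    take the snapshot just-in-time when the rebuild is dequeued, so
    snapshots stay short-lived and avoid snapshot timeout limits."""

    def __init__(self, take_snapshot):
        self.take_snapshot = take_snapshot  # assumed callback into the backend
        self.requests = []

    def request_rebuild(self, partition):
        self.requests.append(partition)  # queued with no snapshot held

    def run_next(self):
        if not self.requests:
            return None
        partition = self.requests.pop(0)
        snapshot = self.take_snapshot(partition)  # snapshot made only now
        return (partition, snapshot)
```

The queue therefore holds only lightweight request records; a snapshot exists only for the rebuild currently running.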