risinglightdb / risinglight

An educational OLAP database system.
Apache License 2.0

storage: use `waitmap` for cache #436

Open skyzh opened 2 years ago

skyzh commented 2 years ago

Currently, we are using moka as our block cache. But using such a library poses several problems:

Therefore, we should build our own block cache layer to avoid the problems.

Recently, I came across https://boats.gitlab.io/blog/post/waitmap/, which seems able to solve (2). With some modification to the block struct, we might also be able to solve (1).
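The waitmap idea (a map whose readers can wait until some other task inserts the key) is essentially single-flight loading. Below is a minimal, blocking sketch of that idea using only the standard library; all names here (`WaitCache`, `get_or_load`) are hypothetical, and the real `waitmap` crate is async rather than Condvar-based:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// A cache where a missing key is loaded exactly once: the first thread to
/// miss marks the key as in-flight and loads it; concurrent threads block on
/// a condvar until the value appears instead of loading it again.
struct WaitCache {
    map: Mutex<HashMap<u64, Option<Arc<Vec<u8>>>>>, // None = load in flight
    cond: Condvar,
}

impl WaitCache {
    fn new() -> Self {
        WaitCache {
            map: Mutex::new(HashMap::new()),
            cond: Condvar::new(),
        }
    }

    fn get_or_load(&self, key: u64, load: impl FnOnce() -> Vec<u8>) -> Arc<Vec<u8>> {
        let mut guard = self.map.lock().unwrap();
        loop {
            match guard.get(&key).cloned() {
                // Cached: fast path.
                Some(Some(v)) => return v,
                // Another thread is loading this key; wait for it.
                Some(None) => guard = self.cond.wait(guard).unwrap(),
                // We are the loader: mark the key in-flight, release the
                // lock while loading, then publish and wake the waiters.
                None => {
                    guard.insert(key, None);
                    drop(guard);
                    let v = Arc::new(load());
                    self.map.lock().unwrap().insert(key, Some(v.clone()));
                    self.cond.notify_all();
                    return v;
                }
            }
        }
    }
}
```

A real block cache would also need eviction and async waiting, but the exactly-once load is the part that fixes the duplicated fetches described in (2).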

wangrunji0408 commented 2 years ago

For problem (1), would Weak be helpful?
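The full statement of problem (1) is not quoted in this thread, but if it concerns the cache keeping blocks alive (or needing to know when a block is no longer used), `Weak` could look roughly like this sketch (all names hypothetical): the cache stores `Weak` handles, so a block's memory is released as soon as the last reader drops its `Arc`, and a later `get` simply observes and prunes the dead entry.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Weak};

/// Hypothetical cache that does not own its blocks: it holds Weak handles,
/// so a block is freed once the last reader drops its Arc.
struct WeakBlockCache {
    blocks: HashMap<u64, Weak<Vec<u8>>>,
}

impl WeakBlockCache {
    fn new() -> Self {
        WeakBlockCache { blocks: HashMap::new() }
    }

    fn insert(&mut self, id: u64, block: &Arc<Vec<u8>>) {
        self.blocks.insert(id, Arc::downgrade(block));
    }

    fn get(&mut self, id: u64) -> Option<Arc<Vec<u8>>> {
        match self.blocks.get(&id).and_then(|w| w.upgrade()) {
            Some(block) => Some(block),
            None => {
                // The block was dropped (or never cached); prune the entry.
                self.blocks.remove(&id);
                None
            }
        }
    }
}
```

The trade-off is that such a cache provides no retention at all: an unreferenced block vanishes immediately, so it would likely be combined with a strong-reference layer (e.g. an LRU of `Arc`s) in practice.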

jiangzhe commented 2 years ago

I think the cache layer is similar to the concept of a buffer pool in a traditional database. The techniques used to build a buffer pool can be used here, too: acquire, pin, and replace.

* **Acquire**: a thread must ask the buffer pool for the block, and the buffer pool takes care of reading it from disk.
* **Pin**: the buffer pool always increases the pin count on a block before returning it to the thread, and the thread must unpin it after the read finishes.
* **Replace**: the buffer pool can use a clock or LRU strategy to find a proper block to evict when memory is insufficient. Any block with a pin count greater than 0 cannot be evicted. In read-only mode eviction is very simple; otherwise, a dirty block must be persisted to disk before eviction.

The CMU database group has an excellent online course on YouTube about this topic.
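The pin/replace protocol above can be sketched with a clock replacer. This is a minimal illustration (hypothetical names, not RisingLight code): a frame with a nonzero pin count is never chosen as a victim, and the hand gives each unpinned frame a second chance by clearing its reference bit on the first pass.

```rust
/// One slot in the pool. `referenced` is the clock's "second chance" bit.
struct Frame {
    block_id: u64,
    pin_count: u32,
    referenced: bool,
}

struct ClockPool {
    frames: Vec<Frame>,
    hand: usize,
}

impl ClockPool {
    fn pin(&mut self, idx: usize) {
        self.frames[idx].pin_count += 1;
        self.frames[idx].referenced = true;
    }

    fn unpin(&mut self, idx: usize) {
        self.frames[idx].pin_count -= 1;
    }

    /// Find a frame to evict; returns None if every frame is pinned.
    fn find_victim(&mut self) -> Option<usize> {
        let n = self.frames.len();
        // Two sweeps suffice: the first clears reference bits, the second
        // must then find any unpinned frame.
        for _ in 0..2 * n {
            let i = self.hand;
            self.hand = (self.hand + 1) % n;
            let frame = &mut self.frames[i];
            if frame.pin_count > 0 {
                continue; // pinned frames are never evicted
            }
            if frame.referenced {
                frame.referenced = false; // second chance
            } else {
                return Some(i);
            }
        }
        None // all frames pinned
    }
}
```

A writable pool would additionally check a dirty bit on the victim and flush it to disk before reuse, as described above.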

skyzh commented 2 years ago

Exactly! The block cache in an LSM system is the same as a buffer pool. But it's hard to implement a buffer pool in Rust the same way as in C++ (like BusTub). The two challenges are:

* How to make an async buffer pool? `pin` and `unpin` might wait for a buffer slot to become available. It seems nontrivial to implement a high-performance one in Rust.
* How to represent the ownership of one slot in the buffer pool? If we take a `&[u8]` from the buffer pool, everything using the pool needs a lifetime in Rust.

jiangzhe commented 2 years ago

> Exactly! The block cache in an LSM system is the same as a buffer pool. But it's hard to implement a buffer pool in Rust the same way as in C++ (like BusTub). The two challenges are:
>
> * How to make an async buffer pool? `pin` and `unpin` might wait for a buffer slot to become available. It seems nontrivial to implement a high-performance one in Rust.
> * How to represent the ownership of one slot in the buffer pool? If we take a `&[u8]` from the buffer pool, everything using the pool needs a lifetime in Rust.

To represent a block/page, we can use `Arc<RwLock<PageImpl>>`:

```rust
struct Page(Arc<RwLock<PageImpl>>);

struct PageImpl {
    data: [u8; PAGE_SIZE as usize],
    pin_count: u32,
    dirty: bool,
}
```
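One Rust-specific refinement of this struct (a hedged sketch, not part of the proposal above): the "thread must unpin after use" rule can be enforced with an RAII guard whose `Drop` decrements `pin_count`, so a pin can never leak. `PinGuard` is a hypothetical name, and `Vec<u8>` stands in for the fixed-size data array.

```rust
use std::sync::{Arc, RwLock};

struct PageImpl {
    data: Vec<u8>, // stand-in for [u8; PAGE_SIZE]
    pin_count: u32,
    dirty: bool,
}

/// RAII pin: pinning happens on construction, unpinning on Drop,
/// so a caller cannot forget to unpin.
struct PinGuard(Arc<RwLock<PageImpl>>);

impl PinGuard {
    fn pin(page: Arc<RwLock<PageImpl>>) -> Self {
        page.write().unwrap().pin_count += 1;
        PinGuard(page)
    }
}

impl Drop for PinGuard {
    fn drop(&mut self) {
        self.0.write().unwrap().pin_count -= 1;
    }
}
```

With this, "pinned" simply means "at least one `PinGuard` is alive", which the replacer can check via `pin_count`.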

To design an async buffer pool, my thought is to:

1. Design an async interface that returns a channel-like receiver; the caller polls the receiver to get the actual page.
2. Keep a map of in-flight read requests, e.g. mappings of `BlockID` to `Vec<Sender<Page>>`.
3. For each request, create a channel-like structure, add a callback to the disk IO that proxies the result to the `Sender`, and return the `Receiver` to the caller so that the caller can `recv().await` it:

```rust
enum PageResp {
    Ready(Page),
    Wait(Receiver<Page>),
}
```

```rust
impl BufferPool {
    fn fetch_page(&self, page_id: PageID) -> PageResp {
        // First check if the page is in memory; if yes, just return it.
        if let Some(page) = self.page_in_mem(page_id) {
            return PageResp::Ready(page);
        }
        // Create a one-time channel for the async operation.
        // The caller can use rx.recv().await to wait for the IO to finish.
        let (mut tx, rx) = onetime_channel();
        loop {
            match self.merge_sender_to_inflight(page_id, tx) {
                Ok(()) => {
                    // The sender was added to the inflight list, so the
                    // background task will send the result; just return rx.
                    return PageResp::Wait(rx);
                }
                Err(tx_fail_merge) => {
                    // No inflight request for this page_id.
                    match self.insert_to_inflight(page_id, vec![tx_fail_merge]) {
                        Ok(()) => {
                            // Start a background task to await the disk
                            // manager IO and send the result to all callers.
                            runtime::spawn(async {
                                let page = self.disk_manager.fetch_page().await;
                                let senders = self.remove_inflight(page_id);
                                for sender in senders {
                                    sender.send(page.clone()).await;
                                }
                            })
                            .detach();
                            return PageResp::Wait(rx);
                        }
                        Err(tx_fail_insert) => {
                            // Cannot insert because another thread just
                            // inserted the same block concurrently; take the
                            // sender back and retry.
                            tx = tx_fail_insert;
                        }
                    }
                }
            }
        }
    }
}
```

jiangzhe commented 2 years ago

I think `PageResp` is actually annoying and could be removed. We can make the `fetch_page` method async, await inside it, and change the return type to `Result<Page, Error>`, which is more convenient to use:

```rust
async fn fetch_page(&self, page_id: PageID) -> Result<Page, Error>;
```

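A sketch of how that signature plays out for callers (hypothetical names throughout; the tiny `block_on` executor exists only to make the example self-contained, where a real system would use an async runtime): both the in-memory hit and the disk read become a single `.await`, with no `PageResp` enum to match on.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Minimal executor so the sketch runs without an async runtime.
struct ThreadWaker(Thread);
impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}

type Page = Arc<Vec<u8>>;

struct BufferPool {
    in_mem: HashMap<u64, Page>,
}

impl BufferPool {
    // Both the hit and miss paths resolve through one await point.
    async fn fetch_page(&self, page_id: u64) -> Result<Page, String> {
        if let Some(page) = self.in_mem.get(&page_id) {
            return Ok(page.clone()); // fast path: already resident
        }
        self.read_from_disk(page_id).await // slow path: await the IO
    }

    // Simulated disk read; a real pool would go through the
    // inflight-request map shown earlier.
    async fn read_from_disk(&self, page_id: u64) -> Result<Page, String> {
        Ok(Arc::new(vec![page_id as u8; 4]))
    }
}
```

The deduplication logic from the earlier `fetch_page` sketch stays inside the method; only its surface changes from `PageResp` to a plain awaited `Result`.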
skyzh commented 2 years ago
> (2) Doesn't support fast-path concurrent read. When there are multiple queries accessing the same block, and the block is not present in memory, the `Column` struct will fetch the same block multiple times.

... is already solved by https://github.com/risinglightdb/risinglight/pull/556

So the current problem is to control memory usage of our system... Maybe we need to implement a real BufferPool now 🤣