Add parallel redo recovery.

RahulKushwaha commented 7 months ago

Is this already available in current versions of innodb or something that needs fresh design?

RahulKushwaha commented 7 months ago

I was going through the codebase to learn more about this. On a high level, I see the pattern will be something like the following:

Single Threaded Reader -> Multi-Threaded Applicator

Each thread of applicator will only apply logs in order. Logs for particular block will always be routed to the same thread. Single threaded reader can also be made multi-threaded but with the additional complexity of maintaining log order, et al. Don't know how much speed up we will get given the fact we are just reading logs and pushing to applicators.

Does this make sense?

sunbains commented 7 months ago

It's something that has to be implemented, I had implemented a POC about 4 years ago. The read can also be made parallel, it's fairly straight forward. In my experiments the parallel read is the easy part and IIRC it was reading at whatever was the disk throughput.

You are correct about queuing the redo log records per page, but you have to coordinate it with file create/drop too. Skip truncated table spaces etc. quite straight forward.

I would prefer to do it like the POC, extract and abstract the redo log apply and read into log record types and something that can be compiled and run stand alone with unit tests. Currently the apply code is part of the code where it is applied.

The medium challenge is to do it in the background, so that there is instant startup. The other thing to do properly is the read ahead. If the table spaces are on different disks then try to exploit that while applying and flushing the results. Make it restartable. This last part of restartable I haven't thought through.

sunbains commented 7 months ago

The way I had implemented it was to write a redo log stream reader with methods to read uint8..uint64 and n bytes. These methods returned the value as an out parameters and read status as return value in native byte order.Reading then becomes quite simple.

The trick was to use vector IO to make the redo log content contiguous, after determining the redo log segments to read in parallel, this simplified the streaming. The current way is a bit cumbersome.

RahulKushwaha commented 7 months ago

I can take a stab at it. :)

My understanding: Things start from recv_recovery_from_checkpoint_start_func. And then we go through all the files in the log group calling recv_group_scan_log_recs. Here we read the segment in the buffer, and then parse it in recv_scan_log_recs. recv_scan_log_recs also adds to hash and if the hash is full then we apply the records as well. And while parsing and stuff we call recv_parse_or_apply_log_rec_body to do the actual parsing job.

Do we intend to use redo log stream to replace functionality like mlog_parse_nbytes, mlog_parse_index, etc. I also see use of buffer pointer in methods like trx_undo_parse_add_undo_rec that read buffer pointer and return a new pointer which replaces the old pointer(suggesting we read stuff and want to move forward).

What I understand from the comments is that we will read files/segments in parallel to a buffer, and this redo log stream reader is an abstraction on top of that buffer(we can read ahead in that buffer, etc). Seems like the parsing component is the user of this redo log stream reader? And the methods I mentioned mlog_parse_nbytes and trx_undo_parse_add_undo_rec will use them?

sunbains commented 7 months ago

Go for it. I would prefer a full redesign and rewrite of this. You can use the existing code as a guide. Yes, get rid of all the mlogparse* etc.

Yes, the stream will work over a buffer, the buffer can be be smaller than the segment which it handles. The stream will do the automatic fetch from disk and make it transparent to the log stream parser. The parser will generate the redo log records that are queued to the page specific apply queue.

I also want to see how much we can leverage C++20 coroutines.

RahulKushwaha commented 7 months ago

Makes sense. Coroutines should be fun to integrate :)

RahulKushwaha commented 7 months ago

Things so far: It has been fun working on this :)

Created a class, redo_log_stream that allows iterating over recv_t records. Once we finish the current set of records we fetch the next batch.
Created another class redo_log_applicator that has a public method to add records, and then internally it applies in parallel.

I will refactor the redo_log_stream code more after I get this setup working(essentially as discussed to read ints, etc). Currently, I am reutilizing the exiting methods to read, fetch, and apply.

I still have bugs in the code, and working on fixing.

Question: I see that the records that are read are placed on the heap memory using the following code mem_heap_alloc(recv_sys->heap, sizeof(recv_t)). But I could not find a way to release this memory(maybe did not look hard enough). I can release the whole memory though.

So essentially, my log_stream keeps on reading and then sending to applicator but has to wait until all the records have been applied so that I can release all the memory using mem_heap_empty(recv_sys->heap) and then resume fetching and applying. This release becomes a rendezvous point :D

Am I right in this aspect?

(branch: https://github.com/RahulKushwaha/embedded-innodb/tree/rahul-parallel-redo-log)

RahulKushwaha commented 7 months ago

Implemented a POC to get a sense of code.

What it does? redo_log_stream has the responsibility of reading records from files and providing its users a method to read records one by one. It stops when we run out of heap memory(any further calls just return the same error code until the heap memory is cleared). The caller has the responsibility to free it when it is done with those records. And then it can start producing more records.

redo_log_applicator allows adding records, and internally it can apply in parallel. Currently it accepts the records and pushes it block specific queue. And each queue specific threads pulls records out, and applied them to the pages.

ib_recovery test is able to pass!

Draft PR: https://github.com/sunbains/embedded-innodb/pull/43

Next steps:

redo_log_applicator: make it parallel, remove hacks.
Fix locking of recv_sys and log_sys.(I must hve messed up a bunch of things).
redo_log_stream remove bunch of hacks.

sunbains commented 7 months ago

Regarding allocation and free, yes. You allocate the required heap up front. Then allocate the required memory as required from this heap. You free the entire heap once done.

sunbains / embedded-innodb

Add parallel redo recovery. #18