go-bond / bond

use dedicated batch for index keys #57

Closed poonai closed 2 years ago

poonai commented 2 years ago

What changes were made?

This PR tracks all index keys in a separate batch and applies them at a later stage.

Rationale for the change

Index keys are not required for checking record duplication.

Note

indexKeyBatch instances are lightweight, since they are put back into the Pool on close.
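
A minimal sketch of the pattern described above, using only the standard library. Everything here is an assumption for illustration: `memBatch`, `store`, `record`, and the key layout are hypothetical stand-ins, not bond's or pebble's actual types. It shows the duplicate check consulting only data keys while index keys accumulate in a second, pooled batch that is applied afterwards.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// memBatch is a hypothetical in-memory stand-in for a storage-engine write batch.
type memBatch struct {
	keys, values []string
}

func (b *memBatch) set(key, value string) {
	b.keys = append(b.keys, key)
	b.values = append(b.values, value)
}

// batchPool recycles batches so the extra index batch costs little per call.
var batchPool = sync.Pool{New: func() interface{} { return new(memBatch) }}

func getBatch() *memBatch { return batchPool.Get().(*memBatch) }

// putBatch clears the batch (keeping its capacity) and returns it to the pool.
func putBatch(b *memBatch) {
	b.keys, b.values = b.keys[:0], b.values[:0]
	batchPool.Put(b)
}

var errDuplicate = errors.New("record already exists")

// record is a hypothetical row with one indexed field (Email).
type record struct{ ID, Email string }

// store is a hypothetical key-value store backed by a map.
type store struct{ data map[string]string }

func (s *store) apply(b *memBatch) {
	for i, k := range b.keys {
		s.data[k] = b.values[i]
	}
}

// insert collects data keys and index keys in separate batches.
func (s *store) insert(records []record) error {
	dataBatch, indexBatch := getBatch(), getBatch()
	defer putBatch(dataBatch)
	defer putBatch(indexBatch)

	pending := make(map[string]bool, len(records))
	for _, r := range records {
		pk := "data/" + r.ID
		// The duplicate check only consults data keys, never index keys.
		if _, exists := s.data[pk]; exists || pending[pk] {
			return errDuplicate
		}
		pending[pk] = true
		dataBatch.set(pk, r.Email)
		indexBatch.set("index/email/"+r.Email+"/"+r.ID, "")
	}
	s.apply(dataBatch)
	s.apply(indexBatch) // index keys are applied only after all checks passed
	return nil
}

func main() {
	s := &store{data: map[string]string{}}
	fmt.Println(s.insert([]record{{ID: "1", Email: "a@example.com"}}))
	fmt.Println(s.insert([]record{{ID: "1", Email: "b@example.com"}}))
	fmt.Println(len(s.data))
}
```

In this sketch a failed insert leaves both batches unapplied, so no index key is written for a rejected record.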

pkieltyka commented 2 years ago

Thanks for the PR :) what do the benchmarks look like before and after? (Assuming a data set of at least 5GB)

poonai commented 2 years ago

Regarding the benchmark:

If we do a sorted insert, we won't see a big difference, because pebble itself is optimized enough to skip through sorted keys.

We have to run it on a realistic workload to see whether this optimization makes sense.

I can create a dataset with randomized keys for the benchmark. Or do we have an existing dataset that I could use?

pkieltyka commented 2 years ago

We have another app consuming the data set but I think writing a simple tool to produce random data just for bond testing is a good idea
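
A rough sketch of what such a random-data generator could look like, assuming fixed-size random keys and payloads. The names `randomRecord` and `generate`, the 16-byte key size, and the seed are hypothetical choices for illustration and are not taken from the tool that was eventually written.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"math/rand"
)

// randomRecord is a hypothetical benchmark row: a random key plus a payload.
type randomRecord struct {
	Key   string
	Value []byte
}

// generate returns n records with random (hence unsorted) keys.
// A fixed seed makes the data set reproducible across benchmark runs.
func generate(seed int64, n, valueSize int) []randomRecord {
	rng := rand.New(rand.NewSource(seed))
	out := make([]randomRecord, n)
	for i := range out {
		key := make([]byte, 16)
		rng.Read(key) // (*rand.Rand).Read always fills the slice
		value := make([]byte, valueSize)
		rng.Read(value)
		out[i] = randomRecord{Key: hex.EncodeToString(key), Value: value}
	}
	return out
}

func main() {
	for _, r := range generate(42, 3, 128) {
		fmt.Println(r.Key, len(r.Value))
	}
}
```

Because the keys are random rather than sorted, inserting them exercises the unsorted-write path that the sorted-insert case above would hide.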

poonai commented 2 years ago

Thanks, I’ll come up with the tool :)

poonai commented 2 years ago

notify: @pkieltyka

I've implemented a benchmarking tool for long-running jobs as per your suggestion. (tried to mimic ycsb)

Here are the preliminary results:

master


Total time taken to insert 19m40.066199548s
size of database 753 MB

Total time taken to insert 19m49.239178586s
size of database 763 MB

poonai:poonai/insert_batch_seperation

Total time taken to insert 17m12.718217997s
size of database 752 MB

Total time taken to insert 17m8.408880599s
size of database 746 MB

Regarding "Assuming a data set of at least 5GB":

I didn't run it with 5 GB, since that takes a lot of time on my small machine.

But I'm curious to stress the machine myself. I will update here once it's done 😄

pkieltyka commented 2 years ago

nicely done, so a ~13% improvement :)

do you think sorting the indexes before we write them to the index batch would make a difference..? or should pebble already be sorting keys during the batch write?
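
For reference, the ~13% figure can be reproduced from the four timings posted above by averaging the runs per branch and comparing the means. The helper names `meanDuration` and `improvement` are made up for this sketch; only the timing strings come from the thread.

```go
package main

import (
	"fmt"
	"time"
)

// meanDuration parses Go-style duration strings and returns their average.
func meanDuration(runs ...string) (time.Duration, error) {
	if len(runs) == 0 {
		return 0, fmt.Errorf("no runs given")
	}
	var total time.Duration
	for _, r := range runs {
		d, err := time.ParseDuration(r)
		if err != nil {
			return 0, err
		}
		total += d
	}
	return total / time.Duration(len(runs)), nil
}

// improvement returns the relative speed-up of after over before, in percent.
func improvement(before, after time.Duration) float64 {
	return 100 * float64(before-after) / float64(before)
}

func main() {
	master, err := meanDuration("19m40.066199548s", "19m49.239178586s")
	if err != nil {
		panic(err)
	}
	branch, err := meanDuration("17m12.718217997s", "17m8.408880599s")
	if err != nil {
		panic(err)
	}
	fmt.Printf("master %v, branch %v, improvement %.1f%%\n",
		master, branch, improvement(master, branch))
}
```

With only two runs per branch, the result says little about run-to-run variance, so it is best read as a rough estimate.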

poonai commented 2 years ago

pebble will do the sorting internally.

marino39 commented 2 years ago

@poonai Hi, Nice to meet you.

I really like the benchmarking tool that you have prepared. I didn't expect such a big gain, which is why I ran your benchmark too.

master

Total time taken to insert 17m48.307964933s 
size of database 756 MB

Total time taken to insert 17m28.721275263s 
size of database 755 MB

Total time taken to insert 17m38.046745363s 
size of database 763 MB 

poonai:poonai/insert_batch_seperation

Total time taken to insert 17m22.238643583s 
size of database 750 MB 

Total time taken to insert 17m24.658270013s 
size of database 757 MB 

Total time taken to insert 17m29.262544851s 
size of database 759 MB 

It's still better, but the difference isn't that big anymore. I made sure that nothing else was running on my OS at the time. Benchmarking can sometimes be tricky.

We would like to keep this change for a few reasons:

  1. It's faster.
  2. We would like to separate data and indexing.
  3. It will help us abstract indexer logic later on and potentially offer different indexing strategies.

To keep the indexing logic consistent across the board, could you apply the same thing to Update, Upsert, and Delete? In addition, please move the benchmarking tool to cmd/tools/<name_of_the_tool>

Thank you :)

poonai commented 2 years ago

@marino39

Peter mentioned that you are the core contributor to Bond. It's a pleasure meeting you!

Yes, benchmarking is tricky. That's why I ran it multiple times, but it turned out that my machine punished me :(

I've applied the pattern to Update and Upsert, and also found that it's not required for Delete.

Thanks for accepting the changes.