rockset / rocksdb-cloud

A library that provides an embeddable, persistent key-value store for fast storage optimized for AWS
http://rocksdb.org
GNU General Public License v2.0

how to use zero copy #83

Open nnsgmsone opened 4 years ago

nnsgmsone commented 4 years ago

How do I use zero copy? Does a so-called zero-copy clone share the S3 storage or not? I have seen presentations describing both situations and am confused. Hope you can answer.

[two screenshot attachments]

dhruba commented 4 years ago

Zero-copy clones do share the S3 storage. But once the clone is created, the new files that it generates will be at a new location in S3.

Maybe you can look at the example of a zero-copy clone here: https://github.com/rockset/rocksdb-cloud/blob/master/cloud/examples/clone_example.cc#L32
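
Here is a minimal sketch of the pattern that example follows. The bucket names, object prefixes, region, and local path below are placeholders, not values from the example; check the linked file for the exact API in the version you use.

```cpp
// Minimal sketch of a zero-copy clone, loosely following clone_example.cc.
// Bucket names, object prefixes, the region, and the local path are
// placeholders; see the linked example for the exact, current API.
#include "rocksdb/cloud/cloud_env_options.h"
#include "rocksdb/cloud/db_cloud.h"
#include "rocksdb/options.h"

using namespace rocksdb;

int main() {
  CloudEnvOptions cloud_env_options;
  CloudEnv* cenv = nullptr;

  // Source bucket/prefix point at the existing database; destination
  // bucket/prefix is where the clone writes any *new* files it generates.
  Status s = CloudEnv::NewAwsEnv(
      Env::Default(),
      "src-bucket", "path/to/source-db", "us-west-2",
      "dest-bucket", "path/to/clone-db", "us-west-2",
      cloud_env_options, nullptr /* info_log */, &cenv);
  if (!s.ok()) return 1;

  Options options;
  options.env = cenv;
  options.create_if_missing = true;

  // Opening a DBCloud against this env yields a clone that shares the
  // source's existing SST files in S3 instead of copying them.
  DBCloud* clone = nullptr;
  s = DBCloud::Open(options, "/tmp/clone-db",
                    "" /* persistent_cache_path */,
                    0 /* persistent_cache_size_gb */, &clone);
  if (!s.ok()) return 1;

  // ... use the clone; new SSTs land under dest-bucket/path/to/clone-db ...
  delete clone;
  delete cenv;
  return 0;
}
```

The point is that the source bucket/prefix identify the database being cloned, while the destination is a fresh location, so the existing S3 files are shared and only new files diverge.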

nnsgmsone commented 4 years ago

@dhruba How else can I ensure consistency through the WAL? Do I need to manually build a checkpoint and have the new node recover from that checkpoint?

dhruba commented 4 years ago

If you store the WAL in Kinesis and the SST files in S3, then when you reopen the DB on a different machine, you can replay the WAL. For example, if the WAL is on Kafka, you can follow this pattern: https://github.com/rockset/rocksdb-cloud/blob/master/cloud/db_cloud_test.cc#L797

If the WAL is on Kinesis: https://github.com/rockset/rocksdb-cloud/blob/master/cloud/db_cloud_test.cc#L836

Just an FYI: for Rockset's use case, we switch OFF rocksdb-cloud's WAL.
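
For reference, the generic RocksDB way to skip the WAL is the per-write option shown in this minimal sketch (this is stock RocksDB, not necessarily how we wire it up internally):

```cpp
// Sketch: skipping the WAL for a write, using the stock RocksDB per-write
// option. This is generic RocksDB; it is not necessarily how Rockset
// disables the WAL internally.
#include "rocksdb/db.h"

rocksdb::Status PutWithoutWal(rocksdb::DB* db) {
  rocksdb::WriteOptions wopts;
  wopts.disableWAL = true;  // this write never touches the write-ahead log
  return db->Put(wopts, "key", "value");
}
```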

nnsgmsone commented 4 years ago

@dhruba Well, there is a problem. If writes occur during the clone, should I block writes or generate a checkpoint myself? The example seems to just flush all SSTs to S3.

nnsgmsone commented 4 years ago

In other words, should I control the timing of replaying the WAL?

nnsgmsone commented 4 years ago

If there is no checkpoint, then writes arriving during the clone seem to cause data inconsistency between the two nodes.

dhruba commented 4 years ago

> In other words, should I control the timing of replaying the WAL?

Yes, that makes sense. Let me know if this works for you.

nnsgmsone commented 4 years ago

ok

b-slim commented 4 years ago

@dhruba Newbie question: when you say replay the WAL from Kafka, does that mean you replay the WAL from the start of time? If not, how does the replica know about the highest watermark, aka the Kafka offset? To be clearer, the use case I am trying to tackle is migrating a RocksDB instance on failure, and I am trying to avoid replaying the log from the start of time because that takes hours. Sorry, I am still getting familiar with the RocksDB code base.

dhruba commented 4 years ago

Hi @b-slim, thanks for your question.

You can use the zero-copy clone this way. Suppose you have a rocksdb-cloud database D1 that has a WAL and uploads its SST files to S3. Let's say that the WAL is in Kafka.

Now, if you want to make a zero-copy clone, you will do these steps in this order:

  1. Record the latest sequence number S1 from the Kafka WAL. New writes can continue to happen to D1.
  2. Create the zero-copy clone C1 using the rocksdb-cloud APIs. Then apply to C1 all the WAL entries from the beginning of the Kafka log up to S1 (a rough sketch of this replay loop appears after the list). Once this is complete, you have a clone C1 that is completely in sync with the version of the database D1 as of the time when the clone was created.
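
A rough sketch of the replay in step 2, assuming each Kafka message carries one serialized rocksdb::WriteBatch; fetch_next_batch below is a placeholder for your own consumer code, not an API of this library:

```cpp
// Hypothetical sketch of step 2: apply WAL entries from the Kafka log to the
// clone C1 until its sequence number reaches S1. fetch_next_batch() stands in
// for whatever Kafka consumer code you use; it is assumed to return one
// serialized rocksdb::WriteBatch per call.
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

bool fetch_next_batch(std::string* serialized);  // placeholder, not a real API

rocksdb::Status ReplayUpTo(rocksdb::DB* clone, rocksdb::SequenceNumber s1) {
  std::string serialized;
  while (clone->GetLatestSequenceNumber() < s1 &&
         fetch_next_batch(&serialized)) {
    rocksdb::WriteBatch batch(serialized);  // rebuild the batch from its wire format
    rocksdb::Status s = clone->Write(rocksdb::WriteOptions(), &batch);
    if (!s.ok()) return s;
  }
  return rocksdb::Status::OK();
}
```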

Now both C1 and D1 are normal rocksdb-cloud databases and are not related to one another. Writes done to one do not show up in the other, which is the expected behaviour.

Some of the SST files in S3 are still shared, and you have to be careful to ensure that they do not get erased prematurely. Ensure that the purger is enabled https://github.com/rockset/rocksdb-cloud/blob/master/include/rocksdb/cloud/cloud_env_options.h#L245 and disable file deletions via https://github.com/rockset/rocksdb-cloud/blob/master/include/rocksdb/db.h#L1095. Just a caveat that we, at Rockset, do not run the purger in our production cluster. If you find bugs in the workings of the purger code, please do submit a pull request.
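
A minimal sketch of those two settings; the field name run_purger is an assumption based on the linked header, so verify it against cloud_env_options.h before relying on it:

```cpp
// Sketch: keep shared S3 files safe on a clone. The field name run_purger is
// an assumption taken from the linked cloud_env_options.h; double-check it.
#include "rocksdb/cloud/cloud_env_options.h"
#include "rocksdb/cloud/db_cloud.h"

rocksdb::CloudEnvOptions MakeCloneEnvOptions() {
  rocksdb::CloudEnvOptions opts;
  opts.run_purger = true;  // let the purger garbage-collect obsolete S3 files safely
  return opts;
}

rocksdb::Status ProtectSharedFiles(rocksdb::DBCloud* clone) {
  // Stop RocksDB from deleting files that the source DB may still reference.
  return clone->DisableFileDeletions();
}
```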

nnsgmsone commented 4 years ago

@dhruba By the way, the above method should cause the merge of the LSM tree to fail. Assuming a merge has occurred on the clone, what is the processing flow of rocksdb-cloud at this time? Because I am not very familiar with C++, I did not find the code that handles this situation. In addition, I think it is possible to introduce a checkpoint to ensure data consistency (although this would change more code).

dhruba commented 4 years ago

@nnsgmsone I do not understand this: "By the way, the above method should cause the merge of the LSM tree to fail."