tennix opened this issue 4 years ago
I want to join in this task~
@francis0407 and I also have interest in this issue.
Another task that can be added to this is to ensure proper streaming to object storage without copying data in memory.
What do you mean "without copying data in memory"?
I guess if streaming is working well then perhaps it isn't necessary to state this. But we should avoid copying wherever possible. The current implementation uses read_to_end which I presume copies the data every time it re-sizes its vector.
I think it means uploading a file from disk without buffering all of the file data in memory.
Even though it is an SST file, I think it starts out in memory?
The writer interface on TiKV is currently
fn write(&self, name: &str, reader: &mut dyn Read) -> io::Result<()>;
so the cloud storage implementation decides how to read the SST file and upload it. For local storage we use Rust's built-in std::io::copy, which streams up to 8 KiB at a time.
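For comparison, here is a minimal sketch of what a streaming backend behind that interface can look like. The LocalStorage struct and its fields are illustrative, not the actual TiKV types; the point is that std::io::copy only ever holds its small internal buffer, regardless of the file size.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::PathBuf;

// Illustrative local-storage backend (not the actual TiKV definition).
struct LocalStorage {
    base: PathBuf,
}

impl LocalStorage {
    // Streams from `reader` into the destination file in small chunks.
    // std::io::copy reuses one fixed-size internal buffer, so memory
    // usage stays constant no matter how large the SST file is.
    fn write(&self, name: &str, reader: &mut dyn Read) -> io::Result<()> {
        let mut file = File::create(self.base.join(name))?;
        io::copy(reader, &mut file)?;
        file.sync_all()
    }
}
```

A cloud backend could follow the same shape, forwarding each chunk to an upload API instead of a local file.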
Just checked and the file content is in memory before being uploaded.
@yiwu-arbug when the file is created in memory, can it be created in a streaming fashion?
@gregwebs The SST writer is currently using an in-memory storage to speed up SST generation. This is the first copy. We then write the content into a &mut Vec<u8> buffer. This is the second copy.
The buffer is turned into a rate-limited reader and passed into write(). In tikv/tikv#6209, for simplicity, read_to_end() is used to extract the entire content into a &mut Vec<u8> again. This is the third copy.
So in the current situation we will have to deal with 3N bytes at a time. The second and third copies can be eliminated by streaming, which reduces the memory usage to N bytes (assuming buffer size ≪ N).
Typically N = region size = 96 MB, and the default concurrency is 4, so we're talking about using memory of ~1200 MB vs ~400 MB here.
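To make the difference concrete, here is a rough sketch of the two upload shapes discussed above. The function and parameter names are made up for illustration and this is not the actual TiKV code; the first variant buffers the whole file again before uploading, while the second forwards bounded chunks and never holds more than one chunk.

```rust
use std::io::{self, Read};

// Current shape: read_to_end() pulls the entire N bytes into yet another
// Vec<u8> (the "third copy") before handing it to the uploader.
fn upload_via_read_to_end(
    limited: &mut dyn Read,                    // rate-limited reader over the SST buffer
    upload: impl Fn(&[u8]) -> io::Result<()>,  // e.g. one PUT of the whole body
) -> io::Result<()> {
    let mut body = Vec::new();
    limited.read_to_end(&mut body)?;           // peak memory grows with the file size
    upload(body.as_slice())
}

// Streaming shape: forward fixed-size chunks as they are read, so the
// extra memory is just the chunk buffer (buffer size ≪ N).
fn upload_streaming(
    limited: &mut dyn Read,
    mut upload_chunk: impl FnMut(&[u8]) -> io::Result<()>,
) -> io::Result<()> {
    let mut chunk = [0u8; 8 * 1024];
    loop {
        let n = limited.read(&mut chunk)?;
        if n == 0 {
            return Ok(());
        }
        upload_chunk(&chunk[..n])?;
    }
}
```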
@yiwu-arbug That said, for tikv/tikv#6209, the advantage of streaming over read_to_end() isn't about the 800 MB of memory, but that the rate limit is actually respected.
Suppose we set the rate limit to 10 MB/s. With streaming, we will upload 1 MB every 0.1 s and thus attain the average speed of 10 MB/s uniformly. With read_to_end(), however, we will sleep for 9.6 s and then suddenly upload the entire 96 MB file at maximum speed. This doesn't sound like the desired behavior.
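A minimal sketch of a rate limiter with that behavior (the struct and the sleep policy are assumptions, not the actual TiKV limiter): each read blocks just long enough that the bytes consumed so far stay within budget.

```rust
use std::io::{self, Read};
use std::thread;
use std::time::{Duration, Instant};

// Illustrative rate-limited reader: sleeps so that `consumed` bytes never
// run ahead of what `bytes_per_sec` allows since `start`.
struct RateLimitedReader<R> {
    inner: R,
    bytes_per_sec: u64,
    start: Instant,
    consumed: u64,
}

impl<R: Read> Read for RateLimitedReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.consumed += n as u64;
        // Earliest time at which `consumed` bytes are within budget.
        let allowed_at =
            Duration::from_secs_f64(self.consumed as f64 / self.bytes_per_sec as f64);
        let elapsed = self.start.elapsed();
        if allowed_at > elapsed {
            thread::sleep(allowed_at - elapsed);
        }
        Ok(n)
    }
}
```

When the caller streams small chunks straight to the network, the pauses interleave with the transfer and the wire sees a steady 10 MB/s; when the caller first drains everything with read_to_end(), all the pauses happen while filling memory, and the upload afterwards runs unthrottled.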
Yes, please get rid of read_to_end! My understanding is that it will resize its buffer, so that means copying memory around during the operation, slowing things down further.
From a quick look at the TiKV source code to follow things back to the source, I see the SST file being generated by impl SstWriter for RocksSstWriter using read_to_end! I am not sure if this is the right code path. However, the point is that it does seem we should be able to stream data out of RocksDB straight into S3 without buffering all the data and copying.
As you said, besides making memory usage predictable it also makes network usage predictable. Additionally, if there is more network bandwidth available then it will be possible to increase the concurrency, whereas if we are loading up memory and periodically maxing out network usage the backup will have to stay throttled. The backup operation can slow down the rest of the queries by delaying the GC, so the sooner it can finish the better.
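As a sketch of what streaming straight into object storage could look like, the loop below reads one bounded part at a time, uploads it, and reuses the same buffer, so memory stays at roughly one part regardless of the SST size. The MultipartUpload trait and its methods are made up for illustration and stand in for a real S3/GCS SDK; this is not BR code.

```rust
use std::io::{self, Read};

// Hypothetical multipart-upload client; a real implementation would wrap
// an S3/GCS SDK. Only the shape of the loop matters here.
trait MultipartUpload {
    fn begin(&self, key: &str) -> io::Result<String>; // returns an upload id
    fn upload_part(&self, id: &str, part_no: u32, data: &[u8]) -> io::Result<()>;
    fn complete(&self, id: &str) -> io::Result<()>;
}

// S3 requires multipart parts of at least 5 MiB (except the last part).
const PART_SIZE: usize = 5 * 1024 * 1024;

fn stream_to_object_storage(
    client: &dyn MultipartUpload,
    key: &str,
    reader: &mut dyn Read,
) -> io::Result<()> {
    let id = client.begin(key)?;
    let mut part = vec![0u8; PART_SIZE];
    let mut part_no = 1u32;
    loop {
        // Fill one part (possibly short at end of stream) before uploading it.
        let mut filled = 0;
        while filled < PART_SIZE {
            let n = reader.read(&mut part[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }
        if filled == 0 {
            break;
        }
        client.upload_part(&id, part_no, &part[..filled])?;
        part_no += 1;
        if filled < PART_SIZE {
            break; // end of input
        }
    }
    client.complete(&id)
}
```

Combined with a rate-limited reader, this keeps both memory and network usage flat over the whole backup.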
BR support common cloud storage
Overview
Integrate BR with common cloud object storage (S3, GCS, Azure Blob Storage, etc.).
Problem statement
Currently, BR supports local storage, where backup files are stored in a local directory. But the backup files need to be collected together and copied to every TiKV node, which is difficult to use in practice, so it's better to mount an NFS-like filesystem on every TiKV node and the BR node. However, mounting NFS on every node is difficult to set up and error-prone.
Alternatively, object storage is a better fit for this scenario, especially since it's quite common to back up to S3/GCS on public cloud.
TODO list
S3 Support (2400 points)
GCS Support (1800 points)
TiDB Operator integration (900 points)
Test (2100 points)
Mentors
Recommended skills