tennix opened this issue 4 years ago
I want to join in this task~
@francis0407 and I also have interest in this issue.
Another task that can be added to this is to ensure proper streaming to object storage without copying data in memory.
What do you mean "without copying data in memory"?
I guess if streaming is working well then perhaps it isn't necessary to state this. But we should avoid copying wherever possible. The current implementation uses read_to_end which I presume copies the data every time it re-sizes its vector.
I think it means uploading a file from disk without buffering all of the file data in memory.
Even though it is an SST file, I think it starts out in memory?
The writer interface on TiKV is currently
fn write(&self, name: &str, reader: &mut dyn Read) -> io::Result<()>;
so the cloud storage implementation decides how to read the SST file and upload it. For local storage we use Rust's built-in std::io::copy, which streams up to 8 KiB at a time.
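For comparison, here is a minimal sketch of what a streaming backend behind that interface can look like. The LocalStorage struct and its fields are illustrative, not the actual TiKV types; the point is that std::io::copy only ever holds its small internal buffer, regardless of the file size.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::PathBuf;

// Illustrative local-storage backend (not the actual TiKV definition).
struct LocalStorage {
    base: PathBuf,
}

impl LocalStorage {
    // Streams from `reader` into the destination file in small chunks.
    // std::io::copy reuses one fixed-size internal buffer, so memory
    // usage stays constant no matter how large the SST file is.
    fn write(&self, name: &str, reader: &mut dyn Read) -> io::Result<()> {
        let mut file = File::create(self.base.join(name))?;
        io::copy(reader, &mut file)?;
        file.sync_all()
    }
}
```

A cloud backend could follow the same shape, forwarding each chunk to an upload API instead of a local file.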
Just checked and the file content is in memory before being uploaded.
@yiwu-arbug when the file is created in memory, can it be created in a streaming fashion?
@gregwebs The SST writer is currently using an in-memory storage to speed up SST generation. This is the first copy. We then write the content into a &mut Vec<u8> buffer. This is the second copy.
The buffer is turned into a rate-limited reader and passed into write(). In tikv/tikv#6209, for simplicity, read_to_end() is used to extract the entire content into a &mut Vec<u8> again. This is the third copy.
So in the current situation we will have to deal with 3N bytes at a time. The second and third copies can be eliminated by streaming, which reduces the memory usage to N bytes (assuming buffer size ≪ N).
Typically N = region size = 96 MB, and the default concurrency is 4, so we're talking about using memory of ~1200 MB vs ~400 MB here.
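To make the difference concrete, here is a rough sketch of the two upload shapes discussed above. The function and parameter names are made up for illustration and this is not the actual TiKV code; the first variant buffers the whole file again before uploading, while the second forwards bounded chunks and never holds more than one chunk.

```rust
use std::io::{self, Read};

// Current shape: read_to_end() pulls the entire N bytes into yet another
// Vec<u8> (the "third copy") before handing it to the uploader.
fn upload_via_read_to_end(
    limited: &mut dyn Read,                    // rate-limited reader over the SST buffer
    upload: impl Fn(&[u8]) -> io::Result<()>,  // e.g. one PUT of the whole body
) -> io::Result<()> {
    let mut body = Vec::new();
    limited.read_to_end(&mut body)?;           // peak memory grows with the file size
    upload(body.as_slice())
}

// Streaming shape: forward fixed-size chunks as they are read, so the
// extra memory is just the chunk buffer (buffer size ≪ N).
fn upload_streaming(
    limited: &mut dyn Read,
    mut upload_chunk: impl FnMut(&[u8]) -> io::Result<()>,
) -> io::Result<()> {
    let mut chunk = [0u8; 8 * 1024];
    loop {
        let n = limited.read(&mut chunk)?;
        if n == 0 {
            return Ok(());
        }
        upload_chunk(&chunk[..n])?;
    }
}
```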
@yiwu-arbug That said, for tikv/tikv#6209, the advantage of streaming over read_to_end() isn't about the 800 MB of memory, but that the rate limit is actually respected.
Suppose we set the rate limit to 10 MB/s. With streaming, we will upload 1 MB every 0.1 s and thus attain the average speed of 10 MB/s uniformly. With read_to_end(), however, we will sleep for 9.6 s and then suddenly upload the entire 96 MB file at maximum speed. This doesn't sound like the desired behavior.
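A minimal sketch of a rate limiter with that behavior (the struct and the sleep policy are assumptions, not the actual TiKV limiter): each read blocks just long enough that the bytes consumed so far stay within budget.

```rust
use std::io::{self, Read};
use std::thread;
use std::time::{Duration, Instant};

// Illustrative rate-limited reader: sleeps so that `consumed` bytes never
// run ahead of what `bytes_per_sec` allows since `start`.
struct RateLimitedReader<R> {
    inner: R,
    bytes_per_sec: u64,
    start: Instant,
    consumed: u64,
}

impl<R: Read> Read for RateLimitedReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.consumed += n as u64;
        // Earliest time at which `consumed` bytes are within budget.
        let allowed_at =
            Duration::from_secs_f64(self.consumed as f64 / self.bytes_per_sec as f64);
        let elapsed = self.start.elapsed();
        if allowed_at > elapsed {
            thread::sleep(allowed_at - elapsed);
        }
        Ok(n)
    }
}
```

When the caller streams small chunks straight to the network, the pauses interleave with the transfer and the wire sees a steady 10 MB/s; when the caller first drains everything with read_to_end(), all the pauses happen while filling memory, and the upload afterwards runs unthrottled.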
Yes, please get rid of read_to_end! My understanding is that it will resize its buffer, so that means copying memory around during the operation, slowing things down further.
From a quick look at the TiKV source code to follow things back to the source, I see the SST file being generated by impl SstWriter for RocksSstWriter using read_to_end! I am not sure if this is the right code path. However, the point is that it does seem we should be able to stream data out of RocksDB straight into S3 without buffering all the data and copying.
As you said, besides making memory usage predictable it also makes network usage predictable. Additionally, if there is more network bandwidth available then it will be possible to increase the concurrency, whereas if we are loading up memory and periodically maxing out network usage the backup will have to stay throttled. The backup operation can slow down the rest of the queries by delaying the GC, so the sooner it can finish the better.
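As a sketch of what streaming straight into object storage could look like, the loop below reads one bounded part at a time, uploads it, and reuses the same buffer, so memory stays at roughly one part regardless of the SST size. The MultipartUpload trait and its methods are made up for illustration and stand in for a real S3/GCS SDK; this is not BR code.

```rust
use std::io::{self, Read};

// Hypothetical multipart-upload client; a real implementation would wrap
// an S3/GCS SDK. Only the shape of the loop matters here.
trait MultipartUpload {
    fn begin(&self, key: &str) -> io::Result<String>; // returns an upload id
    fn upload_part(&self, id: &str, part_no: u32, data: &[u8]) -> io::Result<()>;
    fn complete(&self, id: &str) -> io::Result<()>;
}

// S3 requires multipart parts of at least 5 MiB (except the last part).
const PART_SIZE: usize = 5 * 1024 * 1024;

fn stream_to_object_storage(
    client: &dyn MultipartUpload,
    key: &str,
    reader: &mut dyn Read,
) -> io::Result<()> {
    let id = client.begin(key)?;
    let mut part = vec![0u8; PART_SIZE];
    let mut part_no = 1u32;
    loop {
        // Fill one part (possibly short at end of stream) before uploading it.
        let mut filled = 0;
        while filled < PART_SIZE {
            let n = reader.read(&mut part[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }
        if filled == 0 {
            break;
        }
        client.upload_part(&id, part_no, &part[..filled])?;
        part_no += 1;
        if filled < PART_SIZE {
            break; // end of input
        }
    }
    client.complete(&id)
}
```

Combined with a rate-limited reader, this keeps both memory and network usage flat over the whole backup.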
BR support common cloud storage
Overview
Integrate BR with common cloud object storage (S3, GCS, Azure Blob Storage, etc.).
Problem statement
Currently, BR supports local storage, where backup files are stored in a local directory. But the backup files need to be collected together and copied to every TiKV node, which is difficult to use in practice, so it's better to mount an NFS-like filesystem on every TiKV node and the BR node. However, mounting NFS on every node is difficult to set up and error-prone.
Alternatively, object storage is a better fit for this scenario, especially since it's quite common to back up to S3/GCS on public cloud.
TODO list
S3 Support (2400 points)
GCS Support (1800 points)
TiDB Operator integration (900 points)
Test (2100 points)
Mentors
Recommended skills