cc @romange adding this just to get feedback on using the C++ SDK - if you'd prefer to stick with a custom client I'll go back to trying to improve the reliability of that instead
We can continue with this and see if we can solve all the important issues. You mentioned CRT. What's that?
Regarding reading configuration/credentials files - I do not think it's a blocker. However using sleeps, mutexes and condition_variables could be problematic if it blocks for several milliseconds or more (like reading a file under a mutex or intentionally sleep for retry).
> You mentioned CRT
It's the 'common runtime' libraries - though I don't think we use it, since we have our own HTTP client we can disable the background event loop (at least tracing the S3 requests I think it's unused)
> However using sleeps, mutexes and condition_variables could be problematic if it blocks for several milliseconds or more (like reading a file under a mutex or intentionally sleep for retry).
Yep, I can't see anything that would block for long except the sleep for a retry (which is part of the HTTP client, so we'll override it to block the fiber instead)
Closing as implemented properly in https://github.com/romange/helio/pull/146
Draft attempt at adding the AWS C++ SDK to Helio. Adding this just to get feedback on switching to the AWS SDK; more work is needed to integrate it properly...
The main parts here are `util/cloud/aws/http_client.cc` and `util/cloud/aws/s3/write_file.cc`. This works, uploading a 50GB snapshot in 2 minutes on a 16 vCPU AWS machine (~3GB per core) (can probably improve that, see 'Tasks' below). (Dragonfly integration at https://github.com/dragonflydb/dragonfly/compare/main...andydunstall:dragonfly:feature/aws-sdk-poc)
Why
The alternative to using the AWS SDK would be to implement our own AWS client (as Helio already does), though AWS clients are so complex and there are so many edge cases that implementing a client properly would be a ton of work. AWS recommends avoiding using the REST API directly. Using an SDK would also make it easier for users to configure the Helio AWS client, as it would match how they configure any other SDK (which is again a lot of work to implement ourselves given there are so many configuration options).
To make the existing Helio AWS client reliable we would need:
Issues
Since we use fibers we can't block the active thread with blocking system calls, sleeps, synchronisation etc.
AWS supports using a custom HTTP client (instead of the default cURL HTTP client), so we can use the Helio HTTP client, which will block the fiber but not the thread.
However there are other blocking calls that aren't as simple to override, including `std::condition_variable` and `std::mutex`.
So using the SDK isn't trivial, though it still seems worth exploring since implementing our own client would be hard.
Adding our own HTTP client gets us 90% of the way there, but we'll likely need to add a few patches to integrate properly, such as avoiding blocking the thread in retry backoff and instead only blocking the fiber, and loading config and credentials from a local file without blocking.
As long as we don't block the thread for long (e.g. reading a small file in under 1ms, rather than sleeps or network calls which could take seconds), and only block when an S3 upload/download is happening to avoid affecting 'normal' operation, maybe this is acceptable for now...
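To make the thread-blocking vs fiber-blocking distinction concrete, here is the difference in miniature. This is only an illustration: it uses Boost.Fiber as a stand-in, since Helio's own fiber primitives differ.

```cpp
#include <chrono>
#include <thread>

#include <boost/fiber/operations.hpp>

// Thread-blocking backoff: stalls the whole thread, so every fiber scheduled
// on it (e.g. the shard's normal work) stops until the sleep returns.
void RetryBackoffThread(std::chrono::milliseconds delay) {
  std::this_thread::sleep_for(delay);
}

// Fiber-blocking backoff: only the calling fiber is parked; the scheduler
// keeps running other fibers on the same thread in the meantime.
// (Boost.Fiber shown for illustration; Helio's fiber API is different.)
void RetryBackoffFiber(std::chrono::milliseconds delay) {
  boost::this_fiber::sleep_for(delay);
}
```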
Extensions
HTTP Client
By default the SDK uses cURL on Linux, though we can extend the SDK with our own Helio HTTP client, which blocks the fiber rather than the thread.
We'll need to improve the existing HTTP client to handle redirects and manage connections better (where the HTTP client manages the connections itself rather than requiring the user to connect explicitly), though we'd probably have to do that anyway even without using the SDK.
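As a rough sketch of what that extension point looks like: `MakeRequest` is the SDK's hook for custom clients, but the body below is a placeholder rather than the PR's actual `util/cloud/aws/http_client.cc`, and the exact signature may vary between SDK versions.

```cpp
#include <aws/core/http/HttpClient.h>
#include <aws/core/http/HttpRequest.h>
#include <aws/core/http/standard/StandardHttpResponse.h>

// Sketch of a fiber-friendly HTTP client plugged into the SDK. The SDK calls
// MakeRequest for every API call; inside we'd use Helio's HTTP client, which
// suspends the calling fiber instead of blocking the proactor thread.
class HelioHttpClient : public Aws::Http::HttpClient {
 public:
  std::shared_ptr<Aws::Http::HttpResponse> MakeRequest(
      const std::shared_ptr<Aws::Http::HttpRequest>& request,
      Aws::Utils::RateLimits::RateLimiterInterface* read_limiter = nullptr,
      Aws::Utils::RateLimits::RateLimiterInterface* write_limiter = nullptr) const override {
    auto response = Aws::MakeShared<Aws::Http::Standard::StandardHttpResponse>(
        "helio-http", request);
    // Placeholder: translate the AWS request into a Helio/Boost request, send
    // it on the proactor (fiber-blocking), and copy status/headers/body back
    // into `response` before returning it.
    return response;
  }
};
```

The client would then be registered through a custom HTTP client factory (see the sketch under Tasks below).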
Note we must make sure we also override `RetryRequestSleep`, which puts the thread to sleep when backing off on a retry (which could be tens of seconds). It's part of the HTTP client though not `virtual`, so we would need a small patch to the SDK to override this and use fiber blocking instead of thread blocking.
Logging
By default the SDK logs to a file, though we can add our own logger that uses glog to match Dragonfly and Helio, and use the same glog configuration
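Something along these lines could work: a sketch that subclasses the SDK's `FormattedLogSystem` helper. The glog level mapping and the installation snippet are illustrative, and `Flush()` only exists as an override point in recent SDK versions.

```cpp
#include <aws/core/utils/logging/FormattedLogSystem.h>
#include <aws/core/utils/logging/LogLevel.h>
#include <glog/logging.h>

// Routes SDK log output through glog. FormattedLogSystem does the message
// formatting; we only implement the sink.
class GlogLogSystem : public Aws::Utils::Logging::FormattedLogSystem {
 public:
  using Aws::Utils::Logging::FormattedLogSystem::FormattedLogSystem;

  void Flush() override { google::FlushLogFiles(google::GLOG_INFO); }

 protected:
  void ProcessFormattedStatement(Aws::String&& statement) override {
    // Placeholder mapping: real code would map SDK log levels to glog severities.
    LOG(INFO) << statement;
  }
};

// Installed via SDKOptions before Aws::InitAPI(options):
//   options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Warn;
//   options.loggingOptions.logger_create_fn = [] {
//     return Aws::MakeShared<GlogLogSystem>("glog", Aws::Utils::Logging::LogLevel::Warn);
//   };
```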
Event Loop
By default the SDK starts a background event loop from aws-c-io, though it's not that clear what it's used for (something to do with the `CRTHttpClient` or async API requests). Since we have our own HTTP client we can disable that.
Filesystem
The SDK uses synchronous file reads to read configuration and shared profile credentials. It doesn't seem to provide a way of extending the file calls.
Configuration is only read once on startup so probably isn't an issue, though shared profile credentials can be re-read once credentials expire. It's probably not a big issue, though we should try to avoid it. The best option may be to fork the SDK to allow extending the filesystem reads/writes with our own Helio-based API. Or we just accept that a millisecond of thread blocking per day to reload credentials doesn't matter (and only when S3 is being used, so it won't affect 'normal' operation)... Or we can disable loading credentials from the local filesystem (`~/.aws/credentials`) for now...
S3
The SDK has a 'transfer manager', though that is built for uploading local files rather than our 'writer' interface, and it also has a complex 'executor' event loop which we don't need. Therefore we can add our own uploader and downloader, similar to the existing Helio implementation, except we use the SDK S3 client instead of making the REST calls directly, which makes the implementation relatively simple (i.e. use `s3::CreateMultipartUpload`, `s3::UploadPart` and `s3::CompleteMultipartUpload`; see the sketch below).
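Roughly the shape of that flow with the SDK client. This is a simplified sketch rather than the PR's actual `write_file.cc`: the `UploadParts` helper and its in-memory parts are made up for illustration, and retry/abort handling is omitted.

```cpp
#include <aws/core/utils/memory/stl/AWSStringStream.h>
#include <aws/core/utils/memory/stl/AWSVector.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/CompleteMultipartUploadRequest.h>
#include <aws/s3/model/CompletedMultipartUpload.h>
#include <aws/s3/model/CompletedPart.h>
#include <aws/s3/model/CreateMultipartUploadRequest.h>
#include <aws/s3/model/UploadPartRequest.h>

// Simplified multipart upload of in-memory parts. Real code would stream
// parts as the writer produces them and abort the upload on failure.
bool UploadParts(Aws::S3::S3Client& client, const Aws::String& bucket,
                 const Aws::String& key, const Aws::Vector<Aws::String>& parts) {
  Aws::S3::Model::CreateMultipartUploadRequest create_req;
  create_req.SetBucket(bucket);
  create_req.SetKey(key);
  auto create = client.CreateMultipartUpload(create_req);
  if (!create.IsSuccess()) return false;
  const Aws::String upload_id = create.GetResult().GetUploadId();

  Aws::S3::Model::CompletedMultipartUpload completed;
  for (size_t i = 0; i != parts.size(); ++i) {
    Aws::S3::Model::UploadPartRequest part_req;
    part_req.SetBucket(bucket);
    part_req.SetKey(key);
    part_req.SetUploadId(upload_id);
    part_req.SetPartNumber(static_cast<int>(i + 1));  // part numbers start at 1
    part_req.SetBody(Aws::MakeShared<Aws::StringStream>("part", parts[i]));
    auto part = client.UploadPart(part_req);
    if (!part.IsSuccess()) return false;
    completed.AddParts(Aws::S3::Model::CompletedPart()
                           .WithPartNumber(static_cast<int>(i + 1))
                           .WithETag(part.GetResult().GetETag()));
  }

  Aws::S3::Model::CompleteMultipartUploadRequest complete_req;
  complete_req.SetBucket(bucket);
  complete_req.SetKey(key);
  complete_req.SetUploadId(upload_id);
  complete_req.SetMultipartUpload(completed);
  return client.CompleteMultipartUpload(complete_req).IsSuccess();
}
```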
Tasks
This PR is just to get the SDK working, though to make it 'production ready' we would need:
CMake
So far I've just built and installed the SDK locally (which is why CI fails), though we would need to integrate that into Helio CMake properly.
We should also minimise what we need to build, such as S3 only and avoid CRT if we can.
HTTP Client
The HTTP client needs work to be more reliable, including handling redirects (which is a requirement to use S3 as it may redirect to a different endpoint, see docs) and better connection management, to avoid having to explicitly call `Connect` and instead have the client connect as needed (and manage multiple connections). (Right now this PR just reconnects per request to avoid managing connections.) We also need request and connection timeouts.
We also probably need to add HTTPS support. There is an HTTPS client in Helio, though it's currently separate from the HTTP client. Maybe the HTTP client factory can just return the correct client as needed (see the sketch below)...
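One possible shape for that factory, as a sketch: `HelioHttpClient` is the client sketched under Extensions above and `HelioHttpsClient` is a hypothetical TLS variant; the request construction mirrors what the SDK's built-in factories do.

```cpp
#include <aws/core/client/ClientConfiguration.h>
#include <aws/core/http/HttpClientFactory.h>
#include <aws/core/http/standard/StandardHttpRequest.h>

// Hands the SDK either the plain or the TLS Helio client depending on the
// configured scheme, and builds standard requests for both.
class HelioHttpClientFactory : public Aws::Http::HttpClientFactory {
 public:
  std::shared_ptr<Aws::Http::HttpClient> CreateHttpClient(
      const Aws::Client::ClientConfiguration& config) const override {
    if (config.scheme == Aws::Http::Scheme::HTTPS) {
      return Aws::MakeShared<HelioHttpsClient>("helio-http");  // hypothetical TLS client
    }
    return Aws::MakeShared<HelioHttpClient>("helio-http");  // client sketched above
  }

  std::shared_ptr<Aws::Http::HttpRequest> CreateHttpRequest(
      const Aws::String& uri, Aws::Http::HttpMethod method,
      const Aws::IOStreamFactory& stream_factory) const override {
    return CreateHttpRequest(Aws::Http::URI(uri), method, stream_factory);
  }

  std::shared_ptr<Aws::Http::HttpRequest> CreateHttpRequest(
      const Aws::Http::URI& uri, Aws::Http::HttpMethod method,
      const Aws::IOStreamFactory& stream_factory) const override {
    auto request = Aws::MakeShared<Aws::Http::Standard::StandardHttpRequest>(
        "helio-http", uri, method);
    request->SetResponseStreamFactory(stream_factory);
    return request;
  }
};

// Registered before Aws::InitAPI(options):
//   options.httpOptions.httpClientFactory_create_fn = [] {
//     return Aws::MakeShared<HelioHttpClientFactory>("helio-http");
//   };
```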
File System
As described above, the SDK may make blocking system calls to read the local credentials and config. It will only load config once on startup so that's not an issue, though it may reload local credentials again when they expire.
Given it's likely less than 1ms to read the file, and it only happens when an S3 upload/download occurs so it won't affect 'normal' operation, we may accept this for now. Or disable loading credentials from the local filesystem (so only support environment variables and EC2 metadata).
Credentials
This PR so far only supports loading credentials from the environment. We need to support EC2 metadata, which should be fine as it should use the custom Helio HTTP client. We should also disable default credential providers like 'process', which forks a separate process and pipes the credentials back (see the sketch below).
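A possible shape for a restricted provider chain, as a sketch (class and allocation-tag names are made up):

```cpp
#include <aws/core/auth/AWSCredentialsProvider.h>
#include <aws/core/auth/AWSCredentialsProviderChain.h>

// A restricted chain: environment variables first, then EC2 instance metadata.
// Skips the 'process' provider and (for now) the shared credentials file, so
// nothing here forks a process or reads local files.
class HelioCredentialsProviderChain : public Aws::Auth::AWSCredentialsProviderChain {
 public:
  HelioCredentialsProviderChain() {
    AddProvider(Aws::MakeShared<Aws::Auth::EnvironmentAWSCredentialsProvider>("helio-creds"));
    AddProvider(Aws::MakeShared<Aws::Auth::InstanceProfileCredentialsProvider>("helio-creds"));
  }
};

// The S3 client would then be constructed with this chain instead of the
// default provider chain (the exact constructor overload varies by SDK version).
```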
Optimise
Right now, just to get something working as easily as possible, the HTTP client does more copies than it needs when constructing a Boost request from an AWS request and an AWS response from a Boost response; this should be avoidable.
We can benchmark the performance and resource usage using the SDK compared to the existing Helio implementation.
Can also look at tuning the part size etc.
(Running the PR so far takes 2 minutes to take a 50GB snapshot on an `m5.4xlarge` (16 vCPU, 60GB memory).)
Testing
Since we're using the SDK we won't need to test too much, though we should at least verify that reconnects and retries work as expected, and that the different supported credential providers work.
Verify Non-Blocking
As a final check we should verify the performance of Dragonfly is unaffected when no S3 operations are running, and compare the performance of Dragonfly while S3 uploads are happening.
Download Support
So far I've only added upload support, not download.
Dragonfly Integration
For testing I've added a hacky integration, though this needs to be done properly.