S3-Store: Slow transfer-speed in some environments

runar-indico commented 4 years ago

Question For a customer running Netapp StorageGrid as the s3-solution, we see slow transfer-speeds, around 2.5 Mb/sec. Same speed both for chunked uploads, as well as non-chunked uploads.

This is within the local network, so we would expect a lot faster transfers. Other applications running in the same network and using the same s3-bucket achieve the expected speeds.

In other environments, we see great speed with tusd, so I am not really sure what the problem might be.

Do you have any tips for how we might go about debugging the slow transfer-speeds? Or any hunches about what might be the cause?

In #344 it was mentioned that there were some ideas as to how to increase the efficiency of the transfers, without going into any details.

If you are able to share some of the ideas, I might be able put in some time to implement them and contribute to this project.

Setup details

Please provide following details, if applicable to your situation:

Operating System: Container-less destribution (linux), Kubernetes-environment.
Used tusd version: 1.1.0
Used tusd data storage: S3-compatible (specifically a NetApp StorageGrid

Used tusd configuration: custom-build, but here is the relevant s3-config used:

s3 := s3store.S3Store{
    Bucket:            bucket,
    Service:           service,
    ObjectPrefix:      objectPrefix,
    MaxPartSize:       5 * 1024 * 1024 * 1024,
    MinPartSize:       5 * 1024 * 1024,
    MaxMultipartParts: 10000,
    MaxObjectSize:     5 * 1024 * 1024 * 1024 * 1024,
    BaseConfig:        base,
}

Acconut commented 4 years ago

Thanks for bringing this up. Apologies for the delay as it took me some time to write this together.

Other applications running in the same network and using the same s3-bucket achieve the expected speeds.

Just as a reference, what are the "expected speeds"?

In other environments, we see great speed with tusd, so I am not really sure what the problem might be.

What do you mean with "other environments"? Do they use different storage backends?

Do you have any tips for how we might go about debugging the slow transfer-speeds? Or any hunches about what might be the cause?

Yes, indeed. Before describing these, I just want to quickly go over the current S3Store implementation: Basically, we always use S3 Multipart Uploads for transferring the data from tusd to S3. This means that tusd breaks the incoming data stream into multiple so called parts, each of which is uploaded to S3 individually. Once all parts have been uploaded, we tell S3 to concatenate all those parts together into a single file (S3 calls that "completing the multipart upload"). This strategy is nice because it allows to partially upload a file to S3 and later resume the upload. That enables the resuming uploads that tusd is all about.

You may already see a problem with this. The current S3Store implementation aims that every part is 5MB in size (the minimum part size for AWS S3). Since uploading a single part takes an additional HTTP request, it should be in our interest to reduce the number of parts we need to upload to S3. I can think of following improvements for upload speed right now:

Do not use Multipart Uploads for small files: If you have a small file (say less than 5MB), tusd will issue three HTTP requests to transfer it to S3: Create a multipart upload, upload the single part, complete the multipart upload. However, for those small files it may be faster to use the traditional S3 "put object" method to upload the entire file in one go. Of course, we still have to talk about resumability in that case.
Increase the part size: tusd has a method to calculate the optimal part size (please read through https://github.com/tus/tusd/blob/master/pkg/s3store/s3store.go#L864) but for files smaller than 50GB, it will always use a part size of 5MB. So if you upload a 500MB file, it will issue 100 HTTP requests just for the parts alone. Maybe for these situations is would be better to allow the part size to get bigger, so less requests are needed.
Upload data to S3 while receiving data from client: Right now, the S3Store will execute following loop for uploading data: Read ~5MB from the client, save it to disk and upload that part to S3 (see the loop in https://github.com/tus/tusd/blob/master/pkg/s3store/s3store.go#L318-L376). However, these actions are done sequentially, so whenever we upload a part to S3, no data is received from the client (the kernel will buffer some data for us but that's not a lot). So I assume that reading data and uploading it in parallel would bring a nice speed benefit for us.

Do these ideas make sense to you or should I explain something in more detail? Of course, all of these items are just assumption and their speed impact must be measured before releasing and recommending them.

runar-indico commented 4 years ago

Just as a reference, what are the "expected speeds"?

I can only assume around Gigabit-transfer-speeds, possibly limited by disk I/O. I don't have any exact numbers, since I cannot access the network myself. But it is a local enterprice-network.

What do you mean with "other environments"? Do they use different storage backends?

Other environments would be other installations, on other networks. Currently, we have just used Minio and S3.

Thank you very much for your suggestions, greatly appreciated.

Do not use Multipart Uploads for small files [...]

Did not seem to have any noticeable effects.

Increase the part size.

Did not seem to have any noticeable effects.

Upload data to S3 while receiving data from client [...]

This sounds like something that we could run a few benchmarks on. I wonder if this might be the case for the customer. The customer runs Kubernetes, but has not assigned a storage for these temporary files. Maybe the files would then be stored on a very slow disk, or off-site somewhere.

I'll try to limit the disk-IO here locally, to simulate such a situation. I'll then try to see if I can this improvement working and a PR for it.

Acconut commented 4 years ago

Did not seem to have any noticeable effects.

Can you share more details on how you tested this hypothesis and came to the conclusion that it does not have noticeable effects?

Maybe the files would then be stored on a very slow disk, or off-site somewhere.

I don't think disk IO is a problem. Regardless of your disk performance, there is a time during the upload where tusd does not accept data from the client because tusd transfers chunks to S3. If you have such pauses, your upload speed will degrade. The problem is not a slow disk but the sequential loop where reading data from the client and transferring it to S3 are not occurring in parallel. Do you understand what I mean?

runar-indico commented 4 years ago

Can you share more details on how you tested this hypothesis and came to the conclusion that it does not have noticeable effects?

The customer did a simple test for us, uploading first a 1.5 gb file, where chunking was set to 5 MiB. Then upload the same file, but without using chunking at all. The transfer-speed for both cases was about the same.

. The problem is not a slow disk but the sequential loop where reading data from the client and transferring it to S3 are not occurring in parallel. Do you understand what I mean?

Yes, I believe so. I understand that it would essentially switch between receiving the chunk from the client, and then transfer that chunk to s3 before continuing, and this might cause a pause.

I was thinking that if reading and writing to the temporary file was slow, this might increase the pause-time. I'll check the logs from the customer to see if there are any noticeable pauses between the chunks.

Acconut commented 4 years ago

The customer did a simple test for us, uploading first a 1.5 gb file, where chunking was set to 5 MiB. Then upload the same file, but without using chunking at all. The transfer-speed for both cases was about the same.

I was not referring to the chunk size settings on the client but instead talking about tusd's internal chunk handling before uploading to S3. You cannot test this hypothesis without modifying tusd's code since this internal chunking in tusd cannot be currently disabled.

I was thinking that if reading and writing to the temporary file was slow, this might increase the pause-time. I'll check the logs from the customer to see if there are any noticeable pauses between the chunks.

Yes, if the temporary file system is very slow it will cause delays but I doubt that disk IO makes significant contribution to upload performance.

runar-indico commented 4 years ago

I was not referring to the chunk size settings on the client but instead talking about tusd's internal chunk handling before uploading to S3

Oh, sorry. I misunderstood.

I could run some tests with

Increase s3Store.MinPartSize to say 5gb for now. Ignoring resumability for now.
Check if the current chunk is smaller than s3Store.MinPartSize, and if it is, use PutObjectWithContext.

Does this sound reasonable?

Acconut commented 4 years ago

Yes, exactly. You understood it correctly now :)

ivanpk commented 4 years ago

I could run some tests with

1. Increase `s3Store.MinPartSize` to say 5gb for now. Ignoring resumability for now.

2. Check if the current chunk is smaller than `s3Store.MinPartSize`, and if it is, use PutObjectWithContext.

Does this sound reasonable?

@runar-indico , did you ever have a chance to run these tests? What were the findings, if so?

runar-indico commented 4 years ago

Sorry, I’ve been very busy lately, and have not gotten around to it.

acj commented 4 years ago

We've also noticed unpredictable latency when tusd calls the AWS S3 API, going back to our initial deployment of tusd in 2018, and it seems to be the root cause of these performance issues. Modifying tusd to measure and log the latency per call showed that it can take anywhere from ~60ms to ~5000ms (!) for a 5MB part, but typically in the 100-300ms range. This is for tusd instances running in a Kubernetes cluster with Canal networking. The cluster is built using EC2 instances in AWS's us-east-1 region, and tusd makes direct calls to the S3 service. I've confirmed that the latency profile is the same when tusd is deployed with a vanilla config on "bare" EC2 instances without Kubernetes. If other S3-compatible services have similar latency, then I would expect to see similar throughput degradation.

Upload data to S3 while receiving data from client: Right now, the S3Store will execute following loop for uploading data: Read ~5MB from the client, save it to disk and upload that part to S3 (see the loop in https://github.com/tus/tusd/blob/master/pkg/s3store/s3store.go#L318-L376). However, these actions are done sequentially, so whenever we upload a part to S3, no data is received from the client (the kernel will buffer some data for us but that's not a lot). So I assume that reading data and uploading it in parallel would bring a nice speed benefit for us.

Yes, reading and uploading in parallel can significantly improve upload performance for tusd deployments that use S3Store. We've experimented with a few approaches to uploading parts in parallel -- initially, using goroutines in a naive way to offload the UploadPartWithContext call and allow the request goroutine to continue receiving data from the client. This is risky and can make an upload fail in some scenarios, but it was enough to demonstrate that clients' upload throughput improved significantly. A teammate on a residential fiber connection saw an increase from ~25mbps to ~100mbps.

More recently, we've implemented a version of WriteChunk that uses a producer-consumer pattern (decoupling client reads from S3 part uploads, with a configurable buffer) like you mentioned. It hasn't been production-tested yet due to other priorities, but I would be willing to open a draft PR in the meantime so that we can test and iterate on it together. What do you think, @Acconut?

Acconut commented 4 years ago

Apologies for the late reply, @acj! Thank you very much for the insights and help!

More recently, we've implemented a version of WriteChunk that uses a producer-consumer pattern (decoupling client reads from S3 part uploads, with a configurable buffer) like you mentioned.

That's amazing!

It hasn't been production-tested yet due to other priorities

We have the master.tus.io instance which can also be used for a semi-production test.

I would be willing to open a draft PR in the meantime so that we can test and iterate on it together. What do you think, @Acconut?

That's a good plan! I am very happy to assist you with it!

segevfiner commented 4 years ago

A lot of the performance of S3 for larger uploads (When bandwidth/disk speeds allow) is dependent on uploading multiple parts of the file in parallel. As it stands with the current TUS protocol, parallelization can only be achieved server side by buffering the upload, the client can't upload multiple parts in parallel to the same upload. The concatenation extension is about concatenating multiple disjoint uploads as far as I can tell, which is something GCS supports fast, but in S3 is just a server side copy and concatenation of the entire uploaded data. How much a vanilla multi-part upload via the current upload code in tusd compares is something that will have to be measured to figure out how much of an impact this makes, taking network conditions and disk speeds into account of course.

P.S. I previously had to abandon tus/tusd due to performance issues. One of them is the current requirement of uploading to a temporary file and than having to move the upload to it's final destination. That's totally acceptable for local FS, but is not how you are supposed to use S3 and is unacceptably slow on it (Often times 100% slower, doubling the time it takes for an upload), instead there it is expected to relay on the fact that an upload is only committed to S3 once it is fully finished (The single upload is done or the multi part upload is finished by a specific API request). Unlike the issue above, I believe that's not actually a requirement of the TUS protocol, but rather just a shortcoming of the tusd implementation.

Acconut commented 4 years ago

I am sorry to read that you were having issues with tusd due to poor performance. Did I understand your comment correctly that you were saying there is a better way to implement the Concatenation extension on S3?

segevfiner commented 4 years ago

I'm not actually sure if that is possible with the existing concatenation extension.

The way you are supposed to do concurrent uploads of a single object in standard S3 is to create a single multi-part upload, and upload separate parts in parallel. (I think you can do the same in other clouds object stores, but not sure).

The way the current concatenation extension works with S3 is creating separate uploads and then stitching them all in the end using a server side copy (UploadPartCopy if I'm not mistaken). That is significantly slower than just a single multi-part upload.

GCP does include the ability to concatenate objects server side fast, so a concatenation extension like the current one might be useful there, but that's only for GCP. (I'm not sure if it has a different variation of concurrent uploads then using the concatenation API).

Acconut commented 4 years ago

Ah, ok, thank you for the explanation. It now makes sense to me what you meant. The underlying problem is that the tus server does not know in advance which tus upload will be concatenated, so we cannot take full advantage of AWS S3 performance characteristics. Maybe we need a better approach in the tus protocol in the future.

segevfiner commented 4 years ago

Multi part uploads similar in nature to S3 are also supported by the GCS resumable upload API: https://cloud.google.com/storage/docs/performing-resumable-uploads. I'm not sure whether the SDKs or CLIs utilize them by default.

Azure Storage block blobs are divided into blocks that are uploaded and committed separately in the first place. Not sure that API is usable with the existing concatenation extension either, as those blocks are associated with a specific blob, unlike the concatenation extension of TUS or the concatenation API of the GCP storage service, that is about concatenating separate objects/blobs.

tus / tusd

S3-Store: Slow transfer-speed in some environments #379