Parallel parts upload - Githubissues

lschweiss commented 11 years ago

Let me start by saying awesome piece of work. Saved me a lot of coding.

Uploading to glacier I'm getting about 1.2 MBytes/S. My limit is the single TCP stream. Using utilities like bbcp (http://www.slac.stanford.edu/~abh/bbcp/), I can push 33MBytes/S in 50 parallel streams to an EC2 instance.

By buffering or caching a configurable number of parts, parts can be sent in parallel.

wvmarle commented 11 years ago

I'm sure it can be done but complicates especially the upload resumption (it is not guaranteed any more that finished parts are consecutive).

Upload from stdin in multiple parallel sessions may be really tricky to accomplish, as it'd require buffering data to tmp file first. Resumption is getting even trickier in that case as stdin can be read sequentially only.

To improve speed, try increasing your part size, that may help a lot. I've found that a few commits back (see issue #71) download speeds suddenly went down the drain, particularly for small blocks. I haven't figured out why, it appears that the time to start up sending of the next block is long. So less and bigger blocks drastically improve upload speeds. Try 32 MB or 128 MB; sounds like you've got a pretty fast connection so can quickly see what happens.

I see vast differences between small and large blocks: <300 kB/s on 1MB blocks, 1.5-1.7 MB/s on 32M blocks (at a 20 Mbit connection). And looking at actual transfers with iftop my pipe is being saturated, it's the delay between blocks that's keeping overall speeds down.

lschweiss commented 11 years ago

Sorry, but I'm just getting familiar with the code. Are you talking about part size or a block size that is hard coded? I've been testing with 128 MB part size.

Right now my use may be a bit of a corner case. I'm working on pushing many TBs of medical imaging research data to Glacier and 1.6 MB/s is now the fastest I've seen with this code. That math doesn't work out very nice.

We've been through extensive network optimizations and are limited by TCP itself in this case. We have a 2 Gb/s connection to the net and will have 10 Gb/s next year. Using UDT we've pushed 1 Gb/s across the Atlantic only limited by the lan connection at the other end.

I suggest the parallel parts because I've seen great improvements (30x) pushing streams to EC2 instances using bbcp which open many parallel TCP streams.

I'd dive in myself and work on this, but my experience in Python is close to 0. Doesn't mean I won't eventually dive in and learn the language.

For now I will try using bbcp to an EC2 instance that will run glacier-cmd. The network proximity should greatly speed this up.

As time passes and Glacier's use grows, I'm sure many other big data operations will be looking for this functionality.

wvmarle commented 11 years ago

Part size as in the size of chunks of data uploaded to Glacier. At 128 MB that shouldn't be the issue indeed. Any speed throttling going on?

Getting parallel connections working is an interesting challenge, no idea at the moment how to go about it. I guess have to go multi-threaded, assigning parts to upload to various independent upload threads. Hardest is to keep track of progress of all those separate uploads.

wvmarle commented 11 years ago

Started work on the parallel uploads - indeed have to go for multiprocessing, making progress updates a lot trickier.

wvmarle commented 11 years ago

parallel uploading works - now resume function is broken... and progress updates need serious work, not sure how I can get immediate feed-back from the worker processes yet, or another way to get a reasonably accurate transfer speed calculation. Interesting stuff.

lschweiss commented 11 years ago

Awesome. I'd be happy to test your code at any time.

wvmarle commented 11 years ago

I've got it working; posting code soon. The --stdin upload has an issue, that I have to hunt down. It doesn't work properly.

There is a big issue: very often I get a signature error from Glacier on my processes. That means the process that gets this error, dies. It also causes the script to hang, as it waits for all processes to finish their work, which of course is never going to happen because some crashed.

Why I get this, I don't know. I have the feeling it's Glacier that's making problems. It's also not only in these parallel sessions, recently I quite often get these errors while running normal tests, or doing my normal uploads. Maybe Glacier doesn't like me much with all those aborted multipart uploads...

Each session creates their own unique connection to Glacier, then starts uploading using the UploadId of the archive in question. This is how it's supposed to be done - and how it mostly works, but sometimes not. Use --resume (if you use bookkeeping) or --upload to restart your upload and all the missing parts should be filled in nicely.

wvmarle commented 11 years ago

Code here:

https://github.com/wvmarle/amazon-glacier-cmd-interface/tree/parallel_uploads

Includes the two open pull requests (for download resumption and upload resumption). Note: the documentation doesn't have these extra functions mentioned yet; use the --help switch to get a list of all available command line options.

Please let me know if there are any issues!

gawbul commented 10 years ago

@wvmarle I'm very interested in this! How mature is this code now? I'm happy to test stuff! The uni I work for have a 10Gbps link I can test pushing a few TBs of data to Glacier with ;)

@lschweiss, thanks for the heads up on bbcp too. Any other useful parallel transfer tools you're aware of that are compatible with Glacier?

uskudnik / amazon-glacier-cmd-interface

Parallel parts upload #87