uskudnik / amazon-glacier-cmd-interface

Command line interface for Amazon Glacier
MIT License

Add support for interrupted multi-part uploads #33

Open · gburca opened this issue 11 years ago

gburca commented 11 years ago

Currently if a multi-part upload fails for some reason, there's no way to continue uploading from where the previous upload left off. That's a problem for large archives.

In order to resume multipart uploads, the script would need to:

wvmarle commented 11 years ago

Very important one indeed. I suggest storing progress data in SimpleDB. Basically you need to store the file name, the block size used, the number of blocks successfully uploaded, and whether the upload is complete. This also allows for automatic resumption: if the user restarts the upload, the tool checks whether we already tried to upload this file, fetches the progress if so, and resumes from there.

Related: how about having upload check the bookkeeping db to see whether this file is uploaded already (checking for identical name/byte size should be good enough; a hash would make sure, but that takes really long for large files).

wvmarle commented 11 years ago

Let me elaborate a little on my idea:

  1. Start the upload-file procedure.
  2. Check whether a file with the same name and total size has an entry in the bookkeeping db.
    1. No entry: create one in the bookkeeping db (file name, total size, chunk_count = 0, ...).
    2. Entry exists: fetch the number of uploaded chunks (chunk_count) from the db.
  3. Take the next chunk of the file (indexed by chunk_count).
  4. Upload the chunk and check the returned hash for success.
  5. On success: update the entry in the db, increase chunk_count by 1, go to 3.
  6. Upload finished; record this in the db.

The chance of two different files having the same name and size is, I think, small enough that we can ask the user to take care of that themselves and use the --replace option for that situation (a rough sketch of this loop follows).
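The bookkeeping helpers (db_find, db_create, db_update) and the upload_chunk() callable in this sketch are hypothetical placeholders rather than existing glacier-cmd functions; only the control flow mirrors the numbered steps above.

```python
import os

def upload_with_bookkeeping(path, chunk_size, db_find, db_create, db_update, upload_chunk):
    name, total_size = os.path.basename(path), os.path.getsize(path)
    entry = db_find(name=name, size=total_size)                       # step 2
    if entry is None:
        entry = db_create(name=name, size=total_size, chunk_count=0)  # step 2.1
    chunk_count = entry['chunk_count']                                # step 2.2
    with open(path, 'rb') as f:
        f.seek(chunk_count * chunk_size)
        while True:
            data = f.read(chunk_size)                                 # step 3
            if not data:
                break
            if not upload_chunk(data, chunk_count):                   # step 4
                raise IOError("chunk %d failed; rerun to resume" % chunk_count)
            chunk_count += 1                                          # step 5
            db_update(entry, chunk_count=chunk_count)
    db_update(entry, complete=True)                                   # step 6
```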

gburca commented 11 years ago

We need to keep in mind 2 overall requirements for this feature:

  1. It should work without SimpleDB since that's an optional portion and not everyone will have it enabled.
  2. It should work with data coming from STDIN as well as from a file.

Your design has hard dependencies both on the database being there and on a file (rather than STDIN) being used.

I've outlined at the top a few simple steps required for this feature:

  1. Allow the user to get the uploadID for unfinished uploads
  2. Allow the user to pass the uploadID as an optional parameter to the "upload" subcommand
  3. Figure out (and transmit) the missing pieces when "upload" is called with an uploadID
  4. Allow the user to abort an unfinished upload

I've already added the listmultiparts subcommand. That takes care of #1. I've also added abortmultipart which takes care of #4. We still need to do #2. If the upload subcommand sees an uploadID, it means the user is resuming an upload, so we will need to:

  1. List the parts of an in-progress multipart upload. There's an API for getting that information straight from Glacier. It takes the uploadID provided by the user as input. No need to depend on SimpleDB. See: http://docs.amazonwebservices.com/amazonglacier/latest/dev/api-multipart-list-parts.html
  2. Figure out the missing parts, and send them out (see the sketch below).
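A minimal sketch of step 1 and the start of step 2, assuming boto's low-level Glacier client (boto.glacier.layer1.Layer1, available since boto 2.6). The field names follow the ListParts response linked above; pagination via the Marker field is left out for brevity.

```python
def missing_byte_ranges(conn, vault_name, upload_id, total_size):
    """Return the (start, end) byte ranges Glacier reports it does not have yet."""
    resp = conn.list_parts(vault_name, upload_id)   # the ListParts call
    part_size = resp['PartSizeInBytes']
    # Parts come back as ranges like "0-4194303"; keep the start offsets we already have.
    have = set(int(p['RangeInBytes'].split('-')[0]) for p in resp['Parts'])
    missing = []
    for start in range(0, total_size, part_size):
        if start not in have:
            missing.append((start, min(start + part_size, total_size) - 1))
    return missing
```

Here conn would be something like boto.glacier.layer1.Layer1(aws_access_key_id=..., aws_secret_access_key=...); the missing ranges can then be read from the source and re-sent.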

I would shy away from making any assumptions such as the filename being unique (or present), or of there being no need to upload multiple archives with the same name, etc... Unless the underlying service (AWS Glacier) imposes a limitation, the tool should not impose its own arbitrary ones.

gburca commented 11 years ago

For error handling, there's no need to validate the uploadID since we'll get a failure from AWS if it's invalid and can pass that straight to the user.

We should, however, probably use the SHA256s returned by AWS when we get the list of already-uploaded parts, and verify that they match the data the user is providing for the resumption attempt. Keep in mind you can't seek STDIN at random, so if AWS says it has parts 1, 2, 5, 6, there's no way to verify the SHA256 for parts 5 and 6 before uploading parts 3 and 4 (or saving them to some temporary location?), because you have no way of seeking backwards.
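A sketch of that sequential check for a non-seekable source. Here tree_hash_of() and send_part() are placeholder assumptions, and uploaded_hashes is a dict of {start_offset: SHA256 tree hash} built from the ListParts response.

```python
def resume_from_stream(stream, part_size, uploaded_hashes, tree_hash_of, send_part):
    # Read strictly sequentially (works for STDIN too): verify parts Glacier
    # already has, and upload only the ones it is missing.
    offset = 0
    while True:
        data = stream.read(part_size)
        if not data:
            break
        local_hash = tree_hash_of(data)
        if offset in uploaded_hashes:
            if uploaded_hashes[offset] != local_hash:
                raise ValueError("part at offset %d differs from the original upload" % offset)
        else:
            send_part(offset, data, local_hash)
        offset += len(data)
```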

wvmarle commented 11 years ago

For the multipart resume, a few things.

Now back to the uploadID: you cannot be sure of getting that from the user. I can easily think of two scenarios where the user doesn't have it, and there are likely more:

So we'll have to save it somewhere, if only to be nice to the user. If the user provides an uploadID, we still have to check the hashes of all the already-uploaded parts to make sure the upload went correctly and that the user is not presenting a different file than in the first upload. SimpleDB is an obvious place to store this uploadID, but it only makes sense in combination with actual file data (name/size), to improve the resumption options. A local file could also be a suitable option, but we have to find a good place to store such data (e.g. ~/.glacier-cmd/ - but that doesn't work for background processes like my own backup process, whose user is a system user without a home dir).

I agree there's no need to put too many requirements on the user; on the other hand, we can use any options that are available to improve functionality.

gburca commented 11 years ago

Now back to the uploadID: you cannot be sure of getting that from the user.

The listmultiparts sub-command I already added gives you the uploadID for all the ongoing uploads. The user can simply use that sub-command first to get the uploadID that he wants to resume.

gburca commented 11 years ago

BTW, I already added a list_parts() method to GlacierVault as part of an earlier commit. It takes as argument the uploadID (aka multipart_id). See: https://github.com/uskudnik/amazon-glacier-cmd-interface/blob/21f5005da01a8eb495878fabccc8ed647c92b927/glacier/glaciercorecalls.py#L147

gburca commented 11 years ago

currently upload sends parts sequentially ... So missing intermediate parts shouldn't happen.

That's assuming the user used this tool to create the failed upload.

wvmarle commented 11 years ago

Indeed I assume the user uses this tool; support for issues resulting from other tools is secondary to me, and low priority.

Most important, I think, would be to add support for resumption of interrupted uploads, again with minimal user intervention. So basically all the user has to do is:

$ glacier-cmd upload MyBigFile.tgz MyVault

It runs for a bit, then the computer has to shut down and the upload is aborted. Sometime later, the user issues the same command again; the upload automatically checks the checksums of the already-uploaded blocks and then resumes where it was. No need to bother with figuring out an uploadID or whatever - that should be transparent to the user. Even better: upon running, the tool checks for aborted uploads and asks the user whether to resume them.

wvmarle commented 11 years ago

And for those using background tasks: a general "resume all aborted uploads" command is great for servers that can run this upon boot-up, so in case there is an aborted upload it will automatically continue the task. Or for home users that need days to upload their photo archive to Glacier over dead-slow ADSL upstream and don't want to keep their computers running all the time.

offlinehacker commented 11 years ago

That would be awesome, but I think auto-resume features should require SimpleDB.

gburca commented 11 years ago

Strictly speaking, all you really need in order to resume an upload is the original archive/data and the uploadID. Agreed?

SimpleDB would only be needed to support higher-level, user-friendly features.

I suggest we break this down into two (or more) separate issues. Let's use this issue to discuss and address the core functionality, and open new issues for the user-friendly features that can be built on top of the core functionality.

wvmarle commented 11 years ago

Strictly speaking, all you really need in order to resume an upload is the original archive/data and the uploadID. Agreed?

Sure, but you have to store the ID somewhere.

I still think it's not realistic to ask the user to keep track of the uploadid - they would also have to store it somewhere. The common way of checking whether a file is the same is name/size/last modification time; if those match, we assume the file is the same, which can then be verified by the much more expensive hash test. If any of those differ, the file is different and needs re-uploading.

Glacier itself has too little info for this, though one can argue that if the file name is the same (and we do have that in Glacier) the file is likely the same, and then we can at least start a hash comparison.
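For what it's worth, that cheap check is tiny; a minimal sketch, assuming a stored record with hypothetical name/size/mtime fields:

```python
import os

def looks_like_same_file(path, record):
    # Cheap identity check on name, size and last-modification time; only if
    # all three match do we bother with the expensive hash comparison.
    st = os.stat(path)
    return (os.path.basename(path) == record['name']
            and st.st_size == record['size']
            and int(st.st_mtime) == record['mtime'])
```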

offlinehacker commented 11 years ago

Well, why not: if the user does not want to use SimpleDB, it's reasonable for them to keep any information needed later. And I agree to keep the discussion here limited to the core, and only talk about what we can do without any local/remote kind of metastorage.

gburca commented 11 years ago

@wvmarle Sure, but you have to store the ID somewhere.

No, you don't. In fact storing it locally (or in SimpleDB) leads to other issues. Let me repeat what I said earlier:

The listmultiparts sub-command I already added gives you the uploadID for all the ongoing uploads. The user can simply use that sub-command first to get the uploadID that he wants to resume.

Is there a reason we can't retrieve that information from Glacier as indicated above?

wvmarle commented 11 years ago

I'll give it a try soon. I am already planning to completely overhaul the upload function (in the glacier_lib branch), as I think it's really poorly written now. Particularly the amount of memory it uses is a big issue, and needs to be solved. Basically what I plan to do:

offlinehacker commented 11 years ago

Can you fix this directly in boto, since the idea is to migrate there one day? Just fork boto/boto, fix it there, and try to merge it into their repo; they will be very happy to have a better implementation. Thanks!

wvmarle commented 11 years ago

(edit: I should look more at sources before commenting; pulling stuff over from boto doesn't seem feasible)

For now it seems that boto has their own Glacier upload tool as part of their general set of AWS tools, and that glacier-cmd uses boto only to connect to the Glacier service and the SimpleDB service. For the rest it's independent of their project.

I just looked a bit closer at boto and found they recently released 2.6.0, which includes Glacier support - so probably no need for the development version of boto to make glacier-cmd work. I haven't tried that out yet.

Have you guys had any contact with the boto developers? How serious is this merge idea?

Some general considerations:

  1. boto is primarily a library, though they apparently provide a simple front-end too. And it really is a simple front-end - just 123 LOC, not doing much. See https://github.com/boto/boto/blob/develop/bin/glacier
  2. glacier-cmd has a library (GlacierWrapper and glaciercorecalls) where glaciercorecalls could be replaced by boto's implementation of these calls. GlacierWrapper adds another level of abstraction on top, plus integration with SimpleDB, which could make vault management a lot easier (if someone cares to write a GUI using this wrapper).

uskudnik commented 11 years ago

Well, I was in contact with them and they were interested, but nothing has been done in this direction yet. I was planning a more detailed talk once I migrate the core calls to boto itself and we have a stable core feature set (1.0-ish if you like).

Yes, they have their own simple tools, which already caused us some problems, but that was resolved by renaming our utility to glacier-cmd. Whether or not they will prefer our greater feature set (SimpleDB integration being the primary addition in my mind) remains to be seen...

wvmarle commented 11 years ago

Got multipart resumption working.

A bit of a struggle, but then suddenly it just worked. So I can resume my upload that blocked last night :-)

In the process I've ditched most of the existing glaciercorecalls and am calling boto directly. For some reason the pagination marker just didn't work in glaciercorecalls.

It's in my own development branch; I hope to merge it back soon, after I've checked that the rest of the functions still work (keyword mismatches and such).

I've now implemented the --uploadid switch for the upload command, to indicate that this is a resumption of a multipart upload. The user will have to hunt down the uploadid via the listmultiparts option.

I've also added a --resume switch, which is not implemented yet; this is supposed to be a bit smarter, using SimpleDB to recover the uploadid. There is no way to guess an uploadid by file name, as Glacier doesn't store file names.
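A rough idea of how the --resume lookup could query SimpleDB through boto. The domain name and the filename/size/uploadid/complete attributes are assumptions for illustration only, not the actual bookkeeping schema.

```python
import boto

def find_upload_id(domain_name, filename, size):
    # Look for an unfinished upload of this file in the bookkeeping domain.
    sdb = boto.connect_sdb()
    domain = sdb.get_domain(domain_name)
    query = ("select * from `%s` where filename = '%s' and size = '%s' "
             "and complete = 'False'" % (domain_name, filename, size))
    for item in domain.select(query):
        return item['uploadid']
    return None
```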

gburca commented 11 years ago

See my comment on issue #69 for why the pagination marker is no longer working...

wvmarle commented 11 years ago

May try that as well. In the meantime I'm using Boto's direct calls - on the upside, it massively simplifies the code; it even simplifies calls from GlacierWrapper, as the GlacierVault subclass is not needed any more.

So what's the way forward?

a) further integrate with boto and make GlacierWrapper call Boto directly, or

b) become independent from Boto by lifting their AWS authorisation code or rolling our own?

I'd vote for the first - less work!

offlinehacker commented 11 years ago

I'd vote for the first too. If any additional functionality that boto hasn't thought of needs to be implemented, we make a fork and point glacier-cmd's dependencies there until boto merges our changes.

wvmarle commented 11 years ago

OK, will continue along that route then. Quite a few changes to make, as many things work subtly differently (different argument names, slightly different responses, different errors to catch).

Major extra functionality present in glacier-cmd that I can think of:

And I'm considering a function that takes the tree hash of files the user wants to upload, checks it against the bookkeeping, and prints a warning if the file is in Glacier already. That prevents double uploads and allows a user to easily check whether a file is already in their vault (it needs the bookkeeping, or a current inventory, to be available; maybe restrict it to requiring bookkeeping).
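For reference, a sketch of the SHA256 tree hash computation (1 MiB leaf chunks, digests combined pairwise until one remains), following the algorithm described in the Glacier developer guide; it's illustrative, not the glacier-cmd implementation.

```python
import binascii
import hashlib

def sha256_tree_hash(fileobj, chunk_size=1024 * 1024):
    # Hash the data in 1 MiB chunks...
    hashes = []
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        hashes.append(hashlib.sha256(chunk).digest())
    if not hashes:
        hashes = [hashlib.sha256(b'').digest()]
    # ...then combine digests pairwise until a single one remains, carrying an
    # odd leftover digest up to the next level unchanged.
    while len(hashes) > 1:
        paired = [hashlib.sha256(a + b).digest()
                  for a, b in zip(hashes[0::2], hashes[1::2])]
        if len(hashes) % 2:
            paired.append(hashes[-1])
        hashes = paired
    return binascii.hexlify(hashes[0])
```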

Planned new functions, also not in Boto:

gburca commented 11 years ago

And I'm considering a function that takes the tree hash of files the user wants to upload, checks it against the bookkeeping, and prints a warning if the file is in Glacier already.

That could become annoying in some cases. Think of a backup system that uploads daily deltas. On days when there are no changes, the deltas could be identical, yet they would still need to be uploaded to maintain a complete backup set. A user would probably not want to see warnings in that case, and there are other scenarios where the same applies.

wvmarle commented 11 years ago

True. The file name is different in that case; we should check on that as well.

uskudnik commented 11 years ago

automatic download of archives and inventory retrieval using SNS notifications.

I doubt this will end up in boto's Glacier API - they have an SNS API anyway, so anyone programming with boto would use their SNS module.

How far along are you with porting the code to boto-only calls? I did a bit of hacking during the flight and planned to continue until I saw your comment. If you're half or mostly done I will just start working on SNS...

wvmarle commented 11 years ago

The port to Boto is as good as done. It's basically waiting for me to merge it back into my master.

For SNS notifications that archives are ready we should indeed be able to use Boto calls, but they won't add anything to link the two, I suppose. Just like they don't link Glacier and SimpleDB.