uskudnik / amazon-glacier-cmd-interface

Command line interface for Amazon Glacier
MIT License

Add support for interrupted multi-part uploads #33

Open · gburca opened this issue 11 years ago

gburca commented 11 years ago

Currently if a multi-part upload fails for some reason, there's no way to continue uploading from where the previous upload left off. That's a problem for large archives.

In order to resume multipart uploads, the script would need to:

wvmarle commented 11 years ago

Very important one indeed. I suggest storing progress data in SimpleDB. Basically you need to store the file name, the block size used, the number of blocks successfully uploaded, and whether the upload is complete. This also allows for automatic resumption: if the user restarts the upload, the tool checks whether we already tried to upload this file, fetches the progress if so, and resumes from there.

Related: how about having upload check the bookkeeping db to see whether this file is uploaded already (checking for identical name/byte size should be good enough; a hash would make sure, but that takes really long for large files).

wvmarle commented 11 years ago

Let me elaborate a little on my idea:

  1. Start the upload-file procedure.
  2. Check whether a file with the same name and total size has an entry in the bookkeeping db.
    1. No entry: create one in the bookkeeping db (file name, total size, chunk_count = 0, ...).
    2. Entry exists: fetch the number of uploaded chunks (chunk_count) from the db.
  3. Take the next chunk of the file (indexed by chunk_count).
  4. Upload the chunk and check the returned hash for success.
  5. On success: update the entry in the db, increase chunk_count by 1, go to 3.
  6. Upload finished; record this in the db.

The chance of two different files having the same name and size is, I think, small enough that we can ask the user to take care of that themselves and use the --replace option for that situation (a rough sketch of this loop follows).
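The bookkeeping helpers (db_find, db_create, db_update) and the upload_chunk() callable in this sketch are hypothetical placeholders rather than existing glacier-cmd functions; only the control flow mirrors the numbered steps above.

```python
import os

def upload_with_bookkeeping(path, chunk_size, db_find, db_create, db_update, upload_chunk):
    name, total_size = os.path.basename(path), os.path.getsize(path)
    entry = db_find(name=name, size=total_size)                       # step 2
    if entry is None:
        entry = db_create(name=name, size=total_size, chunk_count=0)  # step 2.1
    chunk_count = entry['chunk_count']                                # step 2.2
    with open(path, 'rb') as f:
        f.seek(chunk_count * chunk_size)
        while True:
            data = f.read(chunk_size)                                 # step 3
            if not data:
                break
            if not upload_chunk(data, chunk_count):                   # step 4
                raise IOError("chunk %d failed; rerun to resume" % chunk_count)
            chunk_count += 1                                          # step 5
            db_update(entry, chunk_count=chunk_count)
    db_update(entry, complete=True)                                   # step 6
```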

gburca commented 11 years ago

We need to keep in mind 2 overall requirements for this feature:

  1. It should work without SimpleDB since that's an optional portion and not everyone will have it enabled.
  2. It should work with data coming from STDIN as well as from a file.

Your design has hard dependencies both on the database being there and on a file (rather than STDIN) being used.

I've outlined at the top a few simple steps required for this feature:

  1. Allow the user to get the uploadID for unfinished uploads
  2. Allow the user to pass the uploadID as an optional parameter to the "upload" subcommand
  3. Figure out (and transmit) the missing pieces when "upload" is called with an uploadID
  4. Allow the user to abort an unfinished upload

I've already added the listmultiparts subcommand. That takes care of #1. I've also added abortmultipart which takes care of #4. We still need to do #2. If the upload subcommand sees an uploadID, it means the user is resuming an upload, so we will need to:

  1. List the parts of an in-progress multipart upload. There's an API for getting that information straight from Glacier. It takes the uploadID provided by the user as input. No need to depend on SimpleDB. See: http://docs.amazonwebservices.com/amazonglacier/latest/dev/api-multipart-list-parts.html
  2. Figure out the missing parts, and send them out (see the sketch below).
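A minimal sketch of step 1 and the start of step 2, assuming boto's low-level Glacier client (boto.glacier.layer1.Layer1, available since boto 2.6). The field names follow the ListParts response linked above; pagination via the Marker field is left out for brevity.

```python
def missing_byte_ranges(conn, vault_name, upload_id, total_size):
    """Return the (start, end) byte ranges Glacier reports it does not have yet."""
    resp = conn.list_parts(vault_name, upload_id)   # the ListParts call
    part_size = resp['PartSizeInBytes']
    # Parts come back as ranges like "0-4194303"; keep the start offsets we already have.
    have = set(int(p['RangeInBytes'].split('-')[0]) for p in resp['Parts'])
    missing = []
    for start in range(0, total_size, part_size):
        if start not in have:
            missing.append((start, min(start + part_size, total_size) - 1))
    return missing
```

Here conn would be something like boto.glacier.layer1.Layer1(aws_access_key_id=..., aws_secret_access_key=...); the missing ranges can then be read from the source and re-sent.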

I would shy away from making any assumptions such as the filename being unique (or present), or of there being no need to upload multiple archives with the same name, etc... Unless the underlying service (AWS Glacier) imposes a limitation, the tool should not impose its own arbitrary ones.

gburca commented 11 years ago

For error handling, there's no need to validate the uploadID since we'll get a failure from AWS if it's invalid and can pass that straight to the user.

We should, however, probably use the SHA256s returned by AWS when we get the list of already-uploaded parts, and verify that they match the data the user is providing for the resumption attempt. Keep in mind you can't seek STDIN at random, so if AWS says it has parts 1, 2, 5, 6, there's no way to verify the SHA256 for parts 5 and 6 before uploading parts 3 and 4 (or saving them to some temporary location?), because you have no way of seeking backwards.
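A sketch of that sequential check for a non-seekable source. Here tree_hash_of() and send_part() are placeholder assumptions, and uploaded_hashes is a dict of {start_offset: SHA256 tree hash} built from the ListParts response.

```python
def resume_from_stream(stream, part_size, uploaded_hashes, tree_hash_of, send_part):
    # Read strictly sequentially (works for STDIN too): verify parts Glacier
    # already has, and upload only the ones it is missing.
    offset = 0
    while True:
        data = stream.read(part_size)
        if not data:
            break
        local_hash = tree_hash_of(data)
        if offset in uploaded_hashes:
            if uploaded_hashes[offset] != local_hash:
                raise ValueError("part at offset %d differs from the original upload" % offset)
        else:
            send_part(offset, data, local_hash)
        offset += len(data)
```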

wvmarle commented 11 years ago

For the multipart resume, a few things.

Now back to the uploadID: you cannot be sure of getting that from the user. I can easily think of two scenarios where the user doesn't have it, and there are likely more:

So we'll have to save it somewhere, if only to be nice to the user. If the user provides an uploadID, we still have to check the hashes of all the already-uploaded parts to make sure the upload went correctly and that the user is not presenting a different file than in the first upload. SimpleDB is an obvious place to store this uploadID, but it only makes sense in combination with actual file data (name/size), to improve the resumption options. A local file could also be a suitable option, but we have to find a good place to store such data (e.g. ~/.glacier-cmd/ - but that doesn't work for background processes like my own backup process, whose user is a system user without a home dir).

I agree there's no need to put too many requirements on the user; on the other hand, we can use any options that are available to improve functionality.

gburca commented 11 years ago

Now back to the uploadID: you cannot be sure of getting that from the user.

The listmultiparts sub-command I already added gives you the uploadID for all the ongoing uploads. The user can simply use that sub-command first to get the uploadID that he wants to resume.

gburca commented 11 years ago

BTW, I already added a list_parts() method to GlacierVault as part of an earlier commit. It takes as argument the uploadID (aka multipart_id). See: https://github.com/uskudnik/amazon-glacier-cmd-interface/blob/21f5005da01a8eb495878fabccc8ed647c92b927/glacier/glaciercorecalls.py#L147

gburca commented 11 years ago

currently upload sends parts sequentially ... So missing intermediate parts shouldn't happen.

That's assuming the user used this tool to create the failed upload.

wvmarle commented 11 years ago

Indeed I assume the user uses this tool; support for issues resulting from other tools is secondary to me, and low priority.

Most important, I think, would be to add support for resumption of interrupted uploads, again with minimal user intervention. So basically all the user has to do is:

$ glacier-cmd upload MyBigFile.tgz MyVault

It runs for a bit, then the computer has to shut down and the upload is aborted. Sometime later, the user issues the same command again; the upload automatically checks the checksums of the already-uploaded blocks and then resumes where it was. No need to bother with figuring out an uploadID or whatever - that should be transparent to the user. Even better: upon running, the tool checks for aborted uploads and asks the user whether to resume them.

wvmarle commented 11 years ago

And for those using background tasks: a general "resume all aborted uploads" command is great for servers that can run this upon boot-up, so in case there is an aborted upload it will automatically continue the task. Or for home users that need days to upload their photo archive to Glacier over dead-slow ADSL upstream and don't want to keep their computers running all the time.

offlinehacker commented 11 years ago

That would be awesome, but I think auto-resume features should require SimpleDB.

gburca commented 11 years ago

Strictly speaking, all you really need in order to resume an upload is the original archive/data and the uploadID. Agreed?

SimpleDB would only be needed to support higher-level, user-friendly features.

I suggest we break this down into two (or more) separate issues. Let's use this issue to discuss and address the core functionality, and open new issues for the user-friendly features that can be built on top of the core functionality.

wvmarle commented 11 years ago

Strictly speaking, all you really need in order to resume an upload is the original archive/data and the uploadID. Agreed?

Sure, but you have to store the ID somewhere.

I still think it's not realistic to ask the user to keep track of the uploadid - they would also have to store it somewhere. The common way of checking whether a file is the same is name/size/last modification time; if those match, we assume the file is the same, which can then be verified by the much more expensive hash test. If any of those differ, the file is different and needs re-uploading.

Glacier itself has too little info for this, though one can argue that if the file name is the same (and we do have that in Glacier) the file is likely the same, and then we can at least start a hash comparison.
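For what it's worth, that cheap check is tiny; a minimal sketch, assuming a stored record with hypothetical name/size/mtime fields:

```python
import os

def looks_like_same_file(path, record):
    # Cheap identity check on name, size and last-modification time; only if
    # all three match do we bother with the expensive hash comparison.
    st = os.stat(path)
    return (os.path.basename(path) == record['name']
            and st.st_size == record['size']
            and int(st.st_mtime) == record['mtime'])
```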

offlinehacker commented 11 years ago

Well, why not: if the user does not want to use SimpleDB, it's reasonable for them to keep any information needed later. And I agree to keep the discussion here limited to the core, and only talk about what we can do without any local/remote kind of metastorage.

gburca commented 11 years ago

@wvmarle Sure, but you have to store the ID somewhere.

No, you don't. In fact storing it locally (or in SimpleDB) leads to other issues. Let me repeat what I said earlier:

The listmultiparts sub-command I already added gives you the uploadID for all the ongoing uploads. The user can simply use that sub-command first to get the uploadID that he wants to resume.

Is there a reason we can't retrieve that information from Glacier as indicated above?

wvmarle commented 11 years ago

I'll give it a try soon. I am already planning to completely overhaul the upload function (in the glacier_lib branch), as I think it's really poorly written now. Particularly the amount of memory it uses is a big issue, and needs to be solved. Basically what I plan to do:

offlinehacker commented 11 years ago

Can you fix this directly in boto, since the idea is to migrate there one day? Just fork boto/boto, fix it there, and try to merge it into their repo; they will be very happy to have a better implementation. Thanks!

wvmarle commented 11 years ago

(edit: I should look more at sources before commenting; pulling stuff over from boto doesn't seem feasible)

For now it seems that boto has their own Glacier upload tool as part of their general set of AWS tools, and that glacier-cmd uses boto only to connect to the Glacier service and the SimpleDB service. For the rest it's independent of their project.

I just looked a bit closer at boto and found they recently released 2.6.0, which includes Glacier support - so probably no need for the development version of boto to make glacier-cmd work. I haven't tried that out yet.

Have you guys had any contact with the boto developers? How serious is this merge idea?

Some general considerations:

  1. boto is primarily a library, though they apparently provide a simple front-end too. And it really is a simple front-end - just 123 LOC, not doing much. See https://github.com/boto/boto/blob/develop/bin/glacier
  2. glacier-cmd has a library (GlacierWrapper and glaciercorecalls) where glaciercorecalls could be replaced by boto's implementation of these calls. GlacierWrapper adds another level of abstraction on top, plus integration with SimpleDB, which could make vault management a lot easier (if someone cares to write a GUI using this wrapper).

uskudnik commented 11 years ago

Well, I was in contact with them and they were interested, but nothing has been done in this direction yet. I was planning a more detailed talk once I migrate the core calls to boto itself and we have a stable core feature set (1.0-ish if you like).

Yes, they have their own simple tools, which already caused us some problems, but that was resolved by renaming our utility to glacier-cmd. Whether or not they will prefer our greater feature set (SimpleDB integration being the primary addition in my mind) remains to be seen...

wvmarle commented 11 years ago

Got multipart resumption working.

A bit of a struggle, but then suddenly it just worked. So I can resume my upload that blocked last night :-)

In the process I've ditched most of the existing glaciercorecalls and am calling boto directly. For some reason the pagination marker just didn't work in glaciercorecalls.

It's in my own development branch; I hope to merge it back soon, after I've checked that the rest of the functions still work (keyword mismatches and such).

I've now implemented the --uploadid switch for the upload command, to indicate that this is a resumption of a multipart upload. The user will have to hunt down the uploadid via the listmultiparts option.

I've also added a --resume switch, which is not implemented yet; this is supposed to be a bit smarter, using SimpleDB to recover the uploadid. There is no way to guess an uploadid by file name, as Glacier doesn't store file names.
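A rough idea of how the --resume lookup could query SimpleDB through boto. The domain name and the filename/size/uploadid/complete attributes are assumptions for illustration only, not the actual bookkeeping schema.

```python
import boto

def find_upload_id(domain_name, filename, size):
    # Look for an unfinished upload of this file in the bookkeeping domain.
    sdb = boto.connect_sdb()
    domain = sdb.get_domain(domain_name)
    query = ("select * from `%s` where filename = '%s' and size = '%s' "
             "and complete = 'False'" % (domain_name, filename, size))
    for item in domain.select(query):
        return item['uploadid']
    return None
```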

gburca commented 11 years ago

See my comment on issue #69 for why the pagination marker is no longer working...

wvmarle commented 11 years ago

May try that as well. In the meantime I'm using Boto's direct calls - on the upside, it massively simplifies the code; it even simplifies calls from GlacierWrapper, as the GlacierVault subclass is not needed any more.

So what's the way forward?

a) further integrate with boto and make GlacierWrapper call Boto directly, or

b) become independent from Boto by lifting their AWS authorisation code or rolling our own?

I'd vote for the first - less work!

offlinehacker commented 11 years ago

I'd vote for the first too. If any additional functionality that boto hasn't thought of needs to be implemented, we make a fork and point glacier-cmd's dependencies there until boto merges our changes.

wvmarle commented 11 years ago

OK, will continue along that route then. Quite a few changes to make, as many things work subtly differently (different argument names, slightly different responses, different errors to catch).

Major extra functionality present in glacier-cmd that I can think of:

And I'm considering a function that takes the tree hash of files the user wants to upload, checks it against the bookkeeping, and prints a warning if the file is in Glacier already. That prevents double uploads and allows a user to easily check whether a file is already in their vault (it needs the bookkeeping, or a current inventory, to be available; maybe restrict it to requiring bookkeeping).
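For reference, a sketch of the SHA256 tree hash computation (1 MiB leaf chunks, digests combined pairwise until one remains), following the algorithm described in the Glacier developer guide; it's illustrative, not the glacier-cmd implementation.

```python
import binascii
import hashlib

def sha256_tree_hash(fileobj, chunk_size=1024 * 1024):
    # Hash the data in 1 MiB chunks...
    hashes = []
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        hashes.append(hashlib.sha256(chunk).digest())
    if not hashes:
        hashes = [hashlib.sha256(b'').digest()]
    # ...then combine digests pairwise until a single one remains, carrying an
    # odd leftover digest up to the next level unchanged.
    while len(hashes) > 1:
        paired = [hashlib.sha256(a + b).digest()
                  for a, b in zip(hashes[0::2], hashes[1::2])]
        if len(hashes) % 2:
            paired.append(hashes[-1])
        hashes = paired
    return binascii.hexlify(hashes[0])
```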

Planned new functions, also not in Boto:

gburca commented 11 years ago

And I'm considering a function that takes the tree hash of files the user wants to upload, checks it against the bookkeeping, and prints a warning if the file is in Glacier already.

That could become annoying in some cases. Think of a backup system that uploads daily deltas. On days when there are no changes, the deltas could be identical, yet they would still need to be uploaded to maintain a complete backup set. A user would probably not want to see warnings in that case, and there are other scenarios where the same applies.

wvmarle commented 11 years ago

True. The file name is different in that case; we should check on that as well.

uskudnik commented 11 years ago

automatic download of archives and inventory retrieval using SNS notifications.

I doubt this will end up in boto's Glacier API - they have an SNS API anyway, so anyone programming with boto would use their SNS module.

How far along are you with porting the code to boto-only calls? I did a bit of hacking during the flight and planned to continue until I saw your comment. If you're half or mostly done I will just start working on SNS...

wvmarle commented 11 years ago

The port to Boto is as good as done. It's basically waiting for me to merge it back into my master.

For SNS notifications that archives are ready we should indeed be able to use Boto calls, but they won't add anything to link the two, I suppose. Just like they don't link Glacier and SimpleDB.