uskudnik / amazon-glacier-cmd-interface

Command line interface for Amazon Glacier
MIT License

Multipart uploads: can't get list of parts #69

Closed · wvmarle closed this 11 years ago

wvmarle commented 11 years ago

I have a problem fetching the parts list of an interrupted multipart upload. Whatever I try, I can only manage to get the first 50 chunks. I'm stuck at that point.

Neither the marker parameter (supposed to give the next page) nor the limit parameter seems to do anything. I tried limiting to fewer than 50: I get 50. I tried limiting to 100: still 50. This is a multipart upload that timed out some 880 parts in (at 10%).

I even hacked together a branch that uses boto's calls directly, and got the exact same results.

Any ideas?

Sorry no code as my branch is too messed up at the moment :-)

wvmarle commented 11 years ago

And now it suddenly works... calling boto directly (bypassing glaciercorecalls completely).
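
For the record, the direct-boto loop boils down to something like the sketch below. This assumes boto's Layer1.list_parts accepts marker and limit keyword arguments mirroring the ListParts REST call, and that the response carries 'Parts' plus a 'Marker' pointing at the next page; the function name is illustrative.

```python
import boto.glacier.layer1

conn = boto.glacier.layer1.Layer1()  # credentials come from the usual boto config

def all_parts(vault_name, upload_id):
    """Collect every uploaded part by following the pagination marker."""
    parts, marker = [], None
    while True:
        resp = conn.list_parts(vault_name, upload_id, marker=marker)
        parts.extend(resp['Parts'])
        marker = resp.get('Marker')
        if not marker:  # no Marker means this was the last page
            break
    return parts
```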

gburca commented 11 years ago

The reason it doesn't work is that one of my changes got reverted (probably when the recent merges were done). I looked at the master branch, and in glaciercorecalls.py, GlacierVault.make_request is no longer passing on the params argument. To add the change back:
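
```diff
- return self.connection.make_request(method, uri, headers, data)
+ return self.connection.make_request(method, uri, headers, data, params=params)
```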

offlinehacker commented 11 years ago

I think calling boto directly is the right way to go :)

gburca commented 11 years ago

I'm not disagreeing. I'm just pointing out why listing parts, and other commands that depend on pagination markers, are currently broken until the transition to boto is complete, or until the one-liner fix I indicated above is re-introduced.

wvmarle commented 11 years ago

I'm currently at about 90% direct boto. Only upload and download are still handled partially internally; the rest is all boto calls.

Trying to find a way to cut down on memory usage by reading directly from the file rather than copying chunks into memory. Download also needs work to do this part by part, instead of all in one go like it does now.
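
A rough sketch of what a part-by-part download could look like, assuming boto's Layer1.get_job_output takes a byte_range tuple that becomes an HTTP Range header and that the returned response object supports read(); the chunk size and names here are illustrative:

```python
import boto.glacier.layer1

CHUNK = 128 * 1024 * 1024  # illustrative 128 MB download chunk

def download_by_parts(conn, vault_name, job_id, total_size, out_path):
    """Fetch the retrieval job output one byte range at a time and append it to a file."""
    with open(out_path, 'wb') as out:
        start = 0
        while start < total_size:
            end = min(start + CHUNK, total_size) - 1
            resp = conn.get_job_output(vault_name, job_id, byte_range=(start, end))
            out.write(resp.read())
            start = end + 1

# e.g. download_by_parts(boto.glacier.layer1.Layer1(), 'myvault', job_id, size, '/tmp/out')
```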

offlinehacker commented 11 years ago

You could cut memory usage by mmapping parts of the file into memory. That way you don't have to change the core calls.

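A minimal sketch of the mmap idea, assuming a plain local file and Glacier's power-of-two MB part sizes (which keep the offsets aligned for mmap); the part size and path are just illustrative:

```python
import hashlib
import mmap
import os

PART_SIZE = 128 * 1024 * 1024  # illustrative 128 MB part size
MB = 1024 * 1024

def iter_mapped_parts(path, part_size=PART_SIZE):
    """Yield (offset, buffer) per part without copying whole parts into Python memory."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        offset = 0
        while offset < size:
            length = min(part_size, size - offset)
            part = mmap.mmap(f.fileno(), length, offset=offset, access=mmap.ACCESS_READ)
            yield offset, part
            part.close()
            offset += length

# Example: compute the 1 MB leaf hashes (the inputs to the tree hash) for each part.
for offset, part in iter_mapped_parts('/path/to/archive'):
    leaves = [hashlib.sha256(part[i:i + MB]).digest() for i in range(0, len(part), MB)]
    print(offset, len(leaves))
```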

wvmarle commented 11 years ago

Good one. Will look into that. It would be great to be able to handle big chunks without using much memory for them. Hope performance is still good.

offlinehacker commented 11 years ago

The only problem is stdin. In that case mmapping would not work, but then I don't see a better solution than reading the whole part into memory.

Disk I/O speed should generally not be a problem for reading the whole file twice (once for hashing and once for upload), but I would still have an option to allow both methods; the changes needed to use mmap should not be significant anyway.

wvmarle commented 11 years ago

I see stdin as a less important method (I have no idea why someone would want to use it; large amounts of data I'd normally write to a local file before uploading). It's nice to have, but other than spooling the data to disk and re-reading it, there is no way to avoid buffering complete blocks: after all, we must take the tree hash before uploading. So if memory is a constraint, the user will just have to dump their stream to a local disk first and then upload it to Glacier, or use smaller block sizes and not send out too much data.

offlinehacker commented 11 years ago

I think stdin is a great method, because you can encrypt and compress on the fly, but I agree that in that case you should have enough memory.

You must consider that some people are short on disk space rather than memory, and still want to upload big encrypted and compressed archives.

We would also need support for resuming this kind of upload.

wvmarle commented 11 years ago

Oh yes, good one. Forgot. Bacula encrypts and compresses my archives already so it's not an issue for me.

Support for resumption from stdin is there already (see my latest pull request); it's the exact same code that handles file resumption. It's the same task after all. It just reads the data block by block regardless of where it comes from, takes the tree hash, and compares it to the hash provided by Glacier.

It seems though (need to investigate more - not sure if I'm correct here) that it breaks in the following situation:

I do not sort blocks; I take a page of 50 blocks and check those, then take the next page of 50 blocks, check those, and so on. For files you can just seek to the byte range; for stdin the data must be read consecutively.

So for stdin you would have to first fetch all pages of blocks, sort them by byte range, and only then start checking. That may be a rather lengthy process if you have to fetch some 20 pages of hashes, and a waste of time if it then fails, so I didn't do it. For files it's irrelevant. It's an issue that should be investigated, and fixed for stdin jobs.
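
For what it's worth, a sketch of that "fetch everything, sort, then check" approach for stdin is below. It assumes the ListParts entries carry RangeInBytes and SHA256TreeHash (as in the Glacier API), and that the uploaded parts are consecutive from offset 0, which is the stdin case; the names are illustrative.

```python
import binascii
import hashlib
import sys

MB = 1024 * 1024

def tree_hash(data):
    """Glacier-style SHA-256 tree hash: hash 1 MB chunks, then fold digests pairwise."""
    hashes = [hashlib.sha256(data[i:i + MB]).digest()
              for i in range(0, len(data), MB)] or [hashlib.sha256(b'').digest()]
    while len(hashes) > 1:
        hashes = [hashlib.sha256(hashes[i] + hashes[i + 1]).digest()
                  if i + 1 < len(hashes) else hashes[i]
                  for i in range(0, len(hashes), 2)]
    return binascii.hexlify(hashes[0]).decode()

def verify_stdin(parts, part_size):
    """Sort the uploaded parts by byte range and compare each against the stream."""
    parts = sorted(parts, key=lambda p: int(p['RangeInBytes'].split('-')[0]))
    stream = getattr(sys.stdin, 'buffer', sys.stdin)  # binary stdin
    for part in parts:
        data = stream.read(part_size)
        if tree_hash(data) != part['SHA256TreeHash']:
            return False  # mismatch: this stream cannot resume that upload
    return True
```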

I also really gotta add support for Bacula's multi-file list... /path/to/backup/vol001|vol002|vol003|...

As you can imagine my automated upload of backups is broken now :-)