uskudnik / amazon-glacier-cmd-interface

Command line interface for Amazon Glacier
MIT License

inventory always requests retrieval #70

Closed tuctboh closed 12 years ago

tuctboh commented 12 years ago

Hi,

We uploaded files on Friday, so this morning I requested an inventory. There had been one out there according to the console, but instead I see :

[dbadmin@w-db-06 ~]$ glacier-cmd inventory DB-2011-09
Inventory retrieval in progress. Job ID: q5sn5uG_SA5fojN524g3vt3354f5478686FESeNfu4rn1t0sU0Dbe-wbjen3UYnm-lhU6bthgsd67nm863m. Job started (time in UTC): 2012-10-08T15:15:01.401Z.

I can't describevault, I get :

[dbadmin@w-db-06 ~]$ glacier-cmd describevault DB-2011-09
Traceback (most recent call last):
  File "/usr/bin/glacier-cmd", line 8, in <module>
    load_entry_point('glacier==0.2dev', 'console_scripts', 'glacier-cmd')()
  File "/opt/python2.7/lib/python2.7/site-packages/glacier-0.2dev-py2.7.egg/glacier/glacier.py", line 521, in main
    args.func(args)
  File "/opt/python2.7/lib/python2.7/site-packages/glacier-0.2dev-py2.7.egg/glacier/glacier.py", line 71, in wrapper
    return fn(*args, **kwargs)
  File "/opt/python2.7/lib/python2.7/site-packages/glacier-0.2dev-py2.7.egg/glacier/glacier.py", line 118, in describevault
    print_output(response, keys=keys)
  File "/opt/python2.7/lib/python2.7/site-packages/glacier-0.2dev-py2.7.egg/glacier/glacier.py", line 47, in print_output
    table.add_row([line[keys[k]] for k in keys])
TypeError: string indices must be integers

offlinehacker commented 12 years ago

What did we forget to handle while printing those tables?

wvmarle commented 12 years ago

Testing. That's what was forgotten :-( I ran into this issue myself already; something changed upstream and broke stuff in unexpected places.

I'm wrapping up my new wrapper now; I'm doing final tests and all seems to work. It uses Boto calls as much as possible and supports resumption of interrupted uploads.
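
For readers following the traceback: "string indices must be integers" is what Python raises when a string is indexed with a string key. One plausible cause (an assumption, not confirmed in this thread) is that the printing code expected a list of dicts but received a single dict, so iterating over it yielded its keys as strings. A minimal sketch of a defensive normalisation step, where `rows_from_response` is a hypothetical helper and not glacier-cmd's actual code:

```python
def rows_from_response(response, keys):
    """Build one table row per record in `response`.

    `keys` maps a column header to the field name to look up, mirroring
    the `line[keys[k]] for k in keys` expression in the traceback.
    Hypothetical helper for illustration only.
    """
    # A single record (one dict) is wrapped so that the loop below always
    # iterates over records and never over the keys of one dict.
    if isinstance(response, dict):
        response = [response]

    rows = []
    for record in response:
        if not isinstance(record, dict):
            raise TypeError("expected a dict per record, got %s"
                            % type(record).__name__)
        # Missing fields become empty cells instead of raising KeyError.
        rows.append([record.get(keys[k], "") for k in keys])
    return rows


# Illustrative data only; the values are made up.
keys = {"Vault name": "VaultName", "Size": "SizeInBytes"}
single = {"VaultName": "DB-2011-09", "SizeInBytes": 10}
print(rows_from_response(single, keys))
```

The same function accepts a list of such dicts, so a caller does not need to know which shape the upstream library returned.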

tuctboh commented 12 years ago

How about for the inventory? Is there some check that says "If last inventory is > 4 hours then reget"?

wvmarle commented 12 years ago

I'm a bit confused about this issue.

We don't have an automatic 'reget' based on time. If no inventory job is available, a new one is created automatically; you can also use the --refresh flag to initiate a job there and then.

Amazon itself takes an inventory of your vaults roughly once every 24 hours. I'm not sure whether initiating an inventory job actually refreshes your inventory sooner and gets you the latest information (and what counts as 'latest' when it takes about four hours just to produce a list of files?!). Based on what I've read about it and my experience playing around with it, it seems not.

tuctboh commented 12 years ago

If it makes you feel any better, I share the confusion. :)

It's my understanding that once I had uploaded the files to Glacier, it would index them within 24 hours. My vault via the console is showing "Inventory Last Updated: Sat, October 06, 2012 08:14:02 AM UTC-4". So if this is the case, why is "glacier-cmd inventory DB-2011-09" requesting a refresh? Shouldn't it just pull what's already there? I've yet to request an "inventory" without it telling me it was running a job. I still don't know what's in my vault. (And yea, 4 hours? Does an intern transcribe it off a special screen or something? ;) )

uskudnik commented 12 years ago

If I understand your problem correctly, refresh is requested because output of the job was already discarded - it takes 4 hours to get a job output and that output stays available up to 24h.

Could that be the case?
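
The lifecycle described here (a job takes about four hours to complete and its output then stays available for up to 24 hours) can be sketched as a small decision function. This is only an illustration of the reasoning; `next_inventory_step` and `OUTPUT_LIFETIME` are hypothetical names, the timestamps are made up, and this is not how glacier-cmd is implemented:

```python
from datetime import datetime, timedelta

# Figure quoted in this thread: job output stays available for up to 24 hours.
OUTPUT_LIFETIME = timedelta(hours=24)


def next_inventory_step(started_at, completed_at, now):
    """Decide what an inventory request should do next.

    started_at / completed_at are datetimes, or None when no job has been
    started / the job has not finished yet.
    """
    if started_at is None:
        return "start_job"   # no job exists at all
    if completed_at is None:
        return "wait"        # job is still running
    if now - completed_at <= OUTPUT_LIFETIME:
        return "download"    # output is still available
    return "start_job"       # output was discarded, a new job is needed


now = datetime(2012, 10, 8, 15, 15)
# A job that finished two days earlier: its output is gone.
print(next_inventory_step(datetime(2012, 10, 6, 8, 0),
                          datetime(2012, 10, 6, 12, 0), now))
```

Under these assumptions, an inventory that the console reports as updated two days ago would still lead to a new job, which matches the behaviour reported above.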

tuctboh commented 12 years ago

AH. I didn't realize it's only available for 24 hours. I'll wait for the job to finish and see if I can retrieve it then. Thank you for your understanding and help.

tuctboh commented 12 years ago

I received notification that the inventory job had run, and then asked for the inventory. I got it! Then I wrote a program to verify the local checksums against the remote checksums. Out of 300 files, 210 match, 89 don't match, and 1 is missing. My offload script was:

for i in `ls */*2011-09*` ; do echo $i; glacier-cmd upload DB-2011-09 $i "DB Archive - $i"; done

Check script is as follows :

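#!/bin/bash
# Usage: <script> <filename pattern> <inventory log file>

for line in `ls */*$1*`
do
    # Pull the hash and description columns for this file out of the log,
    # strip spaces, and turn the "DBArchive-" prefix into a "|" separator.
    logline=$(grep $line $2|awk -F"|" '{ print $5,$6}'|sed 's/\ //g'|sed 's/DBArchive-/|/g')
    split=($(echo $logline|tr "|" "\n"))
    sha256sum=${split[0]}
    filename=${split[1]}

    # Compute the local SHA-256 and compare it with the logged one.
    loutput=($(sha256sum $filename|tr " " "\n"))
    if [ "${loutput[0]}" != "$sha256sum" ]; then
        echo $filename does not match ${loutput[0]} vs $sha256sum
    else
        echo $filename
    fi
done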

(It's ugly shell since I'm not allowed to use perl here. ;) )

Any idea why such a high rate of mismatches? I can see 1 file timing out possibly. Is this something with the "over 1MB" issue? The files that failed were definitely more than 1MB.

wvmarle commented 12 years ago

Yes, over 1MB it will fail. You must take a "tree-hash" for those.

Relevant code (will add a command to glacier-cmd to take a tree hash of an existing file):

def chunk_hashes(data):
    """
    Break up the byte-string into 1MB chunks and return sha256 hashes
    for each.
    """
    chunk = 1024*1024
    chunk_count = int(math.ceil(len(data)/float(chunk)))
    return [hashlib.sha256(data[i*chunk:(i+1)*chunk]).digest() for i in range(chunk_count)]

def tree_hash(fo):
    """
    Given a hash of each 1MB chunk (from chunk_hashes) this will hash
    together adjacent hashes until it ends up with one big one. So a
    tree of hashes.
    """
    hashes = []
    hashes.extend(fo)
    while len(hashes) > 1:
        new_hashes = []
        while True:
            if len(hashes) > 1:
                first = hashes.pop(0)
                second = hashes.pop(0)
                new_hashes.append(hashlib.sha256(first + second).digest())
            elif len(hashes) == 1:
                only = hashes.pop(0)
                new_hashes.append(only)
            else:
                break
        hashes.extend(new_hashes)
    return hashes[0]

# Open in binary mode so the hash is computed over the raw bytes.
data = open('file_to_check', 'rb').read()
print tree_hash(chunk_hashes(data))

Replace 'file_to_check' with the path of the file you want to check. The code above reads the complete file into memory and will probably crash spectacularly on big files, so glacier-cmd is a bit smarter about this. For files that are not too huge, you can use it to check your hashes.
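
As a hedged illustration of how the memory problem could be avoided, the sketch below reads the file one 1 MB chunk at a time and keeps only the chunk digests in memory. It is written for Python 3 (the code above is Python 2) and is not glacier-cmd's implementation; `tree_hash_stream` and `CHUNK_SIZE` are hypothetical names, and the handling of empty input is an assumption:

```python
import hashlib
import io

CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size used in the code above


def tree_hash_stream(fileobj, chunk_size=CHUNK_SIZE):
    """Return the hex tree hash of a binary file object, read chunk by chunk."""
    level = []
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        level.append(hashlib.sha256(chunk).digest())

    if not level:
        # Assumption: treat empty input as the SHA-256 of zero bytes.
        level = [hashlib.sha256(b"").digest()]

    # Hash adjacent pairs together until one digest remains; an unpaired
    # digest at the end of a level is carried up unchanged.
    while len(level) > 1:
        paired = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                paired.append(hashlib.sha256(level[i] + level[i + 1]).digest())
            else:
                paired.append(level[i])
        level = paired
    return level[0].hex()


# Small in-memory demonstration; for a real file use open(path, "rb").
print(tree_hash_stream(io.BytesIO(b"hello")))
```

For input that fits in a single chunk the result equals the plain SHA-256 of the data, which is consistent with the observation in this thread that only the files larger than 1 MB failed the plain sha256sum comparison.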

tuctboh commented 12 years ago

Hi,

Thanks... I added an import of math and hashlib..... But for my file I'm getting :

[tuc@valhalla Desktop]$ python treehash.py �S⡕~P{���b�)��`�Y v�9Ft�b

Tuc

wvmarle commented 12 years ago

That is the raw binary digest; it has to be converted to hex to make it human readable:

def bytes_to_hex(data):
    # Python 2: turn each byte of the digest into two hex digits.
    return ''.join("%02x" % ord(x) for x in data)

And in the meantime I've just added a treehash command to glacier-cmd to help you calculate the tree hash. So (as soon as my patches have been merged) you can run: $ glacier-cmd treehash <filename>
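
For anyone running Python 3 rather than the Python 2.7 used in this thread, the hex conversion does not need a helper function. A minimal sketch using only the standard library (the input bytes are an arbitrary example):

```python
import binascii
import hashlib

digest = hashlib.sha256(b"example").digest()  # raw 32-byte digest

hex_a = digest.hex()                              # bytes.hex(), Python 3.5+
hex_b = binascii.hexlify(digest).decode("ascii")  # works on older versions too
hex_c = hashlib.sha256(b"example").hexdigest()    # hex straight from hashlib

print(hex_a)
```

All three produce the same 64-character lowercase hex string, which is the form that can be compared against the checksums listed in an inventory.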

tuctboh commented 12 years ago

You guys rock. Sorry for all the newbiness. And thank you for the new command and previous fix.

wvmarle commented 12 years ago

Welcome. You're not the only newbie here :-)

And I think this issue can be closed?

uskudnik commented 12 years ago

I think so too.

tuctboh commented 11 years ago

Please close. That way I won't feel bad opening the next issue.