uskudnik / amazon-glacier-cmd-interface

Command line interface for Amazon Glacier
MIT License

Uploading large file: SHA hash fails. #36

Closed wvmarle closed 11 years ago

wvmarle commented 11 years ago

I've seemingly successfully uploaded a large file, but now I'm getting an SHA hash mismatch.

Uploading a large file gives the results below (my AWS key etc. are in the /etc/glacier.cfg file). I noticed that the final progress update is not the actual number of bytes; I can only check the inventory of my vault tomorrow, as Glacier is slow in updating even that.

$ glacier -c /etc/glacier.cfg upload Squirrel_backup /backup/bacula/Squirrel-Users.2012-09-24_13.58.55_06 Squirrel-Users.2012-09-24_13.58.55_06
Wrote 9,126,805,504 bytes. Created archive with ID: [removed]
Archive SHA256 hash: ff04e6df54b2dbba3929eb41df2cb529be72ed98f2f02323a931e8b561eb8bab

$ sha256sum Squirrel-Users.2012-09-24_13.58.55_06
0b2bc8d7720eaa45e15894596e120caeb4b7beede6193b06f92e36884193e2e5  Squirrel-Users.2012-09-24_13.58.55_06

$ ls -l Squirrel-Users.2012-09-24_13.58.55_06
-rw-r----- 1 bacula tape 9126812816 Sep 24 20:07 Squirrel-Users.2012-09-24_13.58.55_06

However, uploading a small file gives me a correct SHA hash:

$ glacier -c /etc/glacier.cfg upload Squirrel_backup /backup/bacula/Squirrel-Users.2012-09-25_04.05.00_08 Squirrel-Users.2012-09-25_04.05.00_08
Wrote 0 bytes. Created archive with ID: [removed]
Archive SHA256 hash: e70eb3cf6144b2541da1c54dfb953327d1782f7128c429f920c6886f8bd0d2e1

$ sha256sum Squirrel-Users.2012-09-25_04.05.00_08
e70eb3cf6144b2541da1c54dfb953327d1782f7128c429f920c6886f8bd0d2e1  Squirrel-Users.2012-09-25_04.05.00_08

$ ls -l Squirrel-Users.2012-09-25_04.05.00_08
-rw-r----- 1 bacula tape 168731 Sep 25 13:09 Squirrel-Users.2012-09-25_04.05.00_08

uskudnik commented 11 years ago

That's... weird :) More so because Amazon stops the upload if the hash of any part does not match the one uploaded (see http://docs.amazonwebservices.com/amazonglacier/latest/dev/api-upload-part.html). Do report what inventory retrieval returns...

gburca commented 11 years ago

Your uploads are probably just fine. Don't forget that Amazon (and the script) don't compute a straight SHA256. The relevant documentation is: http://docs.amazonwebservices.com/amazonglacier/latest/dev/checksum-calculations.html

What they compute is essentially the hash of a tree of hashes, so that's why it doesn't match what you're computing.

It might be useful to add an option to the tool to compute the hash of a file the Glacier way, so that users can compare it with what Amazon is reporting.
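For illustration, here's a rough sketch of that calculation in Python, following the checksum documentation linked above: hash every 1 MiB chunk, then hash concatenated pairs level by level until a single root hash remains. The glacier_tree_hash name is hypothetical, not something the tool currently provides:

import hashlib

MiB = 1024 * 1024  # Glacier builds the tree from 1 MiB leaf chunks

def glacier_tree_hash(path):
    # Leaf level: SHA256 digest of every 1 MiB chunk of the file.
    level = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(MiB)
            if not chunk:
                break
            level.append(hashlib.sha256(chunk).digest())
    if not level:
        level = [hashlib.sha256(b'').digest()]  # empty-file edge case
    # Hash concatenated pairs level by level; an odd trailing hash
    # is promoted to the next level unchanged.
    while len(level) > 1:
        level = [hashlib.sha256(b''.join(level[i:i + 2])).digest()
                 if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
    return level[0].hex()

Comparing its output against the hash printed at the end of an upload would give exactly the verification step suggested here.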

gburca commented 11 years ago

It looks like @jose1711 asked for the same enhancement in issue #11. The script computes the SHA256 tree hash, but only during upload. There's no way to do it after the fact to verify the SHAs match.

wvmarle commented 11 years ago

Just going through the code to figure out what's going on. Another issue I noticed is that the final number of bytes written does not match the actual size of the file sent. And now I've found an interesting piece of code, glaciercorecalls.py line 314 onward:

def write(self, str):
    assert not self.closed, "Tried to write to a GlacierWriter that is already closed!"
    self.buffer.append(str)
    self.buffer_size += len(str)
    while self.buffer_size > self.part_size:
        self.send_part()

First of all, I suggest renaming str to part, as str is not a good variable name (it shadows the built-in str type).

When the object is initialised, self.buffer_size == 0

For a large file, once the first 128 MB chunk has been written, the buffer is full, but then self.buffer_size == self.part_size, the while condition is false, and nothing is sent. Only when data from the second chunk arrives is a part actually sent.

As a result, a file smaller than self.part_size is never sent at all, and the last part of a big file is not sent either.

Maybe I'm reading the code wrong, but this is what it looks like to me. It fully explains why the final number of bytes reported by the status indicator is wrong, and it could also explain why the hash doesn't match (the file sent would be incomplete!).
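To make that concrete, here's a toy rehearsal of just the write() loop, with part_size shrunk to 4 bytes and send_part() reduced to plain counting (both simplifications are mine):

# Toy rehearsal of the write() loop only; real parts are 128 MB.
buffer_size, part_size, parts_sent = 0, 4, 0

def write(data):
    global buffer_size, parts_sent
    buffer_size += len(data)
    while buffer_size > part_size:
        buffer_size -= part_size
        parts_sent += 1

write(b"abcd")
print(parts_sent)  # 0 -- exactly part_size bytes still sit in the buffer
write(b"ef")
print(parts_sent)  # 1 -- only now does a part go out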

offlinehacker commented 11 years ago

There's another piece of code that sends the last part when the writer is closed:

def close(self):
    if self.closed:
        return
    if self.buffer_size > 0:
        self.send_part()

If there's a problem, I think it must be somewhere else. Have you downloaded the archive and checked its checksum again? I will try to figure this out, but I'm quite puzzled, because nobody else has reported an error like this before. Can anybody else confirm these problems? (I have a hard time uploading "such big" files to Glacier.) @uskudnik, what do you think? We also need to add logging to glaciercorecalls or we will always be lost like we are now.
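Putting write() and close() together in the same toy style (4-byte parts, a send_part() that only counts; both simplifications are mine) shows that the last part is deferred, not dropped, as long as close() is called:

class ToyWriter:
    # Toy model of GlacierWriter's flushing behaviour, not the real class.
    def __init__(self, part_size=4):
        self.buffer_size, self.part_size, self.parts_sent = 0, part_size, 0

    def send_part(self):
        self.buffer_size = max(0, self.buffer_size - self.part_size)
        self.parts_sent += 1

    def write(self, data):
        self.buffer_size += len(data)
        while self.buffer_size > self.part_size:
            self.send_part()

    def close(self):
        if self.buffer_size > 0:  # the last, possibly short, part
            self.send_part()

w = ToyWriter()
w.write(b"abcdefgh")  # two full parts' worth of data
w.close()
print(w.parts_sent)   # 2 -- write() ships one part, close() the other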

uskudnik commented 11 years ago

Yup, I was looking at logging already and it is a must.

As for the problem - nope, I haven't heard of it before and as far as I am aware people have managed to upload multi-GB files without problems...

As for the hash mismatch - what @gburca said; I forgot about the hash of the tree of hashes. Adding an option to run validation might be a good idea.

wvmarle commented 11 years ago

I noticed the close() function later. Somehow the final byte count seems to be working now, too. No idea why it wasn't before.

I haven't downloaded any archive yet - I will have done so in a day or two, since I need to make sure I can. But Glacier is not exactly fast :-)

uskudnik commented 11 years ago

Cool. Closing.