uskudnik / amazon-glacier-cmd-interface

Command line interface for Amazon Glacier
MIT License

Parallel uploads #91

Open wvmarle opened 11 years ago

wvmarle commented 11 years ago

Parallel uploads; upload resumption; download resumption. All in one go - should be able to apply this to master without conflicts.

offlinehacker commented 11 years ago

Will we ever migrate the upload/download process to boto? What are the plans? They have parallel upload support too.

wvmarle commented 11 years ago

Interesting, I missed that part of Boto. Will look into it, maybe it works better than my solution (I always get response errors). It seems to offer no progress updates; I may consider extending the class or even lifting the code and amending it.

SitronNO commented 11 years ago

I have pulled this branch using the following code:

git clone -b parallel_uploads git://github.com/wvmarle/amazon-glacier-cmd-interface.git amazon-glacier-cmd-interface_parallel_uploads

and then built it with:

cd amazon-glacier-cmd-interface_parallel_uploads/
sudo python setup.py install

However, it does not work:

$ glacier-cmd upload Test Privat/amazon/amazon_glacier_testfile.data --description "Random data"
Traceback (most recent call last):
  File "/usr/local/bin/glacier-cmd", line 9, in <module>
    load_entry_point('glacier==0.2dev', 'console_scripts', 'glacier-cmd')()
  File "/usr/local/lib/python2.7/dist-packages/glacier-0.2dev-py2.7.egg/glacier/glacier.py", line 811, in main
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/glacier-0.2dev-py2.7.egg/glacier/glacier.py", line 147, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/glacier-0.2dev-py2.7.egg/glacier/glacier.py", line 302, in upload
    args.resume, args.sessions)
  File "/usr/local/lib/python2.7/dist-packages/glacier-0.2dev-py2.7.egg/glacier/GlacierWrapper.py", line 68, in wrapper
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/glacier-0.2dev-py2.7.egg/glacier/GlacierWrapper.py", line 211, in glacier_connect_wrap
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/glacier-0.2dev-py2.7.egg/glacier/GlacierWrapper.py", line 68, in wrapper
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/glacier-0.2dev-py2.7.egg/glacier/GlacierWrapper.py", line 264, in sdb_connect_wrap
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/glacier-0.2dev-py2.7.egg/glacier/GlacierWrapper.py", line 68, in wrapper
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/glacier-0.2dev-py2.7.egg/glacier/GlacierWrapper.py", line 961, in upload
    part_size = self._check_part_size(part_size, total_size)
  File "/usr/local/lib/python2.7/dist-packages/glacier-0.2dev-py2.7.egg/glacier/GlacierWrapper.py", line 415, in _check_part_size
    part_size = self._next_power_of_2(total_size / (1024*1024*self.MAX_PARTS))
AttributeError: 'GlacierWrapper' object has no attribute 'MAX_PARTS'

Am I doing something wrong, or is there a bug somewhere?

offlinehacker commented 11 years ago

There's a bug. Please uncomment line 100 in GlacierWrapper.py

MAX_PARTS = 10000

offlinehacker commented 11 years ago

... and set it to 1000. Or change the variable name on the next line.

wvmarle commented 11 years ago

Whoops - only a 0 was supposed to go, not that S. My bad!

Anyway it seems that the 10,000 parts should also work now?
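
For readers following the MAX_PARTS discussion, here is a minimal sketch of the arithmetic behind the line in the traceback. It assumes the Glacier limit of 10,000 parts per multipart upload and a part size that is a power of two in MiB. The function names are illustrative and are not the ones in GlacierWrapper.py.

```python
import math

MAX_PARTS = 10000      # assumed Glacier limit on parts per multipart upload
MIB = 1024 * 1024


def next_power_of_2(value):
    """Return the smallest power of two that is >= value (at least 1)."""
    value = max(1, math.ceil(value))
    return 1 << (value - 1).bit_length()


def min_part_size_mib(total_size):
    """Smallest power-of-two part size (in MiB) that keeps the
    upload of total_size bytes within MAX_PARTS parts."""
    return next_power_of_2(total_size / (MIB * MAX_PARTS))
```

With these assumptions a 100 GiB archive needs parts of at least 16 MiB, since 8 MiB parts would require more than 10,000 of them.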

SitronNO commented 11 years ago

@wvmarle: Yes, both the current code and the code with MAX_PARTS = 10000 work. I have tested both. This code should be merged into the main branch, as it fixes the bug I reported.

uskudnik commented 11 years ago

A week without any fixes - I will presume this is stable and merge tomorrow unless @wvmarle says otherwise and no new bugs are discovered.

wvmarle commented 11 years ago

As stable as it gets I think. Haven't had much time recently to do anything with the code.

The only issue I have is the continuous and mysterious "response error" replies from Amazon...

SitronNO commented 11 years ago

I get this issue with larger files:

Process Process-1:6.0 GB (76%). Average rate 374.80 KB/s, eta 20:28:50.
Traceback (most recent call last):
  File "/usr/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/usr/lib/python2.6/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/glaciercorecalls.py", line 110, in upload_part_process
    writer.write(part, start=start)
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/glaciercorecalls.py", line 188, in write
    code=e.code)
ResponseException

At this point it just hangs, so I have to break (press CTRL+C) and that gives the following error:

^CTraceback (most recent call last):
  File "/usr/local/bin/glacier-cmd", line 9, in <module>
    load_entry_point('glacier==0.2dev', 'console_scripts', 'glacier-cmd')()
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/glacier.py", line 811, in main
    args.func(args)
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/glacier.py", line 147, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/glacier.py", line 302, in upload
    args.resume, args.sessions)
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/GlacierWrapper.py", line 68, in wrapper
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/GlacierWrapper.py", line 211, in glacier_connect_wrap
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/GlacierWrapper.py", line 68, in wrapper
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/GlacierWrapper.py", line 264, in sdb_connect_wrap
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/GlacierWrapper.py", line 68, in wrapper
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/glacier-0.2dev-py2.6.egg/glacier/GlacierWrapper.py", line 1251, in upload
    time.sleep(1)   # manage timeouts and status updates etc.
KeyboardInterrupt

When this happens, I just repeat the same command, just adding a --resume and it goes on from where it broke.

Is this the same error you are referring to, or something you have not seen before?

wvmarle commented 11 years ago

Yes, that's the issue I'm referring to. Very irritating.

Dozens if not hundreds of parts are uploaded and accepted fine, and then suddenly the signature is not accepted (and by my understanding, the signature is related to your login credentials - it's done by Boto and I haven't dug deep enough to know exactly how that is done).

uskudnik commented 11 years ago

@wvmarle Any luck tracking the bug down? Do you know if it's a boto issue or an Amazon one?

offlinehacker commented 11 years ago

The problem here is different. First of all, boto's upload implementation sucks. When you are doing uploads, the calls to HTTP functions must be able to time out, and you must always expect that something will fail and detect what failed. If it's disk corruption, you must have an option to verify uploads; if it's a network error, you must also be able to re-upload. Amazon has a great API that allows re-uploading any part, so if uploading a part fails you can simply upload it again. Upload is the critical part. If that is not working, you can throw this, or any app using it, in the trash.

...so that's why I'm reimplementing the whole upload part, fixing some if not all of the problems.
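
The "expect failure and upload the part again" idea can be sketched as below. This is only an illustration under assumptions: `upload_part` is a hypothetical callable standing in for whatever actually sends one part, and it is assumed to raise IOError on a network failure. It is not the boto API and not the code in this branch.

```python
import time


def upload_with_retry(upload_part, part, retries=3, delay=1.0, sleep=time.sleep):
    """Call upload_part(part), retrying on IOError.

    upload_part is a hypothetical callable; retries is the total number
    of attempts. The last error is re-raised if every attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return upload_part(part)
        except IOError as error:
            last_error = error
            if attempt < retries:
                # Back off a little longer after each failed attempt.
                sleep(delay * attempt)
    raise last_error
```

Because each part is retried on its own, one failed part does not force the whole archive to be sent again.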

wvmarle commented 11 years ago

@uskudnik : nothing done yet on this one. Just got myself a set of new toys (including an Epson wifi printer: took me 6 hours to get that installed!! Had to hunt down an unofficial ISO of the installation CD as I don't have CDROM players anymore and the official downloads are broken...) and a new netbook :-) So my priorities are distracted :-) It is an issue that must be tracked down. The hardest part is that it's an error that I don't know how to trigger intentionally.

@offlinehacker : you mean you're re-implementing the boto upload routines? Wasn't that present in the original glaciercorecalls.py file already? I already took care of time-outs; the other parts are handled only indirectly, through the resume function (which has no problem with uploading non-sequential parts).

offlinehacker commented 11 years ago

@wvmarle : I am reimplementing the whole uploader and am still deciding whether I will use the upload_part routine from boto or not. The problem is that the current implementation, both ours and boto's, is not good, especially for parallel uploads. The functionality is implemented in one function that does everything, and we hope it does not crash. I am taking a lot of code from you and making it a little bit better. And please don't get me wrong, your work here is awesome; the problem is that upload must not have mistakes and must be implemented without bugs! One thing to say: everything will hopefully work when I complete this ;)

Currently the most helpful thing for me would be better formatting of exceptions, handling of that cause variable (which btw is awesome), and printing the whole exception tree (this should work; at least CausedException had that, and it was removed somewhere in between?). This way we will be able to debug much more easily. So instead of trying to debug upload routines again and again on not very good implementations, please help me with implementing the things above. I will make some commits during the day, and if somebody tests the new functionality or writes tests I will be very happy; otherwise I will have to write the tests myself before we merge anything into master!

And please start writing tests before you implement anything else, or we will end up with a blob of non-working code!

wvmarle commented 11 years ago

CausedException is integrated into GlacierException, and the stack trace is dumped in the log file at DEBUG level. This is because users normally don't need to see it, while developers can still get at it this way.

Any non-caught exceptions of course dump the stack trace to screen.

Agreed upload must not have bugs; writing tests otoh is also not easy until we fully and thoroughly understand Amazon's responses (like this response error issue) to be able to simulate errors.

offlinehacker commented 11 years ago

Thanks, but I was wondering because we get a lot of copy-pasted reports without full traces. But if an exception occurs, why not print the whole stack trace to the user? They mostly won't understand it anyway, but we will. Those exceptions that must be pretty-printed of course need to be handled differently.

wvmarle commented 11 years ago

Exceptions are done that way because I want to make it look a lot better for the end users, while still being able to get to the stack trace if really needed. For most of these exceptions (vault not found, invalid file name, etc.) the stack trace doesn't have any meaning anyway. Actually, that should be the case for all the exceptions that we catch, as the software behaves as expected in those situations. And all non-handled exceptions will have a stack trace no matter what.

We may consider defining a constant, say DEVELOPMENT = True, at the start of the script and dumping a stack trace based on this flag, setting it to False when an actual release is done.
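
A minimal sketch of how such a flag might be used. The `report_error` helper and its parameters are hypothetical and only illustrate the idea; this is not how GlacierException currently prints errors.

```python
import sys
import traceback

DEVELOPMENT = True  # hypothetical flag; would be set to False for a release


def report_error(exc, development=DEVELOPMENT, stream=sys.stderr):
    """Print the full stack trace in development mode, and only a
    short one-line message otherwise."""
    if development:
        traceback.print_exception(type(exc), exc, exc.__traceback__, file=stream)
    else:
        stream.write("Error: %s\n" % exc)
```

A caller would wrap the command in try/except and pass the caught exception to `report_error`, so end users see the short message while developers get the trace.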

offlinehacker commented 11 years ago

Yes, that flag would be cool, I support it ;)

I've also committed an almost finished but completely untested new upload implementation, available on my GitHub (don't even try to run it, it won't start), but you can see the core ideas (function _upload in GlacierWrapper and class Part in corecalls). Exactly the same code can upload with or without multiprocessing, using multiprocessing.imap or itertools.imap and some "hacks" behind it. It has proper mmap support and allows resuming whether the data comes from nonsequential or sequential input (by switching to reading instead of mmapping and disabling multiprocessing upload). If you have notes on what additional exceptions I should take care of, please do tell me.

I will hopefully finish it tomorrow (without tests, which will come in later days), also taking quite some code from this commit.
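
The "same code with or without multiprocessing" idea could look roughly like the sketch below. It is written for Python 3, where the built-in lazy `map` takes the place of `itertools.imap`, and `process_part` is a hypothetical stand-in that only hashes a part instead of uploading it. It is not the `_upload` function mentioned above.

```python
import hashlib
from multiprocessing import Pool


def process_part(part):
    """Stand-in for uploading one part: returns (index, SHA-256 hex digest)."""
    index, data = part
    return index, hashlib.sha256(data).hexdigest()


def process_all(parts, sessions=1):
    """Run process_part over all parts, in worker processes when
    sessions > 1 and serially otherwise. Results keep the input order."""
    if sessions > 1:
        with Pool(sessions) as pool:
            return list(pool.imap(process_part, parts))
    return list(map(process_part, parts))
```

Both branches go through the same per-part function, so the serial path can serve as a fallback when the input cannot be split across processes.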

skin commented 11 years ago

@wvmarle Hi, I tried to use your parallel-upload branch, but it seems to have a problem with files larger than 2 GB. It's something related to the mmap.mmap call in glaciercorecalls.py:

part = mmap.mmap(fileno=f.fileno(), length=stop-start, offset=start, access=mmap.ACCESS_READ)

I guess this error is similar to that one https://github.com/uskudnik/amazon-glacier-cmd-interface/pull/99

wvmarle commented 11 years ago

Yes, same issue.
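
For context on the call quoted above, here is a hedged sketch of mapping one part of a file with `mmap`. It does not diagnose the 2 GB problem; it only illustrates one constraint of the real API, namely that `offset` must be a multiple of `mmap.ALLOCATIONGRANULARITY`. The `read_part` helper is hypothetical.

```python
import mmap


def read_part(f, start, stop):
    """Return bytes [start, stop) of the open binary file f through a
    read-only memory map, aligning the map offset as mmap requires."""
    granularity = mmap.ALLOCATIONGRANULARITY
    aligned = start - (start % granularity)   # round down to an allowed offset
    delta = start - aligned                   # bytes to skip inside the map
    mapped = mmap.mmap(f.fileno(), length=(stop - start) + delta,
                       offset=aligned, access=mmap.ACCESS_READ)
    try:
        return mapped[delta:delta + (stop - start)]
    finally:
        mapped.close()
```

Mapping one part at a time keeps each map small regardless of the total file size, although whether that avoids the error reported here is untested.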