uskudnik / amazon-glacier-cmd-interface

Command line interface for Amazon Glacier
MIT License

Too much CPU used when using multipart - should have a way to throttle upload speed ? #32

Closed · farialima closed 12 years ago

farialima commented 12 years ago

This may not be easy to fix, but feedback is never bad to give...

I'm using glacier-cmd-interface to upload from DreamHost shared hosting to Amazon. However, for files that are larger than 100 MB I get:

(ve)[tigers]$ amazon-glacier-cmd-interface/glacier/glacier.py upload my_backups my_backup_file

Yikes! One of your processes (python, pid 14861) was just killed for excessive resource usage.                                                                                  
Please contact DreamHost Support for details.

Killed
(ve)[tigers]$ 

If the file is less than 100 MB, things are OK.

The process is killed while in:


    def make_request(self, method, path, headers=None, data='', host=None,
                     auth_path=None, sender=None, override_num_retries=None):
        headers = headers or {}
        headers.setdefault("x-amz-glacier-version","2012-06-01")
        return super(GlacierConnection, self).make_request(method, path, headers,
                                                           data, host, auth_path,
                                                           sender, override_num_retries)

So it may be that we are sending too much / too fast. I've tried to throttle CPU usage, but to no avail.

I would suggest adding a way to throttle the upload speed (as an option): I suppose it would fix this, and it would be useful for many people (you don't want a backup upload to take all the bandwidth...).

Probably not easy to implement - but who knows...

Since this library seems very useful, I thought it was worth reporting any issues I have! Thank you for this lib.

gburca commented 12 years ago

If you want to throttle upload speed (and you're in control of the machine, and it's running some flavor of Linux, etc.), take a look at /sbin/tc. It's not the most user-friendly tool out there, but it's very powerful. With a little bit of scripting you can run it before you start the glacier upload, and it's probably the most effective way to throttle your bandwidth. For some inspiration, here's the relevant portion of the script I use:

TC=/sbin/tc
IF=eth0                # outbound interface to shape
REGION="us-east-1"
# Resolve the Glacier endpoint to its A records (drop CNAME lines, which end in a dot)
IP=`dig +short +answer "glacier.${REGION}.amazonaws.com" A | grep -v '\.$' | tr '\n' ' '`
U32="$TC filter add dev $IF protocol ip parent 1:0 prio 1 u32"

# HTB qdisc with a rate-limited class; note that tc's "kbps" means kilobytes per second
$TC qdisc add dev $IF root handle 1: htb default 30
$TC class add dev $IF parent 1: classid 1:2 htb rate 200kbps
# Send anything destined for a Glacier IP through the limited class
for ip in $IP; do
    $U32 match ip dst $ip/32 flowid 1:2
done

And to remove the filtering:

$TC qdisc del dev $IF root

The nice part is that this technique works for any application, not just the glacier command line tool.

offlinehacker commented 12 years ago

Well, the solution by @gburca is cool and I think it should solve most of the problems, but we might still implement speed throttling once there are no more important bugs to solve, so let's leave this ticket open.

wvmarle commented 12 years ago

Looking back at this issue, I suspect it had to do with memory use rather than upload speed (the original upload code would use 4-5 times the block size - so files >100 MB would eat up 400-500 MB of RAM - not surprising that a cloud host would baulk at such a resource demand).
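
For reference, a minimal sketch of the kind of bounded-memory multipart reading that avoids this; the part size and the upload_part() call are hypothetical placeholders, not the actual glacier-cmd code:

PART_SIZE = 128 * 1024 * 1024  # hypothetical part size, in bytes

def iter_parts(path, part_size=PART_SIZE):
    # Yield (offset, data) pairs, holding at most one part in memory.
    with open(path, 'rb') as f:
        offset = 0
        while True:
            data = f.read(part_size)
            if not data:
                break
            yield offset, data
            offset += len(data)

# Each part is read, uploaded, and released before the next one,
# so peak memory stays near one part size instead of several.
for offset, data in iter_parts('my_backup_file'):
    upload_part(offset, data)  # hypothetical upload call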

For throttling upload speed: at this moment glacier-cmd supports only a single upload thread at a time (now, that could be an enhancement: allowing multiple uploads in parallel). It will use only as much speed as the system allows. Besides that, I have no idea how to throttle speeds; I think this would have to be done in boto, which is where the data is actually sent out.

uskudnik commented 12 years ago

Yup, since we are migrating to boto this will most probably have to be done in boto itself. Whether they will accept this, or even want it, I have no idea.

I did a bit of research on the subject and it appears it can be done, but it seems to be a bit complicated.

See http://stackoverflow.com/questions/456649/throttling-with-urllib2 and http://pastie.org/3120175.

It also appears Twisted can do it, but I would rather not mix Twisted into the equation if we can do it on our own: http://twistedmatrix.com/documents/10.1.0/api/twisted.protocols.policies.ThrottlingFactory.html
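
For the record, those links essentially boil down to wrapping the data source in a file-like object whose read() sleeps just enough to cap the average rate. A minimal sketch, assuming a bytes-per-second target (the class name and rate handling are mine, not the linked code):

import time

class ThrottledFile(object):
    # File-like wrapper that caps the average read rate (bytes/sec).
    # Callers should read in smallish chunks for smooth pacing.
    def __init__(self, fileobj, rate):
        self.fileobj = fileobj
        self.rate = float(rate)
        self.start = time.time()
        self.sent = 0

    def read(self, size=-1):
        data = self.fileobj.read(size)
        self.sent += len(data)
        # If we're ahead of schedule, sleep until sent/elapsed <= rate.
        expected = self.sent / self.rate
        elapsed = time.time() - self.start
        if expected > elapsed:
            time.sleep(expected - elapsed)
        return data

The idea is that whatever sends the request body pulls its data through read(), so the pacing happens transparently to the sender.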

wvmarle commented 12 years ago

I just had a quick look at the sources, and I think it'd be rather easy to implement, because basically what they do is "send some data, wait a bit, send some more data, wait again" so that the overall rate stays within a limit. We could do the same: send a part of the data, wait a bit, send another part. But then you're not really limiting the rate; you're sending in bursts, saturating your pipe part of the time and sending nothing the rest of the time (see the sketch at the end of this comment).

The key questions: is this useful? Is it worth the effort? Should we attempt it to begin with? Normally people want as fast an upload as possible. And as I said, I suspect it was the memory it took, not the transfer speed - farialima could confirm this.

tc, I'm sure, is the best solution if you want to limit speeds. A bitch to set up, but overall more flexible, and it's designed to do just that. Figure out how it can be done, add an example to the docs, forget about it. I think those who truly need it will be able to figure it out.
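
To make the burst trade-off concrete, it would look roughly like this; the target rate and the send() callable are made up for illustration:

import time

TARGET_RATE = 200 * 1024   # hypothetical target, bytes/sec

def send_throttled(chunks, send):
    for chunk in chunks:
        started = time.time()
        send(chunk)  # saturates the pipe for the duration of this burst
        # Pad the burst out to the time it should take at the target rate.
        budget = len(chunk) / float(TARGET_RATE)
        spent = time.time() - started
        if spent < budget:
            time.sleep(budget - spent)

The average over many chunks hits the target, but within each chunk the link is fully saturated - exactly the burstiness described above.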

uskudnik commented 12 years ago

An example in the docs will do. That's completely in line with the whole Linux philosophy of having one tool for the job, and in that fashion tc gives our users a lot more flexibility than we could ever provide.