peterknife / boto

Automatically exported from code.google.com/p/boto

Batch update for sdb #226

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. I don't want to hit the webservice for each attribute I'm setting
2. So I hooked in batch_update (that was already there) for item
3.

What is the expected output? What do you see instead?

Being able to use the dict method update() to update item attributes.

Please use labels and text to provide additional information.

Here's a patch:

$ svn diff
Index: boto/connection.py
===================================================================
--- boto/connection.py  (revision 1120)
+++ boto/connection.py  (working copy)
@@ -503,6 +503,8 @@
         params['Timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime())
         qs, signature = self.get_signature(params, verb, path)
         qs = path + '?' + qs + '&Signature=' + urllib.quote(signature)
+        print 'signature', action, signature, params, verb, path
+        print 'params', params
         if self.use_proxy:
             qs = self.prefix_proxy_to_path(qs)
         return self._mexe(verb, qs, None, headers)
Index: boto/sdb/item.py
===================================================================
--- boto/sdb/item.py    (revision 1120)
+++ boto/sdb/item.py    (working copy)
@@ -105,6 +105,15 @@
             self.domain.delete_attributes(self.name, [key])
         del self._dict[key]

+    def update(self, other_dict):
+        if self._dict == None:
+            self.load()
+        if self.active:
+            # domain requires a mapping of item name to the dictionary
+            # (so it can update multiple items at once)
+            self.domain.batch_put_attributes({self.name:other_dict})
+        self._dict.update(other_dict)
+
     def keys(self):
         if self._dict == None:
             self.load()
Index: boto/sdb/domain.py
===================================================================
--- boto/sdb/domain.py  (revision 1120)
+++ boto/sdb/domain.py  (working copy)
@@ -55,6 +55,10 @@
     def put_attributes(self, item_name, attributes, replace=True):
         return self.connection.put_attributes(self, item_name, attributes, replace)

+    def batch_put_attributes(self, other_dict, replace=True):
+        return self.connection.batch_put_attributes(self, other_dict, replace=replace)
+
+
     def get_attributes(self, item_name, attribute_name=None, item=None):
         return self.connection.get_attributes(self, item_name, attribute_name, item)

Original issue reported on code.google.com by matthewh...@gmail.com on 21 Apr 2009 at 11:00

GoogleCodeExporter commented 9 years ago
sorry ignore the connection.py debug statements :)

Original comment by matthewh...@gmail.com on 21 Apr 2009 at 11:01

GoogleCodeExporter commented 9 years ago
Thanks for the patch.  The one question I have is about the update method.  You are
using the batch put, but I don't think there is any need for that.  A normal
put_attributes call will work just fine, since you are only updating a single item.

Or am I missing something?

Original comment by Mitch.Ga...@gmail.com on 22 Apr 2009 at 12:15

GoogleCodeExporter commented 9 years ago
Most likely I'm missing something.  Yeah, put_attributes would probably work too :)

On that note, I'm going to have some domains with millions of entries.  What's the
best way to load that data?  I'm assuming it means batching up many (10K or so)
items and sending them, perhaps as many batches split among threads.  If that
doesn't exist, I'll probably want it within the week ;)  So if you have any hints,
that'd be great; otherwise I'll implement my own and send the patch.

Original comment by matthewh...@gmail.com on 22 Apr 2009 at 12:34

GoogleCodeExporter commented 9 years ago
I just checked with a friend who has done a lot of bulk uploads to SDB.  He
confirmed that the best approach is firing up multiple threads (the right number
depends on your upload bandwidth, among other things), each performing
batch_put_attributes commands.

There is still a threaded query method in connection.py, but I haven't done a
threaded uploader.  I would probably actually use the subprocess module if you are
on a linux platform and using python 2.5 or better.  I've never been a big fan of
threads, but YMMV.  I might have some sample code from another project that would
help.  I'll check.

Original comment by Mitch.Ga...@gmail.com on 22 Apr 2009 at 1:29
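
For anyone following along, the multi-threaded approach described above could look roughly like the sketch below. It is not code from boto: `upload_batches` and `put_batch` are hypothetical names, and it assumes a modern Python 3 with `concurrent.futures` rather than the Python 2.5 mentioned in this thread. `put_batch` stands for whatever callable sends one batch, for example a small wrapper around `domain.batch_put_attributes`. Whether a single boto connection can safely be shared between threads is not established here, so creating one connection per worker is the cautious assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_batches(put_batch, batches, workers=4):
    """Call put_batch(batch) for every batch using a pool of worker threads.

    put_batch is a hypothetical callable supplied by the caller.
    Returns a list of (batch, error) pairs for the batches that raised.
    """
    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit everything up front; the pool limits actual concurrency.
        futures = [(pool.submit(put_batch, batch), batch) for batch in batches]
        for future, batch in futures:
            error = future.exception()  # blocks until this batch has finished
            if error is not None:
                failed.append((batch, error))
    return failed
```

The returned list makes it possible to retry only the batches that failed. As noted above, the right number of workers depends on upload bandwidth, so `workers=4` is only a placeholder.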

GoogleCodeExporter commented 9 years ago
batch_put is a bit broken at the moment:

def batch_put_attributes(self, items, replace=True):
    return self.connection.put_attributes(self, item_name, attributes, replace)

item_name and attributes are not readily available (should they be .keys() and
.values()?)

Original comment by attila.c...@gmail.com on 28 Apr 2009 at 3:15

GoogleCodeExporter commented 9 years ago
attila is correct

here's my patch:

--- boto/sdb/domain.py  (revision 1125)
+++ boto/sdb/domain.py  (working copy)
@@ -91,7 +91,7 @@
         @rtype: bool
         @return: True if successful
         """
-        return self.connection.put_attributes(self, item_name, attributes, replace)
+        return self.connection.batch_put_attributes(self, items, replace)

     def get_attributes(self, item_name, attribute_name=None, item=None):
         """

Original comment by matthewh...@gmail.com on 28 Apr 2009 at 9:52

GoogleCodeExporter commented 9 years ago
thanks.  Stupid copy/paste error.  Hurrying too much.  Fixed in r1126.

Original comment by Mitch.Ga...@gmail.com on 28 Apr 2009 at 10:01

GoogleCodeExporter commented 9 years ago
It's still wrong.

You need to invoke .batch_put_attributes
instead of .put_attributes

Original comment by matthewh...@gmail.com on 28 Apr 2009 at 10:29

GoogleCodeExporter commented 9 years ago
good grief.

okay, I slowed down a little and added a simple test to tests/test_sdbconnection.py
that will at least make sure the domain.batch_put_attributes method is functional.

Original comment by Mitch.Ga...@gmail.com on 28 Apr 2009 at 10:44

GoogleCodeExporter commented 9 years ago
;)

Now I'm running into a problem where calling .batch_put_attributes gives me a
broken pipe.  Uploading one item at a time works, but is slow for an 80K test
data set....

Original comment by matthewh...@gmail.com on 28 Apr 2009 at 10:46

GoogleCodeExporter commented 9 years ago
Figured out my problem.  There is a limit of 25 items per batch_put_attributes call:
http://docs.amazonwebservices.com/AmazonSimpleDB/2007-11-07/DeveloperGuide/index.html?SDB_API_BatchPutAttributes.html

Sometimes I got a 'Broken pipe' error; then I lowered the number of items and
Amazon told me I was trying to insert too many.

Here's a patch to help client developers figure this out faster:

+++ boto/sdb/connection.py      (working copy)
@@ -261,6 +261,9 @@
         @rtype: bool
         @return: True if successful
         """
+        if len(items) > 25:
+            raise boto.BotoClientError("Can only insert 25 items in BatchPutAttributes (trying to insert %d)" %len(items))
+
         domain, domain_name = self.get_domain_and_name(domain_or_name)
         params = {'DomainName' : domain_name}
         self.build_batch_list(params, items, replace)

Original comment by matthewh...@gmail.com on 28 Apr 2009 at 11:04
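
Given the 25-item limit reported above, a large upload has to be split into smaller requests on the client side. The helper below is a minimal sketch of that splitting, not part of boto: `chunk_items` is a hypothetical name, and it assumes the items are held in a dict mapping item name to attribute dict, which is the shape `batch_put_attributes` is called with in this thread.

```python
from itertools import islice

MAX_BATCH_PUT_ITEMS = 25  # per-request item limit reported in this thread


def chunk_items(items, size=MAX_BATCH_PUT_ITEMS):
    """Yield successive dicts holding at most `size` entries of `items`."""
    iterator = iter(items.items())
    while True:
        chunk = dict(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each yielded dict could then be passed to one call, for example `for batch in chunk_items(all_items): domain.batch_put_attributes(batch)`, where `all_items` and `domain` are the caller's own objects.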

GoogleCodeExporter commented 9 years ago
Hmmmm, can anyone else confirm that batch_put_attributes works?

I can call it and it appears to return successfully, yet my domain reports that it
is empty....

BTW here's a better version of the last patch

--- boto/sdb/connection.py      (revision 1128)
+++ boto/sdb/connection.py      (working copy)
@@ -30,6 +30,9 @@
 from boto.exception import SDBResponseError
 from boto.resultset import ResultSet

+MAX_PUT_ATTRIBUTES = 256
+MAX_BATCH_PUT_ITEMS = 25
+
 class ItemThread(threading.Thread):

     def __init__(self, name, domain_name, item_names):
@@ -233,6 +236,8 @@
         @rtype: bool
         @return: True if successful
         """
+        if len(attributes) > MAX_PUT_ATTRIBUTES:
+            raise boto.BotoClientError("Can only insert %d attributes in PutAttributes (trying to insert %d)" % (MAX_PUT_ATTRIBUTES, len(attributes)))
         domain, domain_name = self.get_domain_and_name(domain_or_name)
         params = {'DomainName' : domain_name,
                   'ItemName' : item_name}
@@ -261,6 +266,9 @@
         @rtype: bool
         @return: True if successful
         """
+        if len(items) > MAX_BATCH_PUT_ITEMS:
+            raise boto.BotoClientError("Can only insert %d items in BatchPutAttributes (trying to insert %d)" %(MAX_BATCH_PUT_ITEMS, len(items)))
+
         domain, domain_name = self.get_domain_and_name(domain_or_name)
         params = {'DomainName' : domain_name}
         self.build_batch_list(params, items, replace)

Original comment by matthewh...@gmail.com on 29 Apr 2009 at 2:24

GoogleCodeExporter commented 9 years ago
Hmm. Seems to be working for me.

In [1]: import boto

In [2]: c = boto.connect_sdb()

In [3]: d = c.lookup('my_domain')
send: 'GET /?AWSAccessKeyId=0CZQCKRS3J69PZ6QQQR2&Action=Query&DomainName=my_domain&MaxNumberOfItems=1&QueryExpression=&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2009-04-29T02%3A31%3A36&Version=2007-11-07&Signature=uI9sYKluAC3h0raZpqiaSm5ZILb4tp9Ht/fIOPV%2B20E%3D HTTP/1.1\r\nHost: sdb.amazonaws.com:443\r\nAccept-Encoding: identity\r\nUser-Agent: Boto/1.7a (darwin)\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Content-Type: text/xml
header: Transfer-Encoding: chunked
header: Date: Wed, 29 Apr 2009 02:31:36 GMT
header: Server: Amazon SimpleDB

In [4]: item3 = {'name3_1' : 'value3_1',
   ...: 'name3_2' : 'value3_2',
   ...: 'name3_3' : ['value3_3_1', 'value3_3_2']}

In [5]: item4 = {'name4_1' : 'value4_1',
   ...: 'name4_2' : ['value4_2_1', 'value4_2_2'],
   ...: 'name4_3' : 'value4_3'}

In [6]: items = {'item3' : item3, 'item4' : item4}

In [7]: d.batch_put_attributes(items)
send: 'GET /?AWSAccessKeyId=0CZQCKRS3J69PZ6QQQR2&Action=BatchPutAttributes&DomainName=my_domain&Item.0.Attribute.0.Name=name3_2&Item.0.Attribute.0.Replace=true&Item.0.Attribute.0.Value=value3_2&Item.0.Attribute.1.Name=name3_3&Item.0.Attribute.1.Replace=true&Item.0.Attribute.1.Value=value3_3_1&Item.0.Attribute.2.Name=name3_3&Item.0.Attribute.2.Replace=true&Item.0.Attribute.2.Value=value3_3_2&Item.0.Attribute.3.Name=name3_1&Item.0.Attribute.3.Replace=true&Item.0.Attribute.3.Value=value3_1&Item.0.ItemName=item3&Item.1.Attribute.0.Name=name4_1&Item.1.Attribute.0.Replace=true&Item.1.Attribute.0.Value=value4_1&Item.1.Attribute.1.Name=name4_3&Item.1.Attribute.1.Replace=true&Item.1.Attribute.1.Value=value4_3&Item.1.Attribute.2.Name=name4_2&Item.1.Attribute.2.Replace=true&Item.1.Attribute.2.Value=value4_2_1&Item.1.Attribute.3.Name=name4_2&Item.1.Attribute.3.Replace=true&Item.1.Attribute.3.Value=value4_2_2&Item.1.ItemName=item4&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2009-04-29T02%3A33%3A32&Version=2007-11-07&Signature=OJDB41JOo%2BLClB21Gf7sRcdQf50UBKhTsZhHPZTCYaI%3D HTTP/1.1\r\nHost: sdb.amazonaws.com:443\r\nAccept-Encoding: identity\r\nUser-Agent: Boto/1.7a (darwin)\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Content-Type: text/xml
header: Transfer-Encoding: chunked
header: Date: Wed, 29 Apr 2009 02:33:31 GMT
header: Server: Amazon SimpleDB
Out[7]: True

Original comment by Mitch.Ga...@gmail.com on 29 Apr 2009 at 2:35

GoogleCodeExporter commented 9 years ago
hmmmm, can you try running my attached testcase?  It appears to run, but when I
query the domain (the testcase does this), it says it's empty....

Original comment by matthewh...@gmail.com on 29 Apr 2009 at 3:07

Attachments:

GoogleCodeExporter commented 9 years ago
Hi Mitch,
How did you set up your ipython to display what has been sent? Sorry, this is not a
very boto-specific comment.
Thanks

Original comment by norman.k...@gmail.com on 6 May 2009 at 9:42

GoogleCodeExporter commented 9 years ago
The Item.update() method was introduced here, and its "inventor" wanted "to be able
to use the dict method update() to update item attributes".

Currently, it's implemented as follows (boto 1.8.d):

def update(self, other_dict):
        if self._dict == None:
            self.load()
        if self.active:
            self.domain.put_attributes(self.name, self, replace)
        self._dict.update(other_dict)

Hence, when an Item is update()d with an other_dict, the change is not applied
remotely, even in the active=True state. I don't see the sense of
`self.domain.put_attributes(self.name, self, replace)` (which could be shortened to
self.save()) *before* the actual dictionary update. So I would suggest doing it
this way:

def update(self, other_dict):
        if self._dict == None:
            self.load()
        self._dict.update(other_dict)
        if self.active:
            self.save()

Does this make sense, or did I overlook something?

Thank you,

Jan-Philip Gehrcke

Original comment by jgehr...@googlemail.com on 21 Jul 2009 at 8:50
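
The ordering argument in the comment above can be checked without touching SimpleDB. The sketch below is a stand-alone illustration, not boto's actual Item class: `FakeDomain` and `SketchItem` are hypothetical stand-ins, and `save()` is assumed to push the item's full local state through `put_attributes`. It only demonstrates that updating the local dictionary first means the remote call sees the new values.

```python
class FakeDomain:
    """Hypothetical stand-in for an SDB domain; records what would be sent."""

    def __init__(self):
        self.calls = []

    def put_attributes(self, item_name, attributes, replace=True):
        # Copy the attributes so later local changes do not alter the record.
        self.calls.append((item_name, dict(attributes), replace))
        return True


class SketchItem(dict):
    """Hypothetical dict-like item that mirrors changes to its domain."""

    def __init__(self, domain, name, active=True):
        super().__init__()
        self.domain = domain
        self.name = name
        self.active = active

    def save(self, replace=True):
        # Assumed behaviour: send the item's complete local state.
        self.domain.put_attributes(self.name, self, replace)

    def update(self, other_dict):
        super().update(other_dict)  # change the local state first ...
        if self.active:
            self.save()             # ... then push it to the domain


domain = FakeDomain()
item = SketchItem(domain, 'item1')
item.update({'color': 'red'})
```

With this order, the single recorded call carries `{'color': 'red'}`. Calling `save()` before the local update would instead send the state from before the change, which is the problem described above.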