natecode opened this issue 8 years ago
@stevearc This sounds good. I like that gen() would be implemented in terms of page(). Hopefully page size would be configurable as well in the first call.
I ended up going about it in a slightly different way than initially planned, but it's pretty similar.
You can see an example of what the flow would look like in this test case
You should be able to configure the query behavior to do what you want with the Limit class. It's in dynamo3, but it's also imported in flywheel so you can import it directly from there.
Released flywheel 0.4.6. Let me know if this works for you!
I ran some tests on this code, and it seems to be basically working. I found some issues with the current approach I’m not sure how to handle.
Your test code shows how to use pkdict to get the primary key from the last item in a result set in order to restart the scan at that point. However, for a query on an index, it looks like I need to specify all the keys (and includes) from that index. How do I get the equivalent of pkdict for an index?
I’m running a query of an index with 4 matching items, with a Limit(scan_limit=2, strict=True). Debug log shows that each query is submitted properly and the results are what I expect, but the LastEvaluatedKey in the response for the 2nd query is non-null. (It is the last item in that result, item #4). According to the docs, it should be null instead when you hit the end of the results.
Sequence:
... I expect LastEvaluatedKey to be null here
If I continue with the query, the 3rd query gets a response of 0 items, so my code terminates then. But it would be nice not to have an extra round trip each time just to find there are no more results. I’m not sure if this is a DDB issue or Flywheel.
Here’s my rough workaround:
last_evaluated_key = None
while True:
    # This calls .filter().limit().all() on the index
    versions = app.get_items(engine, last_evaluated_key)
    for x in versions:
        print x.id
    # A partial (or empty) page means we're done
    if len(versions) < PAGE_SIZE:
        break
    # Get the primary key for the index query so we can resume
    dump = x.ddb_dump_()
    last_evaluated_key = {'id': dump['id'], 'oid': dump['oid'], 'date': dump['date']}
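One common way to avoid the trailing empty round trip is to over-fetch by one item: if page_size + 1 items come back, you know at least one more page exists. This sketch runs against a plain Python list rather than a real DynamoDB call, and get_page with these arguments is a hypothetical stand-in for the paged query.

```python
def get_page(all_items, start, page_size):
    # Stand-in for the paged query: ask for one extra item so the
    # caller can tell whether more results exist without another trip.
    chunk = all_items[start:start + page_size + 1]
    has_more = len(chunk) > page_size
    return chunk[:page_size], has_more

items = list(range(10))
page, has_more = get_page(items, 8, 4)   # last, partial page
# page == [8, 9], has_more == False: no extra query needed to detect the end
```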
I assume item_limit is when you want a specific number of items and you don’t care how many per Query/Scan operation, nor do you want to handle pagination. In the scan_limit case, you want to control the number of items fetched. Is that right?
Thanks!
If DynamoDB is scanning with a Limit and it happens to hit the limit right at the final item, it will still return a LastEvaluatedKey. ¯\_(ツ)_/¯
scan_limit is passed directly to DynamoDB as the Limit parameter. A careful reading of the docs reveals that Limit doesn't do what you would expect, nor probably what you want most of the time. It sets a hard limit on the number of items scanned, which is not the same as the number of items returned. If you don't have any scan filters it behaves as expected. If you have any scan filters, it's possible for the query/scan to return no results even if some exist in the table, because Dynamo will scan up to the Limit and then return whatever it found.
This is why I added item_limit. If you pass in an item_limit, dynamo3 will continue to query DynamoDB until either the item_limit is reached OR there are no more results in the DB.
min_scan_limit is there for the case when you're fetching a small item_limit (say, 1). If you set the Limit to just 1, you may have to do many, many queries to finally retrieve a result; min_scan_limit makes sure you're always scanning a minimum number of items. Since your results only come back in pages, the total results may exceed the item_limit (e.g. you only need 1 more, you make another query and get back 5 results). By default it will return all of the results, but you can pass in strict=True to chop off the extras.
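To make the distinction concrete, here is a toy pure-Python simulation of the semantics described above. It is not dynamo3's implementation, and scan_page/fetch are invented names: the point is that Limit caps the items scanned per request, while item_limit keeps issuing requests until enough matches accumulate.

```python
def scan_page(items, predicate, start, scan_limit):
    """One Query/Scan call: examine at most scan_limit items, then stop.
    Returns (matching items, next start position, or None if exhausted)."""
    end = min(start + scan_limit, len(items))
    matches = [x for x in items[start:end] if predicate(x)]
    return matches, (end if end < len(items) else None)

def fetch(items, predicate, item_limit, scan_limit, strict=False):
    """Keep paging until item_limit matches are found or the table ends,
    which is the behavior item_limit adds on top of the raw Limit."""
    results, start = [], 0
    while start is not None and len(results) < item_limit:
        page, start = scan_page(items, predicate, start, scan_limit)
        results.extend(page)
    # The final page can overshoot item_limit; strict=True chops the extras.
    return results[:item_limit] if strict else results

items = list(range(100))
is_big = lambda x: x >= 90

# A single request with Limit=10 scans items 0-9 and returns nothing,
# even though matching items exist later in the table:
page, next_start = scan_page(items, is_big, 0, 10)   # page == []

# item_limit keeps paging until it finds enough matches:
fetch(items, is_big, item_limit=3, scan_limit=10, strict=True)  # [90, 91, 92]
```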
I hope this clears up some of the confusion. I still think item_limit isn't a great name, so if you have suggestions for a better one please let me know. Aaaaand I'll look for somewhere to put a method that will construct primary keys for indexes.
I actually had to pass in an ExclusiveStartKey comprised of 3 fields in the case of an includes index that had a hash key, range key, and one other included field. That surprised me, since I thought it would just need the hash/range key for the index. Including anything less than those 3 fields gives this exception:
File "bluesteel/models.py", line 347, in <module>
main()
File "bluesteel/models.py", line 345, in main
last_key = {'oid': x.oid, 'date': x.ddb_dump_()['date']}
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/flywheel/query.py", line 79, in gen
for result in results:
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/six.py", line 558, in next
return type(self).__next__(self)
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/dynamo3/result.py", line 254, in __next__
return six.next(self.iterator)
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/dynamo3/result.py", line 283, in fetch
data = self.connection.call(*self.args, **self.kwargs)
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/dynamo3/connection.py", line 230, in call
exc.re_raise()
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/dynamo3/exception.py", line 22, in re_raise
six.reraise(type(self), self, self.exc_info[2])
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/dynamo3/connection.py", line 220, in call
data = op(**kwargs)
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/botocore/client.py", line 310, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/botocore/client.py", line 396, in _make_api_call
raise ClientError(parsed_response, operation_name)
dynamo3.exception.DynamoDBError: ValidationException: Exclusive Start Key must have same size as table's key schema
Args: {
'ExclusiveStartKey': {'date': {'N': u'1450915200.000000'},
                     'oid': {'S': u'av-J4DFNhwnbDf'}},
...
Ignore the lazy use of ddb_dump_(), as this was just prototype code.
This index was defined as:
__metadata__ = {
'global_indexes': [
GlobalIndex.include('appversions-for-app', 'app_id', 'date',
includes=['oid']),
],
}
Good explanation on the various limits. That should definitely go in the docs. Maybe call it desired_page_size or something?
I may not be reading this correctly, but from the example you gave it looks like you were passing in date and oid when you needed to pass in app_id and date. Have you tried those two, and was the result the same? I tried it on a global index locally and it seemed to work fine.
After a fantastic test case by Nate and some digging, it appears that the ExclusiveStartKey for an index query (global and local) needs to contain both the primary key of the index and the primary key of the table. This isn't documented anywhere I could find in the Dynamo docs, because they just say "use the thing we give you as LastEvaluatedKey" and assume that you won't be constructing it yourself.
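The rule itself is easy to express in code. The sketch below only illustrates the key-merging logic; the helper name and key names are hypothetical, not flywheel's actual implementation. The ExclusiveStartKey must carry both the index's key attributes and the table's primary key.

```python
def index_start_key(item, index_keys, table_keys):
    # DynamoDB wants BOTH the index key and the table primary key,
    # so merge the two attribute sets and pull them off the item.
    wanted = set(index_keys) | set(table_keys)
    return {k: item[k] for k in wanted}

# Hypothetical model: table key is 'oid', index key is ('app_id', 'date')
last = {'app_id': 'app-1', 'date': 1450915200.0, 'oid': 'av-1', 'extra': 'x'}
index_start_key(last, index_keys=('app_id', 'date'), table_keys=('oid',))
# → {'app_id': 'app-1', 'date': 1450915200.0, 'oid': 'av-1'}
```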
I'll account for that in the method I'm making for grabbing the primary key from an index and it should Just Work™
Very nice, thanks for tracking this down.
Okay, I pushed out version 0.4.7, which has the index_pk_dict_() method on the model and index_pk_dict() on the metadata object. See if that works for you.
Sorry, I’m getting a crash when using an index that has a datetime as part of the key. The key looks like this coming back from index_pk_dict() ok:
{'date': datetime.datetime(2015, 12, 20, 0, 0, tzinfo=<flywheel.fields.types.UTCTimezone object at 0x110c6c250>), 'oid': u'av-VmogFmEt6ey', 'app_id': u'app-IzGMSIS0p0m'}
But then the datetime is not being encoded, resulting in this exception:
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/flywheel/query.py", line 115, in all
    exclusive_start_key=exclusive_start_key))
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/flywheel/query.py", line 78, in gen
    **kwargs)
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/dynamo3/connection.py", line 1132, in query
    self.dynamizer.maybe_encode_keys(exclusive_start_key)
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/dynamo3/types.py", line 170, in maybe_encode_keys
    ret[k] = self.encode(v)
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/dynamo3/types.py", line 175, in encode
    return dict([self.raw_encode(value)])
File "/Users/nate/venvs/orangecrush/lib/python2.7/site-packages/dynamo3/types.py", line 156, in raw_encode
    (value, type(value)))
ValueError: No encoder for value '2015-12-20 00:00:00+00:00' of type '<type 'datetime.datetime'>'
If you're calling the index_pk_dict() method on the metadata object directly, ddb_dump defaults to False. Try passing in ddb_dump=True, or use the index_pk_dict_() method on the model itself.
I’d like to be able to provide an API to our users that is backed by a query that returns a large number of results. We’d like to be able to retrieve and store a cursor from a result set (query or scan), then create a new result set that picks up where the last one left off, possibly minutes later. I know that dynamo3 supports this internally, but it's not exported via flywheel yet.
Steve replied via email:
I think you'd have to have a new terminator on the query (like gen(), all()). I think you'd need page() to fetch a single page from DynamoDB so that you have some knowledge of where the database query left off. The default gen() implementation just provides a straight-up iterator that handles it all transparently. We could make page() return an object that extends list and has a 'cursor' attribute. Then you make the query terminators also take 'cursor' as a keyword argument, pass that through to dynamo3 and it should just work. I'll have to make a couple of changes in dynamo3 to support fetching pages instead of just iterators, but that should be pretty easy.
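The "list with a cursor attribute" idea from that reply can be sketched in a few lines. Page and the commented-out query.page() call are illustrative only, not flywheel's shipped API.

```python
class Page(list):
    """A page of query results plus the cursor needed to resume.
    cursor is None when there are no more results."""
    def __init__(self, items, cursor=None):
        super(Page, self).__init__(items)
        self.cursor = cursor

# A caller can stash page.cursor (e.g. return it from an API) and hand
# it back minutes later to pick up where the last result set left off:
page = Page(['item1', 'item2'], cursor={'id': 'item2'})
if page.cursor is not None:
    pass  # next_page = query.page(cursor=page.cursor)  # hypothetical
```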