taoensso / faraday

Amazon DynamoDB client for Clojure
https://www.taoensso.com/faraday
Eclipse Public License 1.0

Is the scan function safe for large tables? #112

Closed brosenan closed 5 years ago

brosenan commented 7 years ago

Is the scan function safe to use with extremely large tables? I see from the code that it handles pagination by concatenating paginated results into a single sequence using merge-more, but I couldn't figure out from reading the code whether this sequence is lazy.

If this sequence is not lazy, there are of course two major drawbacks when using it with large tables:

  1. Scan will return only after consuming the entire table, or the entire segment (if assigned), and
  2. Scan will consume all available memory and crash when no more memory is left.

I believe the sequence is lazy, but I just wanted to make sure.

Thanks, Boaz

geekingfrog commented 7 years ago

scan will not actually scan the whole table. It returns a (strict) vector of elements with some metadata attached. One of the metadata keys is :last-prim-kvs. If it is non-nil, the scan isn't complete, and you'll have to call scan again, supplying this :last-prim-kvs, to continue scanning.

I have code looking a bit like that:

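;; Assumes (require '[taoensso.faraday :as faraday]) and that `opts` (client
;; options) and `table-name` are defined elsewhere.
(defn scan-table
  [callback]
  (loop [kvs :none]
    ;; The first request has no start key; later ones resume from the last key seen.
    (let [query (if (= :none kvs) {} {:last-prim-kvs kvs})
          results (faraday/scan opts table-name query)
          next-kvs (:last-prim-kvs (meta results))]
      ;; Hand each page to the callback before requesting the next one.
      (callback results)
      (when-not (nil? next-kvs) (recur next-kvs)))))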

Don't forget to add a :limit option if needed, and use a rate limiter to avoid exceeding your provisioned capacity.

rwilson commented 7 years ago

Here's an approach similar to @geekingfrog's, but operating semi-lazily:

(defn lazy-scan
  ([client-opts table] (lazy-scan client-opts table nil))
  ([client-opts table opts]
   (lazy-seq
    (let [results (faraday/scan client-opts table opts)
          next-kvs (:last-prim-kvs (meta results))]
      (if next-kvs
        (lazy-cat results (lazy-scan client-opts table (assoc opts :last-prim-kvs next-kvs)))
        results)))))

I say semi-lazily, because production will stay ahead of consumption, but in chunks related somewhat to whatever initial :limit value is specified in opts.
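To make the page-at-a-time behaviour concrete, here is a minimal sketch that runs without DynamoDB. It is not Faraday code: `fake-scan`, `table`, and `calls` are hypothetical stand-ins, and `:last-prim-kvs` is simulated as a plain index rather than a map of key attributes. The lazy wrapper has the same shape as `lazy-scan` above, and the counter shows how many pages were requested.

```clojure
;; Hypothetical in-memory "table" and a counter of simulated scan requests.
(def table (vec (range 10)))
(def calls (atom 0))

(defn fake-scan
  "Stand-in for a paginated scan (an assumption, not the Faraday API).
  Returns one page of up to `limit` items. If more items remain, the page
  carries {:last-prim-kvs <next index>} as metadata."
  [{:keys [limit last-prim-kvs] :or {limit 3}}]
  (swap! calls inc)
  (let [start (or last-prim-kvs 0)
        end   (min (+ start limit) (count table))
        page  (subvec table start end)]
    (with-meta page (when (< end (count table)) {:last-prim-kvs end}))))

(defn lazy-fake-scan
  "Same structure as lazy-scan above, but built on fake-scan."
  ([] (lazy-fake-scan {:limit 3}))
  ([opts]
   (lazy-seq
    (let [results  (fake-scan opts)
          next-kvs (:last-prim-kvs (meta results))]
      (if next-kvs
        (lazy-cat results (lazy-fake-scan (assoc opts :last-prim-kvs next-kvs)))
        results)))))

;; Asking for two items still fetches a whole page of three.
(reset! calls 0)
(doall (take 2 (lazy-fake-scan)))
(println "pages fetched for 2 items:" @calls)

;; Crossing a page boundary triggers exactly one more request.
(reset! calls 0)
(doall (take 4 (lazy-fake-scan)))
(println "pages fetched for 4 items:" @calls)
```

In this simulation the consumer never pays for pages it does not reach, but each request always materialises a full page, which is the sense in which the sequence is only semi-lazy.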

belucid commented 5 years ago

@brosenan and @rwilson are we satisfied here? Can this one be closed? Do you suggest something be added to the docs around scan?

rwilson commented 5 years ago

I think it can be closed; it's reasonably well documented already via the AWS docs and the :limit parameter.

belucid commented 5 years ago

Sounds good @rwilson

green-coder commented 5 years ago

I think that the original poster's question has been misunderstood. He went to read the source code, found the function merge-more, expressed concerns about what it does, but nobody answered about that function w.r.t. laziness.

green-coder commented 5 years ago

@brosenan To get just a piece of the query's result, you need to specify {:limit n, :span-reqs {:max m}} in the options. You will get up to (* n m) items in your results, i.e. m requests of up to n items concatenated together in the result.

By default, (-> options :span-reqs :max) is set to 5 in the current version of Faraday (1.9.0).
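As a rough illustration of that bound, the sketch below builds an options map and computes the maximum number of items a single scan call could return. The `max-items` helper, the example values, and the names in the commented-out call (`far`, `client-opts`, `:my-table`) are assumptions for illustration only; the default of 5 is the one quoted above for Faraday 1.9.0.

```clojure
;; Example options: pages of up to 100 items, and a single request per call.
(def scan-opts {:limit 100 :span-reqs {:max 1}})

(defn max-items
  "Hypothetical helper: upper bound (* n m) on the items returned by one scan
  call, where n is :limit and m is (-> opts :span-reqs :max). Falls back to 5
  requests when :span-reqs is absent. Requires :limit to be set."
  [{:keys [limit span-reqs]}]
  (* limit (:max span-reqs 5)))

(println (max-items scan-opts))       ; one request of up to 100 items
(println (max-items {:limit 100}))    ; default of 5 requests

;; With Faraday loaded and DynamoDB reachable, the call would look roughly like:
;; (far/scan client-opts :my-table scan-opts)
```

Setting `:max` to 1 makes each call correspond to a single request, which fits the manual `:last-prim-kvs` loops shown earlier in the thread.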