tolitius / cbass

adding "simple" to HBase
Eclipse Public License 1.0
24 stars 11 forks source link

scan returns single CF #10

Open andy-innersource opened 6 years ago

andy-innersource commented 6 years ago

Hi,

I have been using this library for a while and it is awesome. However, there are a few things which either could be improved or complicate life a lot.

The original code for some reason in scan only returns a single CF. I use following code in my version to return all CFs

(defn hdata->map* [^Result data]
  (when-let [r (.getRow data)]
    (into {} (for [[k vs] (.getNoVersionMap data)]
               [(keyword (String. ^bytes k))
                (into {} (for [[kv vv] vs]
                           [(keyword (String. ^bytes kv))
                            (@unpack vv)]
                           ))]
               ))
    ))

I wonder how we could extend API to have above capability?

Also, in scan itself we prefer vectors are resulting structure

      (cond->> rmap
        (not with-ts?) (without-ts)
        (not lazy?) (into [])))))

A.

tolitius commented 6 years ago
  1. it's been some time since I had to look at this, can you give some data examples (i.e. based on examples in README) with a single CF returned where you'd expect multiple ones to return?
  2. is there a reason you prefer vectors?
andy-innersource commented 6 years ago
  1. if in lazy? case we return a seq, I found it weird to have a map as result otherwise
andy-innersource commented 6 years ago

1. in shell:

hbase(main):001:0> create "galaxy:planet","stat","satellite"
Created table galaxy:planet

in repl:

user=> (store conn "galaxy:planet" "earth" "stat" {:inhabited? true :population 7125000000 :age "4.543 billion years"})
nil
user=> (store conn "galaxy:planet" "earth" "satellite" {:artificial ["spy1" "comm1"] :natural ["moon"]})
nil
user=> (scan conn "galaxy:planet")
{"earth" {:artificial ["spy1" "comm1"], :natural ["moon"]}}

whole stat CF is missing in the result

andy-innersource commented 6 years ago

while my version returns

user=> (scan* conn "galaxy:planet")
[["earth" {:satellite {:artificial ["spy1" "comm1"], :natural ["moon"]}, :stat {:age "4.543 billion years", :inhabited? true, :population 7125000000}}]]
(defn hdata->map* [^Result data]
  (when-let [r (.getRow data)]
    (into {} (for [[k vs] (.getNoVersionMap data)]
               [(keyword (String. ^bytes k))
                (into {} (for [[kv vv] vs]
                           [(keyword (String. ^bytes kv))
                            (@unpack vv)]
                           ))]
               ))
    ))

(defn results->maps* [results row-key-fn]
  (for [^Result r results]
    [(row-key-fn (.getRow r))
     (hdata->map* r)]))

(defn scan* [conn table & {:keys [row-key-fn limit with-ts? lazy?] :as criteria}]
  (with-open [^HTableInterface h-table (get-table conn table)]
    (let [results (-> (.iterator (.getScanner h-table (scan-filter criteria)))
                      iterator-seq)
          row-key-fn (or row-key-fn #(String. %))
          rmap (results->maps* (if-not limit
                               results
                               (take limit results))
                             row-key-fn)]
      (cond->> rmap
        (not with-ts?) (without-ts)
        (not lazy?) (into [])))))
tolitius commented 6 years ago
  1. ah, good point. besides it should be a vector of maps, not vector of vectors as per 2.

  2. has to do with the fact that there is no lazy map, but there is a lazy seq: i.e. we can't return a lazy map of results.

I agree that lazy? is inconsistent by being a vector rather than a map, but it is best to work with maps since they are explicitly addressable / navigatable.

lazy was added later and I remember thinking about returning chunk maps from lazy? (i.e. stream it). there might be different ideas, but maps are far more superior to work with than vectors of vectors.

andy-innersource commented 6 years ago

Of course, I see the value of returning map but IMHO I would split scan into two functions offering different semantics and get rid of the lazy? thing. lazy? implies longs sequences. I do have indeed two use cases in my code: quick lookups into a map as well as scanning thru millions of rows.

"1." is bigger deal for me. I can live with my version of scan* for now though (it is quick and dirty, waiting for an ultimate solution)

tolitius commented 6 years ago

yea, feel free to submit a pull request. as long as it does not break the existing API (besides returning results for multiple CFs) I see no reason it would not be merged