techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0
680 stars 35 forks source link

Documentation and the actual behavior of `select` do not match. #387

Closed lkyhfx closed 9 months ago

lkyhfx commented 10 months ago

Run the following example

(ds/select (ds/->dataset [{:a 1 :b 2} {:a 2 :b 3} {:a 3 :b 4}]) :all  [true true true])

This produces:

| :a | :b |
|---:|---:|
|  2 |  3 |
|  2 |  3 |
|  2 |  3 |

However, the documentation for select states:

selection - either keyword :all, a list of indexes to select, or a list of booleans where
    the index position of each true value indicates an index to select. When providing indices,
    duplicates will select the specified index position more than once.

When passing a boolean list, the position of true will be selected. Therefore, it should produce:

| :a | :b |
|---:|---:|
|  1 |  2 |
|  2 |  3 |
|  3 |  4 |

Is the documentation outdated?

harold commented 10 months ago

Hi! Thanks for the issue - very interesting one.

Maybe something did change here, because something similar came up recently on zulip: https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/select.20via.20boolean.20broken.3F

In your case here, you can get the result you expect by not using a clojure vector:

user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset [{:a 1 :b 2} {:a 2 :b 3} {:a 3 :b 4}]))
#'user/ds
user> (ds/select ds :all  [true true true])
_unnamed [3 2]:

| :a | :b |
|---:|---:|
|  2 |  3 |
|  2 |  3 |
|  2 |  3 |
user> (ds/select ds :all  (dtype/make-list :boolean [true true true]))
_unnamed [3 2]:

| :a | :b |
|---:|---:|
|  1 |  2 |
|  2 |  3 |
|  3 |  4 |

As we discovered in that zulip exchange, the types are determined in a fairly specific way: https://github.com/techascent/tech.ml.dataset/blob/cdb71cf623f426a2b5e883aeb355377892944bbb/src/tech/v3/dataset/impl/column.clj#L198

So, ...

user> (dtype/elemwise-datatype [true true true])
:object
user> (dtype/elemwise-datatype (dtype/make-list :boolean [true true true]))
:boolean

Would be interested to know what @cnuernber thinks about this now that it has come up twice. (:

harold commented 9 months ago

PR: https://github.com/techascent/tech.ml.dataset/pull/390

Lots of options here, not sure this is the best, but we can iterate from here... (: