scicloj / tablecloth

Dataset manipulation library built on the top of tech.ml.dataset
https://scicloj.github.io/tablecloth
MIT License

separate-columns with default target naming #78

Closed genmeblog closed 2 years ago

genmeblog commented 2 years ago

https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/seperate.20with.20custom.20fn

genmeblog commented 2 years ago

This will be a (minor) breaking change. By default, the source column will be replaced by the new ones in every case.

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y))
;; => _unnamed [1 8]:
;;    | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;;    |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;;    |  1 |    2 |    3 |    9 |   10 |   11 |   22 |   33 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y reverse))
;; => _unnamed [1 8]:
;;    | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;;    |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;;    |  1 |   33 |   22 |   11 |   10 |    9 |    3 |    2 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y (fn [input]
                             (zipmap "somenames" input))))
;; => _unnamed [1 7]:
;;    | :x |  a | s |  e |  m |  n | o |
;;    |---:|---:|--:|---:|---:|---:|--:|
;;    |  1 | 22 | 2 | 10 | 33 | 11 | 3 |
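
When the default `:y-0`, `:y-1`, ... names are not wanted, target names can presumably be passed explicitly. The sketch below assumes the four-argument arity `(separate-column ds column target-columns separator)`; the names `:first`, `:second` and `:third` are illustrative only.

```clojure
(require '[tablecloth.api :as tc])

(def named
  (-> (tc/dataset {:x [1] :y [[2 3 9]]})
      ;; explicit target names instead of the default :y-0, :y-1, ...
      ;; `identity` returns the stored vector unchanged as the separated values
      (tc/separate-column :y [:first :second :third] identity)))

(tc/column-names named)
```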

behrica commented 2 years ago

I am now wondering whether this use case should be handled by "tc/separate-column" or whether it requires a completely new function, for performance reasons. The seq in your example, [2 3 9 10 11 22 33], could just as well be a double array, like this:

(def ds
  (-> (tc/dataset {:x [1] :y [(double-array [2 3 9 10 11 22 33])]})))

Separating this (especially when it is large) could be done in an optimized way like this:

(->
 (tech.v3.datatype/concat-buffers (:y ds))
 (tech.v3.tensor/reshape [(tc/row-count ds)
                          (-> ds :y first count)])
 (tech.v3.dataset.tensor/tensor->dataset))

(plus replacing the column :y in ds with the new dataset)

I suppose this is significantly faster than the generic "separate" implementation you have in tc/separate-column. It works as well for the persistent vector case above.
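
The "replace the column" step is not spelled out above, so here is a minimal sketch of one way it might look. `separate-array-column` is a hypothetical helper, not part of tablecloth. It assumes that every row holds a numeric array of the same length and that `tensor->dataset` names its columns by index, so they are renamed to `:y-0`, `:y-1`, ... to mimic the default naming.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt]
         '[tech.v3.dataset.tensor :as ds-tensor])

;; Hypothetical helper, not part of tablecloth.
;; Assumes each row of `colname` holds a numeric array of equal length.
(defn separate-array-column
  [ds colname]
  (let [col      (get ds colname)
        unpacked (-> (dtype/concat-buffers col)
                     (dtt/reshape [(tc/row-count ds) (count (first col))])
                     (ds-tensor/tensor->dataset))
        ;; assumed: the generated column names are indices (0, 1, ...)
        renames  (into {}
                       (map (fn [n] [n (keyword (str (name colname) "-" n))]))
                       (tc/column-names unpacked))]
    (-> ds
        (tc/drop-columns colname)                        ; remove the packed column
        (tc/append (tc/rename-columns unpacked renames))))) ; bind the new columns

(def separated
  (separate-array-column
   (tc/dataset {:x [1 2] :y [(double-array [2 3 9]) (double-array [4 5 6])]})
   :y))
```

Unlike the generic `separate-column`, this sketch puts the new columns at the end of the dataset rather than at the position of the source column.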

Test cases could be these:

(def ds-1
  (-> (tc/dataset {:x [1 2] :y [[2 3 9 10 11 22 33]
                                [2 3 9 10 11 22 33]]})))

(def ds-2
  (-> (tc/dataset {:x [1] :y [(double-array [2 3 9 10 11 22 33])]})))

(def ds-3
  (-> (tc/dataset {:x [1] :y [(list 2 3 9 10 11 22 33)]})))

(->
 (tech.v3.datatype/concat-buffers (:y ds-1))
 (tech.v3.tensor/reshape [(tc/row-count ds-1)
                          (-> ds-1 :y first count)])
 (tech.v3.dataset.tensor/tensor->dataset))

behrica commented 2 years ago

In some cases we even want to get the tensor back rather than the dataset, in which case the last tensor->dataset call can be omitted.

I think it is a useful addition to tablecloth; we often go from a dataset to a conceptual 2-d matrix (with the matrix rows stored inside a single dataset column).
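
As a sketch of the tensor-only variant, the same pipeline can stop after `reshape`. The small dataset here is made up for illustration; `dtype/shape` and `dtt/mget` are only used to inspect the result.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt])

;; illustrative data: two rows, each holding a double array of length 3
(def packed-ds
  (tc/dataset {:x [1 2]
               :y [(double-array [2 3 9]) (double-array [4 5 6])]}))

;; same pipeline as above, without the final tensor->dataset call
(def m
  (-> (dtype/concat-buffers (:y packed-ds))
      (dtt/reshape [(tc/row-count packed-ds)
                    (-> packed-ds :y first count)])))

(dtype/shape m)   ; rows x columns of the resulting matrix
(dtt/mget m 1 2)  ; element at row 1, column 2
```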

behrica commented 2 years ago

Not sure about the reverse, i.e. starting from a dataset with several (numeric) columns and squeezing them into a single column of native arrays.

behrica commented 2 years ago

For the reverse, something like this works, though I am not sure it is optimal:


(def ds
  ;; => _unnamed [3 2]:
  ;;    | :x-0 | :x-1 |
  ;;    |-----:|-----:|
  ;;    |    1 |    4 |
  ;;    |    2 |    5 |
  ;;    |    3 |    6 |
  (->
   (tc/dataset {:x-0 [1 2 3]
                :x-1 [4 5 6]})))

(def rows
  (->
   (tech.v3.datatype/concat-buffers (tc/columns ds))
   (tech.v3.tensor/reshape [(tc/column-count ds)
                            (tc/row-count ds)])
   (tech.v3.tensor/transpose [1 0])
   (tech.v3.tensor/rows)))

(tc/dataset {:x (map tech.v3.datatype/->double-array rows)})
;; => _unnamed [3 1]:
;;    |          :x |
;;    |-------------|
;;    | [D@1600011f |
;;    |  [D@fc74513 |
;;    | [D@20c51970 |

behrica commented 2 years ago

I would think that a pair of functions to go from one representation to the other would be useful.

genmeblog commented 2 years ago

It looks like a very specific case, a kind of matrix transpose. I'm not sure it belongs in TC.

The last case (the reverse) can be done with join-columns and {:result-type double-array}.

BTW, does tensor work on non-numerical data?
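
A minimal sketch of the join-columns route mentioned above, using the same 3 x 2 example data. The target column name `:x` is just an example.

```clojure
(require '[tablecloth.api :as tc])

(def wide
  (tc/dataset {:x-0 [1 2 3]
               :x-1 [4 5 6]}))

;; pack each row of :x-0 and :x-1 into one double array stored in column :x
(def packed
  (tc/join-columns wide :x [:x-0 :x-1] {:result-type double-array}))

;; convert the arrays to vectors for readable printing
(mapv vec (packed :x))
```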

genmeblog commented 2 years ago

My original solution landed in 6.103

behrica commented 2 years ago

Numeric only. I think there should be 2 methods for this in TC that operate on a Dataset. It's a specific form of separate.

behrica commented 2 years ago

Numeric only. I think there should be 2 methods for this in TC that operate on a Dataset. It's a specific form of separate, and it requires arrays of the same type and length in each row. I can do a PR, as I have a use case.

behrica commented 2 years ago

But it does indeed go into numeric territory, going from a dataset to a matrix.

behrica commented 2 years ago

I will try it out forward and backward. I have the impression, without proof, that my code above could be far more performant, though with some constraints.

I will measure it on a larger case.

behrica commented 2 years ago

As I thought. On a 1000 x 1000 matrix-like dataset of doubles:

(def ds (api/dataset {:x (map 
                          (fn [_] (double-array (range 1000)))
                          (range 1000))}))

we get a factor of 50 to 100 difference in execution time

(defn use-separate []
 (api/separate-column ds :x))

(defn use-reshape []
 (->
  (tech.v3.datatype/concat-buffers (:x ds))
  (tech.v3.tensor/reshape [(api/row-count ds)
                           (-> ds :x first count)])
  (tech.v3.dataset.tensor/tensor->dataset)))

(time (def _ (use-separate)))
;; "Elapsed time: 3371.491881 msecs"
(time (def _ (use-reshape)))
;; "Elapsed time: 76.420533 msecs"

for producing the same dataset.

behrica commented 2 years ago

For the reverse there is less of a difference, but still a factor of 5:

(def ds-with-cols (use-reshape))

(time
 (def _  (api/join-columns ds-with-cols :x (api/column-names ds-with-cols) {:result-type double-array})))
;; "Elapsed time: 333.478279 msecs"

(time
 (let [rows
       (->
        (tech.v3.datatype/concat-buffers (api/columns ds-with-cols))
        (tech.v3.tensor/reshape [(api/column-count ds-with-cols)
                                 (api/row-count ds-with-cols)])
        (tech.v3.tensor/transpose [1 0])
        (tech.v3.tensor/rows))]
   (api/dataset {:x (map tech.v3.datatype/->double-array rows)})))
;; "Elapsed time: 66.384538 msecs"

behrica commented 2 years ago

But I was wrong above: the code works with non-numeric data as well.

genmeblog commented 2 years ago

Yes, join-columns and separate-column are slow; I know that. These two functions are more general than just packing/unpacking a sequence to/from column(s). join-columns and separate-column are more or less the same as tidyr's extract, separate and unite functions.

Your example is just one special case, which can certainly be optimized. If you have an idea for a PR, it's always welcome.