techascent / tech.ml.dataset

A high-performance Clojure data processing system

concat columns with different datatype (columnwise-concat) #64

Closed: genmeblog closed this issue 4 years ago

genmeblog commented 4 years ago

How can I concat two columns with different datatypes, say :int32 and :int64?

I tried ds/concat but it's not possible. I also tried going through the readers, but the order matters and the first datatype is selected (instead of the wider one).

;; assumed aliases, inferred from the namespaces in the stack traces below:
;; (require '[tech.ml.dataset :as ds]
;;          '[tech.ml.dataset.column :as col]
;;          '[tech.v2.datatype :as dtype]
;;          '[tech.v2.datatype.readers.concat :as concat-rdr])

(def col1 (col/new-column :int32 (dtype/->reader [1 2 3 4] :int32)))
(def col2 (col/new-column :int64 (dtype/->reader [1 2 3 4 (dec Long/MAX_VALUE)] :int64)))

(col/new-column :result (concat-rdr/concat-readers (map dtype/->reader [col1 col2])))
;; --> Exception
(col/new-column :result (concat-rdr/concat-readers (map dtype/->reader [col2 col1])))
;; => #tech.ml.dataset.column<int64>[9]
;;    :result
;;    [1, 2, 3, 4, 9223372036854775806, 1, 2, 3, 4, ]

The issue is: I'm working on a reshape enhancement (pivot-longer). The enhancement is ready and all examples work but one. There is a dataset, who, with 56 columns that need to be joined. They are read from a CSV file, and some of the columns get the :int16 type while others get :int32. There is no way to join them, even with this (select-column-names is my helper):

(ds/columnwise-concat who (select-column-names who #(str/starts-with? % "new")))
;; --> Exception
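
(For context, a guess at the helper's shape; the real implementation lives in my repo, linked in the next comment:)

(defn select-column-names
  ;; hypothetical sketch: return the column names of ds matching pred
  [ds pred]
  (filter pred (ds/column-names ds)))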
genmeblog commented 4 years ago

The who dataset is here: https://github.com/genmeblog/techtest/tree/master/data

The API is updated, so you can see my implementation of pivot->longer.

cnuernber commented 4 years ago

Hmm, everything you tried should work, I guess, with automatic widening casts. I think this implies at least two issues.

genmeblog commented 4 years ago

You can try this snippet.

(->> '("new_sp_m014" "new_sp_m1524")
     (ds/select-columns (ds/->dataset "data/who.csv.gz"))
     (ds/columns)
     (map (comp :datatype meta)))
;; => (:int16 :int32)

(ds/columnwise-concat (ds/->dataset "data/who.csv.gz") '("new_sp_m014" "new_sp_m1524"))

1. Unhandled java.lang.IllegalArgumentException
   Value out of range for short: -2147483648

                   RT.java: 1173  clojure.lang.RT/shortCast
               casting.clj:  183  tech.v2.datatype.casting/int16-cast
               casting.clj:  183  tech.v2.datatype.casting/int16-cast
                column.clj:  166  tech.ml.dataset.impl.column.Column/__GT_reader
                concat.clj:   84  tech.v2.datatype.readers.concat/fn/fn
                  core.clj: 6912  clojure.core/mapv/fn
           ArrayChunk.java:   63  clojure.lang.ArrayChunk/reduce
             protocols.clj:  136  clojure.core.protocols/fn
             protocols.clj:  124  clojure.core.protocols/fn
             protocols.clj:   19  clojure.core.protocols/fn/G
             protocols.clj:   31  clojure.core.protocols/seq-reduce
             protocols.clj:   75  clojure.core.protocols/fn
             protocols.clj:   75  clojure.core.protocols/fn
             protocols.clj:   13  clojure.core.protocols/fn/G
                  core.clj: 6828  clojure.core/reduce
                  core.clj: 6903  clojure.core/mapv
                  core.clj: 6903  clojure.core/mapv
                concat.clj:   84  tech.v2.datatype.readers.concat/fn
                concat.clj:   84  tech.v2.datatype.readers.concat/fn
                concat.clj:  141  tech.v2.datatype.readers.concat/concat-readers
                concat.clj:  127  tech.v2.datatype.readers.concat/concat-readers
                concat.clj:  146  tech.v2.datatype.readers.concat/concat-readers
                concat.clj:  127  tech.v2.datatype.readers.concat/concat-readers
                  base.clj:  509  tech.ml.dataset.base/concat/fn
                  core.clj: 2755  clojure.core/map/fn
              LazySeq.java:   42  clojure.lang.LazySeq/sval
              LazySeq.java:   51  clojure.lang.LazySeq/seq
                 Cons.java:   39  clojure.lang.Cons/next
                   RT.java:  653  clojure.lang.RT/countFrom
                   RT.java:  643  clojure.lang.RT/count
                column.clj:  321  tech.ml.dataset.column/ensure-column-seq
                column.clj:  302  tech.ml.dataset.column/ensure-column-seq
               dataset.clj:  191  tech.ml.dataset.impl.dataset/new-dataset
               dataset.clj:  181  tech.ml.dataset.impl.dataset/new-dataset
               dataset.clj:  220  tech.ml.dataset.impl.dataset/new-dataset
               dataset.clj:  181  tech.ml.dataset.impl.dataset/new-dataset
               dataset.clj:  131  tech.ml.dataset.impl.dataset.Dataset/from_prototype
                  base.clj:  507  tech.ml.dataset.base/concat
                  base.clj:  481  tech.ml.dataset.base/concat
               RestFn.java:  139  clojure.lang.RestFn/applyTo
                  core.clj:  665  clojure.core/apply
                  core.clj:  660  clojure.core/apply
               dataset.clj:  257  tech.ml.dataset/columnwise-concat
               dataset.clj:  216  tech.ml.dataset/columnwise-concat
               dataset.clj:  269  tech.ml.dataset/columnwise-concat
               dataset.clj:  216  tech.ml.dataset/columnwise-concat

When the columns are swapped, a different exception is thrown:

(ds/columnwise-concat (ds/->dataset "data/who.csv.gz") '("new_sp_m1524" "new_sp_m014"))

1. Unhandled java.lang.Exception
   Column :value has mismatched datatypes: #{:int32 :int16}

                  base.clj:  513  tech.ml.dataset.base/concat/fn
                  core.clj: 2755  clojure.core/map/fn
              LazySeq.java:   42  clojure.lang.LazySeq/sval
              LazySeq.java:   51  clojure.lang.LazySeq/seq
                 Cons.java:   39  clojure.lang.Cons/next
                   RT.java:  653  clojure.lang.RT/countFrom
                   RT.java:  643  clojure.lang.RT/count
                column.clj:  321  tech.ml.dataset.column/ensure-column-seq
                column.clj:  302  tech.ml.dataset.column/ensure-column-seq
               dataset.clj:  191  tech.ml.dataset.impl.dataset/new-dataset
               dataset.clj:  181  tech.ml.dataset.impl.dataset/new-dataset
               dataset.clj:  220  tech.ml.dataset.impl.dataset/new-dataset
               dataset.clj:  181  tech.ml.dataset.impl.dataset/new-dataset
               dataset.clj:  131  tech.ml.dataset.impl.dataset.Dataset/from_prototype
                  base.clj:  507  tech.ml.dataset.base/concat
                  base.clj:  481  tech.ml.dataset.base/concat
               RestFn.java:  139  clojure.lang.RestFn/applyTo
                  core.clj:  665  clojure.core/apply
                  core.clj:  660  clojure.core/apply
               dataset.clj:  257  tech.ml.dataset/columnwise-concat
               dataset.clj:  216  tech.ml.dataset/columnwise-concat
               dataset.clj:  269  tech.ml.dataset/columnwise-concat
               dataset.clj:  216  tech.ml.dataset/columnwise-concat
genmeblog commented 4 years ago

Also, how do I convert a :string column/reader to an integer one (if the strings hold numbers)?

cnuernber commented 4 years ago

tech.ml.dataset.column/parse-column: https://github.com/techascent/tech.ml.dataset/blob/master/test/tech/ml/dataset/parse_test.clj#L295

I am working through a bunch of open city data, and their column formats are changing weekly, so I have a set of columns I force to string and then have to look at and figure out a parsing method for them.
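
For the string->integer case specifically, a minimal sketch using parse-column (the same call shape appears in the linked test and in the snippets below):

(col/parse-column :int32 (col/new-column :a ["1" "2" "3"]))
;; expected => a column of datatype :int32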

genmeblog commented 4 years ago

Another related example: how do I convert columns between types without losing missing values?

I tried to use a reader, but failed:

(col/missing (col/parse-column :int16 (col/new-column :a ["1" nil "3"])))
;; => #{1}

(col/missing (col/new-column :int16 (dtype/->reader (col/parse-column :int16 (col/new-column :a ["1" nil "3"])) :int16)))
;; => #{}

I know I can add the missing set manually, but:

(col/new-column :int32 (dtype/->reader (col/parse-column :int16 (col/new-column :a ["1" nil "3"])) :int16) {} #{1})
;; => #tech.ml.dataset.column<int16>[3]
;;    :int32
;;    [1, -32768, 3, ]

Above, the missing value at position 1 has the value -32768 (Short/MIN_VALUE, the :int16 sentinel) instead of Integer/MIN_VALUE.

genmeblog commented 4 years ago

So generally I need a relatively easy way to:

  1. Concat columns with different datatypes (let's say numerical)
  2. Convert a column between datatypes without losing missing values
cnuernber commented 4 years ago

Got it. Thanks, this is really helpful.

cnuernber commented 4 years ago

Concat with columns of different datatypes, and changing the datatype of a column without changing the missing set (two ways).

Both required changes, so this was a really solid issue. We're definitely going to see more of these, as datatypes and type systems are hard.
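
Assuming the fixes behave as described, re-running the original snippet should now widen regardless of order (a hypothetical check, not verified output):

(def col1 (col/new-column :int32 (dtype/->reader [1 2 3 4] :int32)))
(def col2 (col/new-column :int64 (dtype/->reader [1 2 3 4 (dec Long/MAX_VALUE)] :int64)))
(col/new-column :result (concat-rdr/concat-readers (map dtype/->reader [col1 col2])))
;; hoped-for => #tech.ml.dataset.column<int64>[9]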

genmeblog commented 4 years ago

Thanks! It works much better! Another case:

(def data (ds/->dataset [{:a "a" :b 0}]))
(map meta (ds/columns data))
;; => ({:categorical? true, :name :a, :size 1, :datatype :string} {:name :b, :size 1, :datatype :int64})
(def data-concat (ds/concat data data))
(map meta (ds/columns data-concat))
;; => ({:categorical? true, :name :a, :size 2, :datatype :object} {:name :b, :size 2, :datatype :int64})

After concat, :string columns become :object. Also, 0 is :int64; maybe it should be :int16? (It's not a big issue.)

genmeblog commented 4 years ago

Another case: the datetime type is lost after concat:

(def data (ds/->dataset "data/family.csv"))
;; => data/family.csv [5 5]:
;;    | family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
;;    |--------+------------+------------+---------------+---------------|
;;    |      1 | 1998-11-26 | 2000-01-29 |             1 |             2 |
;;    |      2 | 1996-06-22 |            |             2 |               |
;;    |      3 | 2002-07-11 | 2004-04-05 |             2 |             2 |
;;    |      4 | 2004-10-10 | 2009-08-27 |             1 |             1 |
;;    |      5 | 2000-12-05 | 2005-02-28 |             2 |             1 |

(map meta (ds/columns data))
;; => ({:name "family", :size 5, :datatype :int16}
;;     {:name "dob_child1", :size 5, :datatype :packed-local-date}
;;     {:name "dob_child2", :size 5, :datatype :packed-local-date}
;;     {:name "gender_child1", :size 5, :datatype :int16}
;;     {:name "gender_child2", :size 5, :datatype :int16})

(ds/concat data data)
;; => null [10 5]:
;;    | family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
;;    |--------+------------+------------+---------------+---------------|
;;    |      1 |  130943770 |  131072285 |             1 |             2 |
;;    |      2 |  130811414 |            |             2 |               |
;;    |      3 |  131204875 |  131335173 |             2 |             2 |
;;    |      4 |  131336714 |  131663899 |             1 |             1 |
;;    |      5 |  131075077 |  131400220 |             2 |             1 |
;;    |      1 |  130943770 |  131072285 |             1 |             2 |
;;    |      2 |  130811414 |            |             2 |               |
;;    |      3 |  131204875 |  131335173 |             2 |             2 |
;;    |      4 |  131336714 |  131663899 |             1 |             1 |
;;    |      5 |  131075077 |  131400220 |             2 |             1 |

(map meta (ds/columns (ds/concat data data)))
;; => ({:name "family", :size 10, :datatype :int16}
;;     {:name "dob_child1", :size 10, :datatype :int32}
;;     {:name "dob_child2", :size 10, :datatype :int32}
;;     {:name "gender_child1", :size 10, :datatype :int16}
;;     {:name "gender_child2", :size 10, :datatype :int16})
genmeblog commented 4 years ago

And another case:

(def data (ds/->dataset [{:a 1.0} {:a 2.0}]))
data
;; => _unnamed [2 1]:
;;    |    :a |
;;    |-------|
;;    | 1.000 |
;;    | 2.000 |

(map meta (ds/columns data))
;; => ({:name :a, :size 2, :datatype :float64})

(def data-updated (ds/update-column data :a #(map (fn ^double [^double in]
                                                    (if (< in 2.0) (- in) in)) %)))
data-updated
;; => _unnamed [2 1]:
;;    |     :a |
;;    |--------|
;;    | -1.000 |
;;    |  2.000 |

(map meta (ds/columns data-updated))
;; => ({:name :a, :size 2, :datatype :object})
cnuernber commented 4 years ago

The last one isn't directly an issue. Lazy seqs have no datatype; map returns something that erases all type information. But I do see what you are trying to do, and there should be a one-line way to do it. Currently there is not.

cnuernber commented 4 years ago

Those first two, however (dates, strings), are terrible ;-).

cnuernber commented 4 years ago

So, this is what I have so far:


user> (-> (ds/->dataset [{:a 1.0} {:a 2.0}])
               (ds/update-column
                :a
                #(dtype/typed-reader-map (fn ^double [^double in]
                                           (if (< in 2.0) (- in) in))
                                         %)))
_unnamed [2 1]:

|     :a |
|--------|
| -1.000 |
|  2.000 |

The result datatype is derived from the input column datatypes. If you want to ensure a particular datatype (because, for instance, your function is built for a particular datatype), then you will need to coerce all the input columns to that datatype. This does not address missing values at all; the result column will have no missing values, under the assumption that the map-fn fixes that problem.
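
For instance, a sketch (not verified) of forcing a :float32 result by coercing the input column before mapping:

(-> (ds/->dataset [{:a 1.0} {:a 2.0}])
    (ds/update-column :a #(dtype/->reader % :float32))   ;; coerce the input first
    (ds/update-column :a #(dtype/typed-reader-map (fn ^double [^double in]
                                                    (if (< in 2.0) (- in) in))
                                                  %)))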

Another option would be to provide a column map function that behaves like this but sets the missing set to the union of the input columns' missing sets. I am not certain which is correct, although I imagine the second one (union) is the correct pathway in general.

cnuernber commented 4 years ago

2nd attempted fix: https://github.com/techascent/tech.ml.dataset/commit/58e41571e98111b94cc05c67ac9157e47b5d2cdf

cnuernber commented 4 years ago

  1. Renamed column-map to column-name->column-map (as that is all it did).
  2. Added column-map, which does the union of missing values:

user> (-> (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}])
          (ds/column-map
           :summed
           (fn ^double [^double lhs ^double rhs]
             (+ lhs rhs))
           :a :b))
_unnamed [3 3]:

| :a |    :b | :summed |
|----+-------+---------|
|  1 |       |         |
|    | 2.000 |         |
|  2 | 3.000 |   5.000 |
user> (tech.ml.dataset.column/missing 
       (*1 :summed))
#{0,1}
genmeblog commented 4 years ago

The issue with lazy seqs appeared somewhere between beta-33 and beta-37. With beta-33 everything worked fine:

(def data (ds/->dataset [{:a 1.0} {:a 2.0}]))
data
;; => _unnamed [2 1]:
;;    |    :a |
;;    |-------|
;;    | 1.000 |
;;    | 2.000 |

(map meta (ds/columns data))
;; => ({:name :a, :size 2, :datatype :float64})

(def data-updated (ds/update-column data :a #(map (fn ^double [^double in]
                                                    (if (< in 2.0) (- in) in)) %)))
data-updated
;; => _unnamed [2 1]:
;;    |     :a |
;;    |--------|
;;    | -1.000 |
;;    |  2.000 |

(map meta (ds/columns data-updated))
;; => ({:name :a, :size 2, :datatype :float64})
genmeblog commented 4 years ago

And yes, it will be a tough road to tame all the cases :)

genmeblog commented 4 years ago

Oh, I just checked and the last example works with beta-39.

cnuernber commented 4 years ago

There was a subtle change in there however that we maybe should talk about.

I used to do a quick scan of things of type :object to see what datatype they contained. The algorithm was literally something like:

if the first thing is a number, make it double type. If the first thing is a string, make it string type, etc.
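
In code, the described scan amounts to something like this (an illustrative sketch, not the library's actual implementation):

(defn naive-infer-datatype
  ;; illustrative only: guess a datatype from the first element
  [xs]
  (let [x (first xs)]
    (cond
      (number? x) :float64
      (string? x) :string
      :else       :object)))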

That causes issues like a couple of yours (a long promoted to a double) and was causing a scan of columns during concat when it should not.

I really don't like scanning and inferring types: you either get a type error or a runtime performance issue once your datasets get big. And the good scanning/inferring algorithms live in the spreadsheet parsing system (when you can't expect your data to be columnar) or deep in the CSV parsing system (where you can), and they are involved, which means they would be tough to make fast.

So, if at all possible, it would be really good if types could be specified upon entry to operations rather than dynamically inferred during operations.

My feeling is that no matter how good the inference algorithm is, it won't be perfect; thus it will change, and these changes will cause type errors. A stupid type system that you can force manually is far easier to engineer than a great one, but it does mean the user types more.
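
For example, a sketch of specifying types on entry, assuming the :parser-fn option of ds/->dataset accepts a map of column name to datatype (check the parsing docs for your version):

(ds/->dataset "data/who.csv.gz"
              {:parser-fn {"new_sp_m014"  :int32
                           "new_sp_m1524" :int32}})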

genmeblog commented 4 years ago

You are definitely right. It will never be perfect. The question is where the balance is, since we have to deal with types in an untyped world. The most important thing IMHO is to make the system consistent. I opt for a stupid type system, but some paths should be looked after. It can be problematic when some columns become :object after certain operations.

Maybe we need an easy way to convert the type of a column? string->int, object->float, etc. Let the user decide what the type should be, and convert if it's possible.

Currently I've come up with this function (how do I do the :object conversion properly?):

(defn convert-column-type
  ;; multiple columns: a map of column-name -> new datatype
  ([ds coltype-map]
   (reduce (fn [ds [colname new-type]]
             (convert-column-type ds colname new-type)) ds coltype-map))
  ([ds colname new-type]
   (ds/add-or-update-column ds colname
                            (let [current-type (-> colname ds meta :datatype)]
                              (condp = current-type
                                ;; strings go through the parsing system
                                :string (col/parse-column new-type (ds colname))
                                ;; objects: stringify via a reader, then parse (is this the right way?)
                                :object (let [curr-col (ds colname)]
                                          (->> (col/new-column colname (dtype/->reader curr-col :string)
                                                               (meta curr-col) (col/missing curr-col))
                                               (col/parse-column new-type)))
                                ;; numeric types: cast via a reader, keeping metadata and missing set
                                (let [curr-col (ds colname)]
                                  (col/new-column colname (dtype/->reader curr-col new-type)
                                                  (meta curr-col) (col/missing curr-col))))))))
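
A hypothetical usage against the who dataset from earlier:

(convert-column-type (ds/->dataset "data/who.csv.gz")
                     {"new_sp_m014"  :int32
                      "new_sp_m1524" :int32})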
genmeblog commented 4 years ago

It doesn't belong here, but to give another example of inconsistent behaviour, this is how the dfn functions behave:

(dfn/log (dtype/->reader [1 2 3] :float64))
;; => [0.0 0.6931471805599453 1.0986122886681098]
(dfn/log (dtype/->reader [1 2 3] :object))
;; => [0.0 0.6931471805599453 1.0986122886681098]
(dfn/log (dtype/->reader [1 2 3] :int64))
;; -> exception

I would expect an exception on the :object type, but not on the int.

genmeblog commented 4 years ago

And another one:

DS
;; => _unnamed [9 3]:
;;    |   :V1 | :V2 | :V4 |
;;    |-------+-----+-----|
;;    | 1.000 |   0 |   A |
;;    | 4.000 |   0 |   B |
;;    | 1.000 |   0 |   C |
;;    | 4.000 |   4 |   A |
;;    | 1.000 |   5 |   B |
;;    | 4.000 |   6 |   C |
;;    | 1.000 |   7 |   A |
;;    | 4.000 |   8 |   B |
;;    | 1.000 |   9 |   C |
(map meta (ds/columns DS))
;; => ({:name :V1, :size 9, :datatype :float64}
;;     {:name :V2, :size 9, :datatype :object}
;;     {:name :V4, :size 9, :datatype :object})
(dfn/mean (DS :V2))
;; -> exception

(dfn/mean (dtype/->reader [1 2 3] :object))
;; => 2.0

When the column has the :object type, mean fails; when the reader has the :object type, mean works. The exception when mean is applied to an object column looks like this:

3. Unhandled java.util.concurrent.ExecutionException
   java.lang.ClassCastException

         ForkJoinTask.java: 1006  java.util.concurrent.ForkJoinTask/get
                   for.clj:   39  tech.parallel.for/indexed-map-reduce/fn
                  core.clj: 2753  clojure.core/map/fn
              LazySeq.java:   42  clojure.lang.LazySeq/sval
              LazySeq.java:   51  clojure.lang.LazySeq/seq
                   RT.java:  535  clojure.lang.RT/seq
                  core.clj:  137  clojure.core/seq
                  core.clj: 3133  clojure.core/dorun
                  core.clj: 3133  clojure.core/dorun
                   for.clj:   40  tech.parallel.for/indexed-map-reduce
                   for.clj:   -1  tech.parallel.for/indexed-map-reduce
                   for.clj:   42  tech.parallel.for/indexed-map-reduce
                   for.clj:   10  tech.parallel.for/indexed-map-reduce
             fast_copy.clj:  118  tech.v2.datatype.fast-copy/parallel-nio-write!
             fast_copy.clj:  110  tech.v2.datatype.fast-copy/parallel-nio-write!
               RestFn.java:  442  clojure.lang.RestFn/invoke
                    io.clj:   70  tech.v2.datatype.io/dense-copy!
                    io.clj:   25  tech.v2.datatype.io/dense-copy!
                    io.clj:  106  tech.v2.datatype.io/eval17092/fn
              MultiFn.java:  239  clojure.lang.MultiFn/invoke
                 base.cljc:  219  tech.v2.datatype.base$copy_BANG_/invokeStatic
                 base.cljc:  203  tech.v2.datatype.base$copy_BANG_/invoke
                 array.clj:  217  tech.v2.datatype.array/add-numeric-array-constructor/fn
                 array.clj:  261  tech.v2.datatype.array/make-array-of-type
                 array.clj:  255  tech.v2.datatype.array/make-array-of-type
                 array.clj:  269  tech.v2.datatype.array/eval27510/fn
              MultiFn.java:  244  clojure.lang.MultiFn/invoke
                 base.cljc:  100  tech.v2.datatype.base$make_container/invokeStatic
                 base.cljc:   95  tech.v2.datatype.base$make_container/invoke
                 base.cljc:  166  tech.v2.datatype.base$__GT_double_array/invokeStatic
                 base.cljc:  161  tech.v2.datatype.base$__GT_double_array/invoke
            statistics.clj:   72  tech.v2.datatype.statistics/descriptive-stats
            statistics.clj:   67  tech.v2.datatype.statistics/descriptive-stats
               RestFn.java:  423  clojure.lang.RestFn/invoke
            statistics.clj:  101  tech.v2.datatype.statistics/mean
            statistics.clj:  101  tech.v2.datatype.statistics/mean

2. Caused by java.lang.ClassCastException
   (No message)

NativeConstructorAccessorImpl.java:   -2  jdk.internal.reflect.NativeConstructorAccessorImpl/newInstance0
NativeConstructorAccessorImpl.java:   62  jdk.internal.reflect.NativeConstructorAccessorImpl/newInstance
DelegatingConstructorAccessorImpl.java:   45  jdk.internal.reflect.DelegatingConstructorAccessorImpl/newInstance
          Constructor.java:  490  java.lang.reflect.Constructor/newInstance
         ForkJoinTask.java:  603  java.util.concurrent.ForkJoinTask/getThrowableException
         ForkJoinTask.java: 1006  java.util.concurrent.ForkJoinTask/get
                   for.clj:   39  tech.parallel.for/indexed-map-reduce/fn
                  core.clj: 2753  clojure.core/map/fn
              LazySeq.java:   42  clojure.lang.LazySeq/sval
              LazySeq.java:   51  clojure.lang.LazySeq/seq
                   RT.java:  535  clojure.lang.RT/seq
                  core.clj:  137  clojure.core/seq
                  core.clj: 3133  clojure.core/dorun
                  core.clj: 3133  clojure.core/dorun
                   for.clj:   40  tech.parallel.for/indexed-map-reduce
                   for.clj:   -1  tech.parallel.for/indexed-map-reduce
                   for.clj:   42  tech.parallel.for/indexed-map-reduce
                   for.clj:   10  tech.parallel.for/indexed-map-reduce
             fast_copy.clj:  118  tech.v2.datatype.fast-copy/parallel-nio-write!
             fast_copy.clj:  110  tech.v2.datatype.fast-copy/parallel-nio-write!
               RestFn.java:  442  clojure.lang.RestFn/invoke
                    io.clj:   70  tech.v2.datatype.io/dense-copy!
                    io.clj:   25  tech.v2.datatype.io/dense-copy!
                    io.clj:  106  tech.v2.datatype.io/eval17092/fn
              MultiFn.java:  239  clojure.lang.MultiFn/invoke
                 base.cljc:  219  tech.v2.datatype.base$copy_BANG_/invokeStatic
                 base.cljc:  203  tech.v2.datatype.base$copy_BANG_/invoke
                 array.clj:  217  tech.v2.datatype.array/add-numeric-array-constructor/fn
                 array.clj:  261  tech.v2.datatype.array/make-array-of-type
                 array.clj:  255  tech.v2.datatype.array/make-array-of-type
                 array.clj:  269  tech.v2.datatype.array/eval27510/fn
              MultiFn.java:  244  clojure.lang.MultiFn/invoke
                 base.cljc:  100  tech.v2.datatype.base$make_container/invokeStatic
                 base.cljc:   95  tech.v2.datatype.base$make_container/invoke
                 base.cljc:  166  tech.v2.datatype.base$__GT_double_array/invokeStatic
                 base.cljc:  161  tech.v2.datatype.base$__GT_double_array/invoke
            statistics.clj:   72  tech.v2.datatype.statistics/descriptive-stats
            statistics.clj:   67  tech.v2.datatype.statistics/descriptive-stats
               RestFn.java:  423  clojure.lang.RestFn/invoke
            statistics.clj:  101  tech.v2.datatype.statistics/mean
            statistics.clj:  101  tech.v2.datatype.statistics/mean

1. Caused by java.lang.ClassCastException
   class
   tech.v2.datatype.array$extend_object_array_type$fn$reify__27533
   cannot be cast to class tech.v2.datatype.DoubleReader
   (tech.v2.datatype.array$extend_object_array_type$fn$reify__27533 is
   in unnamed module of loader clojure.lang.DynamicClassLoader
   @2fc079e; tech.v2.datatype.DoubleReader is in unnamed module of
   loader 'app')

             fast_copy.clj:  118  tech.v2.datatype.fast-copy/parallel-nio-write!/fn
             fast_copy.clj:   -1  tech.v2.datatype.fast-copy/parallel-nio-write!/fn
                   for.clj:   37  tech.parallel.for/indexed-map-reduce/fn/callable
                  AFn.java:   18  clojure.lang.AFn/call
         ForkJoinTask.java: 1448  java.util.concurrent.ForkJoinTask$AdaptedCallable/exec
         ForkJoinTask.java:  290  java.util.concurrent.ForkJoinTask/doExec
         ForkJoinPool.java: 1020  java.util.concurrent.ForkJoinPool$WorkQueue/topLevelExec
         ForkJoinPool.java: 1656  java.util.concurrent.ForkJoinPool/scan
         ForkJoinPool.java: 1594  java.util.concurrent.ForkJoinPool/runWorker
 ForkJoinWorkerThread.java:  177  java.util.concurrent.ForkJoinWorkerThread/run

The :V2 column was created by the operations below:

DS
;; => _unnamed [9 3]:
;;    |   :V1 | :V2 | :V4 |
;;    |-------+-----+-----|
;;    | 1.000 |   1 |   A |
;;    | 4.000 |   2 |   B |
;;    | 1.000 |   3 |   C |
;;    | 4.000 |   4 |   A |
;;    | 1.000 |   5 |   B |
;;    | 4.000 |   6 |   C |
;;    | 1.000 |   7 |   A |
;;    | 4.000 |   8 |   B |
;;    | 1.000 |   9 |   C |

(map meta (ds/columns DS))
;; => ({:name :V1, :size 9, :datatype :float64}
;;     {:name :V2, :size 9, :datatype :int64}
;;     {:name :V4, :size 9, :datatype :object})

(map meta (ds/columns (ds/update-column DS :V2 #(map (fn [^long v]
                                                       (if (< v 4) 0 v)) %))))
;; => ({:name :V1, :size 9, :datatype :float64}
;;     {:name :V2, :size 9, :datatype :object}
;;     {:name :V4, :size 9, :datatype :object})
cnuernber commented 4 years ago

This is super helpful. You are hitting some subtle architectural issues where I am not certain I have the right answer. I will respond with more later.

One thing I tried to do was use the Clojure numerics stack when the datatype was :object, as this allows its auto-promotion rules to work. So the definition of '/' depends on the datatype, with the system falling back to Clojure's number tower when the datatype is :object, as there are some things the number tower handles pretty well IMO.

genmeblog commented 4 years ago

Maybe let the user see just a :number datatype regardless of the actual representation, then? Maybe this could simplify things? :number, :boolean, :string, :datetime, and :object should be enough to have most of the stuff working. I also think that users of the library shouldn't care too much about what is optimal or not. A log function should work on a number, and that's all. If the type is :object, let the user decide about the conversion. Nothing more is needed, I suppose, if we consider types. Just thinking...

genmeblog commented 4 years ago

OMG, I just found dtype/set-datatype to change the type of a column :)

edit: but it doesn't try to coerce strings to numbers

cnuernber commented 4 years ago

Nope. It does, however, let you coerce an object column to a numeric column. It is roughly equivalent to static_cast in C++.

genmeblog commented 4 years ago

Looks like I don't understand conversions... dtype/set-datatype only sets the column type, without conversion. (dtype/->reader [1 2] :string) also doesn't convert the seq to strings. So I still have the same question open:

How to convert column to different type?

Some of the conversions work, some don't. :int32 -> :string doesn't work; :int32 -> :float64 works. :/

(def c (col/new-column :a [1 2 3]))
c
;; => #tech.ml.dataset.column<object>[3]
;;    :a
;;    [1, 2, 3, ]

(type (first c))
;; => java.lang.Long

(dtype/->reader c :float64)
;; => [1.0 2.0 3.0]

(def cf (dtype/set-datatype c :float64))
cf
;; => #tech.ml.dataset.column<float64>[3]
;;    :a
;;    [1.000, 2.000, 3.000, ]

(type (first cf))
;; => java.lang.Double

(def cs (dtype/set-datatype c :string))
cs
;; => #tech.ml.dataset.column<string>[3]
;;    :a
;;    [1, 2, 3, ]

;; ?????
(type (first cs))
;; => java.lang.Long
cnuernber commented 4 years ago

I think the best I can do here is explain why it is this way, and then we can talk at a higher level about what you really want.

Yes, set-datatype may really be too subtle for a public API. If the data is numeric (packed-local-date, int, float, etc.) then set-datatype will ensure the column creates the correct reader type with the correct missing-value type. This works because tech.datatype knows about casts between numeric datatypes. It doesn't, however, know about conversions between strings and numbers, or strings and dates, or any of the infinite number of possible objects.

parse-column specifically converts from strings to other datatypes. The rules are the same as in Clojure: you can type (int a) and get an integer from a double, but you can't type (int a) if a is a string. Conversion from a string to something else is a different operation.

At a base level you want to be able to quickly change the column datatype and have something that understands these rules, or optionally provide a function to use for the conversion. parse-column is close here, as you can specify the result datatype along with a function that may return :missing or :parse-failure where the conversion fails. set-datatype is more like a C or C++ cast, which of course is just going to confuse people (but is necessary in the internals).

In this way, parse-column is close to a map operation where you specify the datatype of the result column, as opposed to my first attempt at column-map, which derives the result datatype from the input column datatypes.

We are getting close to something good here. Also, I just fixed the issue where (dfn/log)-type ops on integers produce errors. There is potentially another set of errors for things like bitwise-and when used on floating point numbers (should it round, floor, or ceil?). For sin, log, etc. I think casting to a float or double is fine, as the chance of data loss is minimal. But for operations that are meant to happen only in integer space, I think it should force the user to cast, as casting from float to integer absolutely involves loss of data.

In order for any of this to be sane we need a solid way to convert column datatypes where you can specify the result datatype, optionally a conversion-fn, and whether a conversion error results in missing values or in an exception that propagates. That is slightly different than column-map or parse-column or set-datatype :-).

This is really great; I feel like we are making solid progress on the stickiest part of the problem, which is types and type conversions. What we don't have, I think, are good ways to force the type with customized conversion routines and defined outcomes for missing values and for conversion failures. That is really close to parse-column. The question is: do you want this to be lazy, in which case parse failures have to propagate, or do you want it to be immediate and take a perf hit from a potentially needless iteration over the entire column?

A workaround is get-or-create-column, where you create a new column (ds-col/new-column) with a reader of a defined type (dtype/make-reader; here you can do the conversion) and you set the missing values explicitly as part of the new-column pathway.
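
A sketch of that workaround using only calls already shown in this thread, reusing the data dataset from the family.csv example above (hedged; the conversion happens inside the reader, and the missing set is carried over explicitly):

(let [src     (data "family")   ;; "family" is :int16 in the family.csv example
      missing (col/missing src)]
  (col/new-column (:name (meta src))
                  (dtype/->reader src :int32)   ;; the conversion happens in the reader
                  (meta src)
                  missing))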

genmeblog commented 4 years ago

Thanks for the deep explanation; I think I've got some intuition about this now. If something new comes up to discuss we can move to Zulip (I didn't want to spam the Zulip stream with a bunch of code snippets). Yeah, we are close :) I see only two missing pieces:

  1. What you've proposed above: let the user provide a conversion function and advertise the type. If the datatype lib accepts the result, the conversion is done; if not, an exception is thrown and the user has to fix the conversion.
  2. Auto-conversion to :string: a straightforward application of str (see the sketch below).

I agree that conversion from float -> int is not possible without user involvement. Possibly others too.
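
For piece 2, a hypothetical helper (hedged: mapv forces the conversion eagerly, and the missing set is carried over unchanged):

(defn column->string
  ;; hypothetical: stringify a column, keeping its name, metadata and missing set
  ;; note: values at missing indexes end up as the stringified sentinel values
  [col-val]
  (col/new-column (:name (meta col-val))
                  (dtype/->reader (mapv str col-val) :string)
                  (meta col-val)
                  (col/missing col-val)))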

Thanks for pushing this forward so quickly!

cnuernber commented 4 years ago

A new release is out that has a new column-cast function.

This is a casting power tool. For instance, it will use the parsing system on string columns if a cast-fn isn't provided. It will do the right thing with packed-local-date columns (convert them to local dates) before a conversion happens, so your cast-fn doesn't have to understand packed-vs-unpacked data.

The function also parallelizes work when possible, so things like (Math/round (double %)) are fast. Basically this is a generic function from any datatype to a destination datatype where you can provide the cast function. It is not at all lazy, so if it succeeds you know you have packed data in the datatype you chose.
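
A usage sketch, assuming column-cast takes a dataset, a column name, and either a target datatype or a [datatype cast-fn] tuple (check the release docs):

;; string column -> :int32 via the parsing system
(-> (ds/->dataset [{:a "1"} {:a "2"}])
    (ds/column-cast :a :int32))

;; float column -> :int64 with an explicit cast-fn
(-> (ds/->dataset [{:a 1.3} {:a 2.7}])
    (ds/column-cast :a [:int64 #(Math/round (double %))]))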

genmeblog commented 4 years ago

column-cast looks great! Thank you! I hope we are done here :)