Closed genmeblog closed 4 years ago
Dataset who
is here: https://github.com/genmeblog/techtest/tree/master/data
API is updated so you can see my implementation of pivot->longer
Hmm, everything you tried should work I guess with automatic widening casts. This i think implies at least two issues.
You can try this snippet.
(->> '("new_sp_m014" "new_sp_m1524")
(ds/select-columns (ds/->dataset "data/who.csv.gz"))
(ds/columns)
(map (comp :datatype meta)))
;; => (:int16 :int32)
(ds/columnwise-concat (ds/->dataset "data/who.csv.gz") '("new_sp_m014" "new_sp_m1524"))
1. Unhandled java.lang.IllegalArgumentException
Value out of range for short: -2147483648
RT.java: 1173 clojure.lang.RT/shortCast
casting.clj: 183 tech.v2.datatype.casting/int16-cast
casting.clj: 183 tech.v2.datatype.casting/int16-cast
column.clj: 166 tech.ml.dataset.impl.column.Column/__GT_reader
concat.clj: 84 tech.v2.datatype.readers.concat/fn/fn
core.clj: 6912 clojure.core/mapv/fn
ArrayChunk.java: 63 clojure.lang.ArrayChunk/reduce
protocols.clj: 136 clojure.core.protocols/fn
protocols.clj: 124 clojure.core.protocols/fn
protocols.clj: 19 clojure.core.protocols/fn/G
protocols.clj: 31 clojure.core.protocols/seq-reduce
protocols.clj: 75 clojure.core.protocols/fn
protocols.clj: 75 clojure.core.protocols/fn
protocols.clj: 13 clojure.core.protocols/fn/G
core.clj: 6828 clojure.core/reduce
core.clj: 6903 clojure.core/mapv
core.clj: 6903 clojure.core/mapv
concat.clj: 84 tech.v2.datatype.readers.concat/fn
concat.clj: 84 tech.v2.datatype.readers.concat/fn
concat.clj: 141 tech.v2.datatype.readers.concat/concat-readers
concat.clj: 127 tech.v2.datatype.readers.concat/concat-readers
concat.clj: 146 tech.v2.datatype.readers.concat/concat-readers
concat.clj: 127 tech.v2.datatype.readers.concat/concat-readers
base.clj: 509 tech.ml.dataset.base/concat/fn
core.clj: 2755 clojure.core/map/fn
LazySeq.java: 42 clojure.lang.LazySeq/sval
LazySeq.java: 51 clojure.lang.LazySeq/seq
Cons.java: 39 clojure.lang.Cons/next
RT.java: 653 clojure.lang.RT/countFrom
RT.java: 643 clojure.lang.RT/count
column.clj: 321 tech.ml.dataset.column/ensure-column-seq
column.clj: 302 tech.ml.dataset.column/ensure-column-seq
dataset.clj: 191 tech.ml.dataset.impl.dataset/new-dataset
dataset.clj: 181 tech.ml.dataset.impl.dataset/new-dataset
dataset.clj: 220 tech.ml.dataset.impl.dataset/new-dataset
dataset.clj: 181 tech.ml.dataset.impl.dataset/new-dataset
dataset.clj: 131 tech.ml.dataset.impl.dataset.Dataset/from_prototype
base.clj: 507 tech.ml.dataset.base/concat
base.clj: 481 tech.ml.dataset.base/concat
RestFn.java: 139 clojure.lang.RestFn/applyTo
core.clj: 665 clojure.core/apply
core.clj: 660 clojure.core/apply
dataset.clj: 257 tech.ml.dataset/columnwise-concat
dataset.clj: 216 tech.ml.dataset/columnwise-concat
dataset.clj: 269 tech.ml.dataset/columnwise-concat
dataset.clj: 216 tech.ml.dataset/columnwise-concat
When columns are swapped, different exception is thrown:
(ds/columnwise-concat (ds/->dataset "data/who.csv.gz") '("new_sp_m1524" "new_sp_m014"))
1. Unhandled java.lang.Exception
Column :value has mismatched datatypes: #{:int32 :int16}
base.clj: 513 tech.ml.dataset.base/concat/fn
core.clj: 2755 clojure.core/map/fn
LazySeq.java: 42 clojure.lang.LazySeq/sval
LazySeq.java: 51 clojure.lang.LazySeq/seq
Cons.java: 39 clojure.lang.Cons/next
RT.java: 653 clojure.lang.RT/countFrom
RT.java: 643 clojure.lang.RT/count
column.clj: 321 tech.ml.dataset.column/ensure-column-seq
column.clj: 302 tech.ml.dataset.column/ensure-column-seq
dataset.clj: 191 tech.ml.dataset.impl.dataset/new-dataset
dataset.clj: 181 tech.ml.dataset.impl.dataset/new-dataset
dataset.clj: 220 tech.ml.dataset.impl.dataset/new-dataset
dataset.clj: 181 tech.ml.dataset.impl.dataset/new-dataset
dataset.clj: 131 tech.ml.dataset.impl.dataset.Dataset/from_prototype
base.clj: 507 tech.ml.dataset.base/concat
base.clj: 481 tech.ml.dataset.base/concat
RestFn.java: 139 clojure.lang.RestFn/applyTo
core.clj: 665 clojure.core/apply
core.clj: 660 clojure.core/apply
dataset.clj: 257 tech.ml.dataset/columnwise-concat
dataset.clj: 216 tech.ml.dataset/columnwise-concat
dataset.clj: 269 tech.ml.dataset/columnwise-concat
dataset.clj: 216 tech.ml.dataset/columnwise-concat
Also, how to convert :string
column/reader to a integer? (if there are numbers as strings)
tech.ml.dataset.column/parse-column
:
https://github.com/techascent/tech.ml.dataset/blob/master/test/tech/ml/dataset/parse_test.clj#L295
I am working through a bunch of open city data and their column formats are changing weekly so I have a set of columns I force to string and then have to look and and figure out a parsing method for them.
Another related example. How to convert columns between types and do not to lose missing values?
Tried to use reader, but failed:
(col/missing (col/parse-column :int16 (col/new-column :a ["1" nil "3"])))
;; => #{1}
(col/missing (col/new-column :int16 (dtype/->reader (col/parse-column :int16 (col/new-column :a ["1" nil "3"])) :int16)))
;; => #{}
I know I can add missing but:
(col/new-column :int32 (dtype/->reader (col/parse-column :int16 (col/new-column :a ["1" nil "3"])) :int16) {} #{1})
;; => #tech.ml.dataset.column<int16>[3]
;; :int32
;; [1, -32768, 3, ]
Above missing value at position 1
has -32768 value (instead of Int/MIN_VALUE
)
So generally I need relatively easy way to:
Got it. Thanks, this is really helpful.
Concat with columns of different datatypes and changing the datatype of a column without changing the missing set (2 ways).
Both required changes so this was a really solid issue. Definitely going to see more of these as datatypes and typesystems are hard.
Thanks! Works much better! Another case:
(def data (ds/->dataset [{:a "a" :b 0}]))
(map meta (ds/columns data))
;; => ({:categorical? true, :name :a, :size 1, :datatype :string} {:name :b, :size 1, :datatype :int64})
(def data-concat (ds/concat data data))
(map meta (ds/columns data-concat))
;; => ({:categorical? true, :name :a, :size 2, :datatype :object} {:name :b, :size 2, :datatype :int64})
After concat :string
columns become :object
.
Also 0
is :int64
maybe it should be :int16
? (It's not big issue),
Another case, datetime type is lost after concat:
(def data (ds/->dataset "data/family.csv"))
;; => data/family.csv [5 5]:
;; | family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
;; |--------+------------+------------+---------------+---------------|
;; | 1 | 1998-11-26 | 2000-01-29 | 1 | 2 |
;; | 2 | 1996-06-22 | | 2 | |
;; | 3 | 2002-07-11 | 2004-04-05 | 2 | 2 |
;; | 4 | 2004-10-10 | 2009-08-27 | 1 | 1 |
;; | 5 | 2000-12-05 | 2005-02-28 | 2 | 1 |
(map meta (ds/columns data))
;; => ({:name "family", :size 5, :datatype :int16}
;; {:name "dob_child1", :size 5, :datatype :packed-local-date}
;; {:name "dob_child2", :size 5, :datatype :packed-local-date}
;; {:name "gender_child1", :size 5, :datatype :int16}
;; {:name "gender_child2", :size 5, :datatype :int16})
(ds/concat data data)
;; => null [10 5]:
;; | family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
;; |--------+------------+------------+---------------+---------------|
;; | 1 | 130943770 | 131072285 | 1 | 2 |
;; | 2 | 130811414 | | 2 | |
;; | 3 | 131204875 | 131335173 | 2 | 2 |
;; | 4 | 131336714 | 131663899 | 1 | 1 |
;; | 5 | 131075077 | 131400220 | 2 | 1 |
;; | 1 | 130943770 | 131072285 | 1 | 2 |
;; | 2 | 130811414 | | 2 | |
;; | 3 | 131204875 | 131335173 | 2 | 2 |
;; | 4 | 131336714 | 131663899 | 1 | 1 |
;; | 5 | 131075077 | 131400220 | 2 | 1 |
(map meta (ds/columns (ds/concat data data)))
;; => ({:name "family", :size 10, :datatype :int16}
;; {:name "dob_child1", :size 10, :datatype :int32}
;; {:name "dob_child2", :size 10, :datatype :int32}
;; {:name "gender_child1", :size 10, :datatype :int16}
;; {:name "gender_child2", :size 10, :datatype :int16})
and another case:
(def data (ds/->dataset [{:a 1.0} {:a 2.0}]))
data
;; => _unnamed [2 1]:
;; | :a |
;; |-------|
;; | 1.000 |
;; | 2.000 |
(map meta (ds/columns data))
;; => ({:name :a, :size 2, :datatype :float64})
(def data-updated (ds/update-column data :a #(map (fn ^double [^double in]
(if (< in 2.0) (- in) in)) %)))
data-updated
;; => _unnamed [2 1]:
;; | :a |
;; |--------|
;; | -1.000 |
;; | 2.000 |
(map meta (ds/columns data-updated))
;; => ({:name :a, :size 2, :datatype :object})
The last one isn't directly an issue. Lazy seq's have no datatype; map returns something that erases all type information but I do see what you are trying to do and there should be a one-line way to do it. Currently there is not.
Those first two, however, (dates, strings) are terrible ;-).
So, this is what I have so far:
user> (-> (ds/->dataset [{:a 1.0} {:a 2.0}])
(ds/update-column
:a
#(dtype/typed-reader-map (fn ^double [^double in]
(if (< in 2.0) (- in) in))
%)))
_unnamed [2 1]:
| :a |
|--------|
| -1.000 |
| 2.000 |
The result datatype is derived from the input column datatypes. If you want to ensure a particular datatype (because for instance your function is built for a particular datatype) then you will need to coerce all the input columns to that datatype. This does not address missing values at all and the result column will have no missing values under the assumption that the map-fn fixes that problem.
Another option would be to provide a column map function that behaves like this but sets the missing set to the union of the input column missing sets. I am not certain which is correct although I imagine the second one (union) is the correct pathway in general.
user> (-> (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}])
(ds/column-map
:summed
(fn ^double [^double lhs ^double rhs]
(+ lhs rhs))
:a :b))
_unnamed [3 3]:
| :a | :b | :summed |
|----+-------+---------|
| 1 | | |
| | 2.000 | |
| 2 | 3.000 | 5.000 |
user> (tech.ml.dataset.column/missing
(*1 :summed))
#{0,1}
The issue with lazy seq appeared somewhere between beta-33 and beta-37. When I used beta-33 everything worked fine:
(def data (ds/->dataset [{:a 1.0} {:a 2.0}]))
data
;; => _unnamed [2 1]:
;; | :a |
;; |-------|
;; | 1.000 |
;; | 2.000 |
(map meta (ds/columns data))
;; => ({:name :a, :size 2, :datatype :float64})
(def data-updated (ds/update-column data :a #(map (fn ^double [^double in]
(if (< in 2.0) (- in) in)) %)))
data-updated
;; => _unnamed [2 1]:
;; | :a |
;; |--------|
;; | -1.000 |
;; | 2.000 |
(map meta (ds/columns data))
;; => ({:name :a, :size 2, :datatype :float64})
and yes, this will tough road to tame all the cases :)
Oh, just checked and last example works with beta-39.
There was a subtle change in there however that we maybe should talk about.
I used to do a quick scan things of type :object to see what datatype they contained. The algorithm was literally something like:
if the first thing is a number, make it double type. If the first thing is a string, make it string type, etc.
That causes issues like a couple of yours (long promoted to double) and was causing a scan of columns when it should not during concat.
I really don't like scanning and inferring types as you either have a type error or you have a runtime performance issue once your datasets get big. And the good scanning/inferring algorithms are in the spreadsheet parsing system (when you can't expect your data to be columnar) or deep in the csv parsing system (where you do) and they are involved which means they would be tough to make fast.
So if at all possible it would be really good if types can be specified upon entry to operations rather than dynamically inferred during operations.
My feeling is that no matter how good the inference algorithm is, it won't be perfect and thus will change and these changes will cause type errors. A stupid type system and one you can force manually is far easier to engineer than a great type system but it does mean the user types more.
You are definitely right. It will never be perfect. The question is where is the balance since we have deal with types in untyped world? The most important thing IMHO is to make system consistent. I opt for stupid type system but some paths should be look after. It can be problematic when some columns start to be an :object
after certain operations.
Maybe we need the easy way to convert type of the column? string->int, object->float etc... And let the user decide what the type should be and convert if it's possible.
Currently I come up with such function (how to make :object
conversion properly?):
(defn convert-column-type
([ds coltype-map]
(reduce (fn [ds [colname new-type]]
(convert-column-type ds colname new-type)) ds coltype-map))
([ds colname new-type]
(ds/add-or-update-column ds colname
(let [current-type (-> colname ds meta :datatype)]
(condp = current-type
:string (col/parse-column new-type (ds colname))
:object (let [curr-col (ds colname)]
(->> (col/new-column colname (dtype/->reader curr-col :string)
(meta curr-col) (col/missing curr-col))
(col/parse-column new-type)))
(let [curr-col (ds colname)]
(col/new-column colname (dtype/->reader curr-col new-type)
(meta curr-col) (col/missing curr-col))))))))
It doesn't belong here but to give another example about inconsistent behaviour is how dfn
functions behave:
(dfn/log (dtype/->reader [1 2 3] :float64))
;; => [0.0 0.6931471805599453 1.0986122886681098]
(dfn/log (dtype/->reader [1 2 3] :object))
;; => [0.0 0.6931471805599453 1.0986122886681098]
(dfn/log (dtype/->reader [1 2 3] :int64))
;; -> exception
I would expect an exception on :object
type but not on the int
.
And another one:
DS
;; => _unnamed [9 3]:
;; | :V1 | :V2 | :V4 |
;; |-------+-----+-----|
;; | 1.000 | 0 | A |
;; | 4.000 | 0 | B |
;; | 1.000 | 0 | C |
;; | 4.000 | 4 | A |
;; | 1.000 | 5 | B |
;; | 4.000 | 6 | C |
;; | 1.000 | 7 | A |
;; | 4.000 | 8 | B |
;; | 1.000 | 9 | C |
(map meta (ds/columns DS))
;; => ({:name :V1, :size 9, :datatype :float64}
;; {:name :V2, :size 9, :datatype :object}
;; {:name :V4, :size 9, :datatype :object})
(dfn/mean (DS :V2))
;; -> exception
(dfn/mean (dtype/->reader [1 2 3] :object))
;; => 2.0
When column has the :object
type, mean
fails, when reader has the :object
type, mean
works.
The exception when mean is applied on object column looks like that:
3. Unhandled java.util.concurrent.ExecutionException
java.lang.ClassCastException
ForkJoinTask.java: 1006 java.util.concurrent.ForkJoinTask/get
for.clj: 39 tech.parallel.for/indexed-map-reduce/fn
core.clj: 2753 clojure.core/map/fn
LazySeq.java: 42 clojure.lang.LazySeq/sval
LazySeq.java: 51 clojure.lang.LazySeq/seq
RT.java: 535 clojure.lang.RT/seq
core.clj: 137 clojure.core/seq
core.clj: 3133 clojure.core/dorun
core.clj: 3133 clojure.core/dorun
for.clj: 40 tech.parallel.for/indexed-map-reduce
for.clj: -1 tech.parallel.for/indexed-map-reduce
for.clj: 42 tech.parallel.for/indexed-map-reduce
for.clj: 10 tech.parallel.for/indexed-map-reduce
fast_copy.clj: 118 tech.v2.datatype.fast-copy/parallel-nio-write!
fast_copy.clj: 110 tech.v2.datatype.fast-copy/parallel-nio-write!
RestFn.java: 442 clojure.lang.RestFn/invoke
io.clj: 70 tech.v2.datatype.io/dense-copy!
io.clj: 25 tech.v2.datatype.io/dense-copy!
io.clj: 106 tech.v2.datatype.io/eval17092/fn
MultiFn.java: 239 clojure.lang.MultiFn/invoke
base.cljc: 219 tech.v2.datatype.base$copy_BANG_/invokeStatic
base.cljc: 203 tech.v2.datatype.base$copy_BANG_/invoke
array.clj: 217 tech.v2.datatype.array/add-numeric-array-constructor/fn
array.clj: 261 tech.v2.datatype.array/make-array-of-type
array.clj: 255 tech.v2.datatype.array/make-array-of-type
array.clj: 269 tech.v2.datatype.array/eval27510/fn
MultiFn.java: 244 clojure.lang.MultiFn/invoke
base.cljc: 100 tech.v2.datatype.base$make_container/invokeStatic
base.cljc: 95 tech.v2.datatype.base$make_container/invoke
base.cljc: 166 tech.v2.datatype.base$__GT_double_array/invokeStatic
base.cljc: 161 tech.v2.datatype.base$__GT_double_array/invoke
statistics.clj: 72 tech.v2.datatype.statistics/descriptive-stats
statistics.clj: 67 tech.v2.datatype.statistics/descriptive-stats
RestFn.java: 423 clojure.lang.RestFn/invoke
statistics.clj: 101 tech.v2.datatype.statistics/mean
statistics.clj: 101 tech.v2.datatype.statistics/mean
2. Caused by java.lang.ClassCastException
(No message)
NativeConstructorAccessorImpl.java: -2 jdk.internal.reflect.NativeConstructorAccessorImpl/newInstance0
NativeConstructorAccessorImpl.java: 62 jdk.internal.reflect.NativeConstructorAccessorImpl/newInstance
DelegatingConstructorAccessorImpl.java: 45 jdk.internal.reflect.DelegatingConstructorAccessorImpl/newInstance
Constructor.java: 490 java.lang.reflect.Constructor/newInstance
ForkJoinTask.java: 603 java.util.concurrent.ForkJoinTask/getThrowableException
ForkJoinTask.java: 1006 java.util.concurrent.ForkJoinTask/get
for.clj: 39 tech.parallel.for/indexed-map-reduce/fn
core.clj: 2753 clojure.core/map/fn
LazySeq.java: 42 clojure.lang.LazySeq/sval
LazySeq.java: 51 clojure.lang.LazySeq/seq
RT.java: 535 clojure.lang.RT/seq
core.clj: 137 clojure.core/seq
core.clj: 3133 clojure.core/dorun
core.clj: 3133 clojure.core/dorun
for.clj: 40 tech.parallel.for/indexed-map-reduce
for.clj: -1 tech.parallel.for/indexed-map-reduce
for.clj: 42 tech.parallel.for/indexed-map-reduce
for.clj: 10 tech.parallel.for/indexed-map-reduce
fast_copy.clj: 118 tech.v2.datatype.fast-copy/parallel-nio-write!
fast_copy.clj: 110 tech.v2.datatype.fast-copy/parallel-nio-write!
RestFn.java: 442 clojure.lang.RestFn/invoke
io.clj: 70 tech.v2.datatype.io/dense-copy!
io.clj: 25 tech.v2.datatype.io/dense-copy!
io.clj: 106 tech.v2.datatype.io/eval17092/fn
MultiFn.java: 239 clojure.lang.MultiFn/invoke
base.cljc: 219 tech.v2.datatype.base$copy_BANG_/invokeStatic
base.cljc: 203 tech.v2.datatype.base$copy_BANG_/invoke
array.clj: 217 tech.v2.datatype.array/add-numeric-array-constructor/fn
array.clj: 261 tech.v2.datatype.array/make-array-of-type
array.clj: 255 tech.v2.datatype.array/make-array-of-type
array.clj: 269 tech.v2.datatype.array/eval27510/fn
MultiFn.java: 244 clojure.lang.MultiFn/invoke
base.cljc: 100 tech.v2.datatype.base$make_container/invokeStatic
base.cljc: 95 tech.v2.datatype.base$make_container/invoke
base.cljc: 166 tech.v2.datatype.base$__GT_double_array/invokeStatic
base.cljc: 161 tech.v2.datatype.base$__GT_double_array/invoke
statistics.clj: 72 tech.v2.datatype.statistics/descriptive-stats
statistics.clj: 67 tech.v2.datatype.statistics/descriptive-stats
RestFn.java: 423 clojure.lang.RestFn/invoke
statistics.clj: 101 tech.v2.datatype.statistics/mean
statistics.clj: 101 tech.v2.datatype.statistics/mean
1. Caused by java.lang.ClassCastException
class
tech.v2.datatype.array$extend_object_array_type$fn$reify__27533
cannot be cast to class tech.v2.datatype.DoubleReader
(tech.v2.datatype.array$extend_object_array_type$fn$reify__27533 is
in unnamed module of loader clojure.lang.DynamicClassLoader
@2fc079e; tech.v2.datatype.DoubleReader is in unnamed module of
loader 'app')
fast_copy.clj: 118 tech.v2.datatype.fast-copy/parallel-nio-write!/fn
fast_copy.clj: -1 tech.v2.datatype.fast-copy/parallel-nio-write!/fn
for.clj: 37 tech.parallel.for/indexed-map-reduce/fn/callable
AFn.java: 18 clojure.lang.AFn/call
ForkJoinTask.java: 1448 java.util.concurrent.ForkJoinTask$AdaptedCallable/exec
ForkJoinTask.java: 290 java.util.concurrent.ForkJoinTask/doExec
ForkJoinPool.java: 1020 java.util.concurrent.ForkJoinPool$WorkQueue/topLevelExec
ForkJoinPool.java: 1656 java.util.concurrent.ForkJoinPool/scan
ForkJoinPool.java: 1594 java.util.concurrent.ForkJoinPool/runWorker
ForkJoinWorkerThread.java: 177 java.util.concurrent.ForkJoinWorkerThread/run
The :V2 column was created in below operations:
DS
;; => _unnamed [9 3]:
;; | :V1 | :V2 | :V4 |
;; |-------+-----+-----|
;; | 1.000 | 1 | A |
;; | 4.000 | 2 | B |
;; | 1.000 | 3 | C |
;; | 4.000 | 4 | A |
;; | 1.000 | 5 | B |
;; | 4.000 | 6 | C |
;; | 1.000 | 7 | A |
;; | 4.000 | 8 | B |
;; | 1.000 | 9 | C |
(map meta (ds/columns DS))
;; => ({:name :V1, :size 9, :datatype :float64}
;; {:name :V2, :size 9, :datatype :int64}
;; {:name :V4, :size 9, :datatype :object})
(map meta (ds/columns (ds/update-column DS :V2 #(map (fn [^long v]
(if (< v 4) 0 v)) %))))
;; => ({:name :V1, :size 9, :datatype :float64}
;; {:name :V2, :size 9, :datatype :object}
;; {:name :V4, :size 9, :datatype :object})
This is super helpful. You are getting some subtle architectural issues where I am not certain I have a right answer. Will respond with more later.
One thing I tried to do was use the Clojure numerics stack when the datatype was :object as this allows its auto-promotion rules to work. So the definitely of '/' depends on the datatype with the system falling back to Clojure's number tower when the datatype is :object as there are some things that the number tower does handle pretty well IMO.
Maybe let user see just :number
datatype regardless actual representation then? Maybe this can simplify things? :number
, :boolean
, :string
, :datetime
and :object
should be enough to have most of the stuff working.
I think also that users of the library shouldn't care too much about what is optimal or not. log
function? should work on number and that's all. If type is an :object
- let user decide about the conversion. Nothing more is needed I suppose if we consider types.
Just thinking...
Omg just found dtype/set-datatype
to change type of column :)
edit: but it doesn't try to coerce string to number
Nope, it does let you, however, coerce an object column to a numeric column. It is roughly equivalent to static-cast
in c++.
Looks that I don't understand conversions... dtype/set-datatype
sets only column type without conversion. dtype/->reader [1 2] :string
also doesn't converts seq to strings. So still I have the same question open:
How to convert column to different type?
Some of the conversion work, some doesn't. :int32
-> :string
doesn't work, :int32
-> :float64
works. :/
(def c (col/new-column :a [1 2 3]))
c
;; => #tech.ml.dataset.column<object>[3]
;; :a
;; [1, 2, 3, ]
(type (first c))
;; => java.lang.Long
(dtype/->reader c :float64)
;; => [1.0 2.0 3.0]
(def cf (dtype/set-datatype c :float64))
cf
;; => #tech.ml.dataset.column<float64>[3]
;; :a
;; [1.000, 2.000, 3.000, ]
(type (first cf))
;; => java.lang.Double
(def cs (dtype/set-datatype c :string))
cs
;; => #tech.ml.dataset.column<string>[3]
;; :a
;; [1, 2, 3, ]
;; ?????
(type (first cs))
;; => java.lang.Long
I think the best I can do here is just explain why it is and we can talk about a higher level of what you really want.
Yes, set-datatype be too subtle really for a public API. If the data is numeric (packed-local-date, int float, etc) then set datatype will ensure the column creates the correct reader type with the correct missing value type. This works because tech.datatype knows about casts between numeric datatypes. It doesn't however, know about conversions between string and numbers or strings and dates or any of the infinite number of possible objects.
parse-column
specifically converts from strings to other datatypes. The rules are the same as in clojure, you can type (int a) and get an integer from a double but you can't type (int a) if a is a string. Conversion from string to something else is a different operation.
At a base level you want to be able to quickly change the column datatype and have something that understands these rules or optionally provide a function to use for the conversion. parse-column is close here as you can specify the result datatype and a function that may return :missing or :parse-failure for where the conversion fails. set-datatype
is more like a c or c++ cast which of course is just going to confuse people (but is necessary in the internals).
In this way, 'parse-column' is close to a map operation where you specify the resulting datatype of the result column as opposed to my first attempt at column-map
which derives the result datatype from the input column datatypes.
We are getting close to something good here. Also I just fixed the (dfn/log) type ops on integers produce errors. There are potentially another set of errors for things like bitwise-and when used for floating point numbers (should it round, should it floor, should it ceil?) . For sin,log,etc I think casting to a float or double is fine as the chance of loss of data is minimal. But for operations that are meant to happen only in integer space I think it should force the user to cast as casting from float to integer absolutely involves loss of data.
In order for any of this to be sane we need a solid way to convert column datatypes where you can specify the result datatype and optionally a conversion-fn and whether a conversion error results in missing values or if it results in an exception the propagates. That is slightly different than column-map or parse-column or set-datatype :-).
This is really great, I feel like we are making really solid progress with the stickiest part of the problem which is types and type conversions. What we don't have I think are good ways to force the type that have customized conversion routines with defined outcomes in the case of missing values and in the case where the conversion routine fails. That is really close to parse-column. The issue is that do you want this to be lazy in which case parse failures have to propagate or do you want this to be immediate and take perf hit from potentially a needless iteration over the entire column.
A work around is get-or-create-column where you create a new column (ds-col/new-column) with a reader of a defined type (dtype/make-reader; here you can do the conversion) and you set the missing values explicitly as part of the new-column pathway.
Thanks for deep explanation I think I've got some intuition about this. If something new will appear to be discussed we can move to Zulip (I didn't want to spam Zulip stream with bunch of code snippets). Yeah, we are close :) I see two missing pieces only:
datatype
lib accepts the result - conversion is done, if not - exception is thrown and user has to fix the conversion.:string
- straighforward application of str
.I agree that conversion from float -> int is not possible without user involvment. Possibly other.
Thanks for pushing this forward so quickly!
New release out that has a new column-cast
function.
This is a casting power tool. It will use the parsing system on string columns if cast-fn isn't provided, for instance. It will do the right thing with packed-local-date columns (convert them to local dates) before a conversion happens so your cast-fn doesn't have to understand packed-vs-unpacked data.
The function also parallelizes work when possible so things like (Math/round (double %))
are fast. Basically this is a generic function of (any datatype)->dest-datatype where you can provide the cast function. It is not at all lazy so you know if it succeeds then you have packed data in the datatype that you chose.
column-cast
looks great! Thank you! I hope we are done here :)
How to concat two columns with two datatypes, say
:int32
and:int64
I tried through
ds/concat
but it's not possible. I tried through the readers but the order is important and first datatype is selected (instead of the wider one).The issue is. I'm working on reshape enhancement (pivot-longer. Enhancement is ready and all examples work but one. There is a dataset
who
where you have 56 columns to be joined. They are read fromcsv
file and some of the columns get:int16
some get:int32
type. No option to join them. Even with (select-column-names
is my helper):