techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

unexpected behavior for missing data #260

Closed damianr99 closed 3 years ago

damianr99 commented 3 years ago

Not sure if this is intentional, but I found this behavior surprising:

(dfn/+ (:a (ds/->dataset {:a [1 2 nil 4]})) 1)
[2 3 -9223372036854775807 5]

(dfn/+ (:a (ds/->dataset {:a [1 2.0 nil 4]})) 1)
[2.0 3.0 ##NaN 5.0]

Possibly related: missing values don't sort. I'd expect them to move to either the front or the end, but they don't budge:

(ds/sort-by-column (ds/->dataset {:a [1 ##NaN 2 ##Inf nil 4 ##-Inf]}) :a <)
_unnamed [7 1]:

|   :a |
|------|
| -Inf |
|  NaN |
|  1.0 |
|  2.0 |
|      |
|  4.0 |
|  Inf |

(ds/sort-by-column (ds/->dataset {:a [1 ##NaN 2 ##Inf nil 4 ##-Inf]}) :a >)
_unnamed [7 1]:

|   :a |
|------|
|  Inf |
|  NaN |
|  4.0 |
|  2.0 |
|      |
|  1.0 |
| -Inf |

This is using version "6.010". Thanks!

cnuernber commented 3 years ago

These are great.

Both are valid points, and the first especially is irksome and potentially corrupting.

damianr99 commented 3 years ago

Apologies, I didn't see the recommendation to clear out missing values first. I was copying from the examples (e.g. https://techascent.github.io/tech.ml.dataset/walkthrough.html#elementwise-operations). A number of the dataset examples in the documentation drop down to using tech.v3.datatype functions. It's a little unclear for a new user what the pitfalls of doing that are.
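For readers landing here, a minimal sketch of what "clearing out missing values first" might look like before dropping down to tech.v3.datatype.functional. It assumes the `ds/drop-missing` and `ds/replace-missing` functions from `tech.v3.dataset` are available in the version in use; the var names `raw`, `dropped`, and `filled` and the fill value 0 are illustrative only.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

;; Same integer column with one missing value as in the report above.
(def raw (ds/->dataset {:a [1 2 nil 4]}))

;; Option 1: drop every row that has a missing value, then do the math.
(def dropped (ds/drop-missing raw))
(def dropped-plus-one (dfn/+ (:a dropped) 1))

;; Option 2: replace missing values in column :a with an explicit value
;; (0 here is an arbitrary choice), then do the math.
(def filled (ds/replace-missing raw [:a] :value 0))
(def filled-plus-one (dfn/+ (:a filled) 1))
```

Either way, the column handed to `dfn/+` no longer contains missing entries, so the integer sentinel never takes part in the arithmetic.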

cnuernber commented 3 years ago

I don't know if that recommendation is documented; it has just been discussed on Zulip. There certainly are pitfalls :-).

Here are some things that may help this situation:

I think we can mitigate the first issue by adding a protocol method to dtype-next, operational-elemwise-datatype, alongside elemwise-datatype. The distinction is that some containers may have to advertise a more general datatype than the specific container type in order to correctly interpret both the values that elemwise-datatype can represent and the values it cannot (such as missing values).

For numeric types, the operational datatype would be :float64 if there were missing values; otherwise it would match the actual datatype. We would then update the code in dispatch.clj to respect this, and at least all of the math operations in tech.v3.datatype.functional would work as correctly as possible with missing values.
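As a rough illustration of the rule just described, here is a sketch in plain Clojure. It is not the dtype-next protocol method itself: the function below, its two-argument signature, and the `numeric-datatypes` set are hypothetical stand-ins that only encode the "widen to :float64 when anything is missing" decision.

```clojure
;; Hypothetical set of datatype keywords treated as numeric for this sketch.
(def numeric-datatypes
  #{:int8 :uint8 :int16 :uint16 :int32 :uint32 :int64 :uint64
    :float32 :float64})

(defn operational-elemwise-datatype
  "Hypothetical helper: returns the datatype math operations should use.
  A numeric container with at least one missing value is widened to
  :float64 so the missing value can be carried as NaN; everything else
  keeps its actual elemwise datatype."
  [elemwise-datatype n-missing]
  (if (and (pos? n-missing)
           (contains? numeric-datatypes elemwise-datatype))
    :float64
    elemwise-datatype))

;; An :int64 column with one missing value would operate as :float64,
;; while the same column without missing values stays :int64.
(operational-elemwise-datatype :int64 1)
(operational-elemwise-datatype :int64 0)
```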

For the second issue (sorting of nil values), perhaps we add a new sort option, {:missing-policy #{:first :last :exception}}, which defaults to whatever pandas does. Then at least the result format will be standardized.
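To make the three policies concrete, here is a sketch over a plain Clojure vector in which nil stands in for a missing value. The function name `sort-with-missing` and its argument order are hypothetical; this is not how `ds/sort-by-column` is implemented. For reference, pandas `sort_values` places missing values last by default.

```clojure
(defn sort-with-missing
  "Hypothetical helper: sorts the non-nil values with `comparator` and
  places the nils according to `missing-policy`, one of :first, :last,
  or :exception."
  [values comparator missing-policy]
  (let [present   (sort comparator (remove nil? values))
        n-missing (count (filter nil? values))
        missing   (repeat n-missing nil)]
    (case missing-policy
      :first     (vec (concat missing present))
      :last      (vec (concat present missing))
      :exception (if (pos? n-missing)
                   (throw (ex-info "Missing values present in sort input"
                                   {:n-missing n-missing}))
                   (vec present)))))

(sort-with-missing [1 nil 4 2] < :last)   ;; => [1 2 4 nil]
(sort-with-missing [1 nil 4 2] > :first)  ;; => [nil 4 2 1]
```

With this shape, the position of missing values no longer depends on the comparator, which is the standardization the option is meant to provide.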

Finally, the documentation could really be improved here, especially for first-time users. I think the tablecloth project is much further along this path, and that is the current focus of the scicloj team.