unexpected behavior for missing data

damianr99 commented 3 years ago

Not sure if this is intentional, but I found this behavior surprising:

(dfn/+ (:a (ds/->dataset {:a [1 2 nil 4]})) 1)
[2 3 -9223372036854775807 5]

(dfn/+ (:a (ds/->dataset {:a [1 2.0 nil 4]})) 1)
[2.0 3.0 ##NaN 5.0]

Possibly related: missing values don't sort. I'd expect them to move either to the front or to the end, but they don't budge:

(ds/sort-by-column (ds/->dataset {:a [1 ##NaN 2 ##Inf nil 4 ##-Inf]}) :a <)
_unnamed [7 1]:

|   :a |
|------|
| -Inf |
|  NaN |
|  1.0 |
|  2.0 |
|      |
|  4.0 |
|  Inf |

(ds/sort-by-column (ds/->dataset {:a [1 ##NaN 2 ##Inf nil 4 ##-Inf]}) :a >)
_unnamed [7 1]:

|   :a |
|------|
|  Inf |
|  NaN |
|  4.0 |
|  2.0 |
|      |
|  1.0 |
| -Inf |

This is using version "6.010". Thanks!

cnuernber commented 3 years ago

These are great.

The first is half-intentional: dfn is a much lower-level namespace than the dataset column namespace, and it has no knowledge of missing values. The official recommendation is to clear out missing values before you start to do numeric processing on the dataset. The deeper fix would be to have the dtype-next architecture know about missing values and use float64 or object space if the column has any, since either nil or NaN is a valid missing value there. Because there is no :int64 equivalent of NaN, I use Long/MIN_VALUE when I have to write a long for a missing value into an array of data, and dfn picks this up. There would be similar issues for any of the other integer types. The tack I took for tmdjs is that all math for numeric columns is done in float64 space, so NaN is always an option; this issue therefore has a solid answer, at least for the ClojureScript version.
The second (sorting) is definitely confusing and I agree with your analysis: especially when sorting by column, missing values should go either first or last. We should check pandas and do whatever they do.

Both are valid points, and the first especially is irksome and potentially corrupting.
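
To make the sentinel behaviour described above concrete, here is a minimal sketch in plain Clojure with no dataset library involved. It assumes only what the comment states, namely that a missing :int64 entry is stored as Long/MIN_VALUE; `add-to-present` is a hypothetical helper for illustration, not a dfn or dataset function.

```clojure
;; Assumption (from the comment above): a missing :int64 entry is stored
;; as Long/MIN_VALUE because there is no integer equivalent of NaN.
(def missing-long Long/MIN_VALUE)

;; Arithmetic that is unaware of the sentinel turns it into an
;; ordinary-looking long, which is the value reported in the issue.
(def sentinel-plus-one (+ missing-long 1))

;; In float64 space NaN propagates through arithmetic, so the gap stays visible.
(def nan-plus-one (+ Double/NaN 1))

;; Hypothetical helper: clear out nil entries before doing the math,
;; in the spirit of the "clear out missing values first" recommendation.
(defn add-to-present
  [xs n]
  (mapv #(+ % n) (remove nil? xs)))

(println sentinel-plus-one)               ; -9223372036854775807
(println (Double/isNaN nan-plus-one))     ; true
(println (add-to-present [1 2 nil 4] 1))  ; [2 3 5]
```

Note that dropping entries changes the length of the result, so whether to drop or replace missing values depends on the analysis.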

damianr99 commented 3 years ago

Apologies, I didn't see the recommendation to clear out missing values first. I was copying from the examples (e.g. https://techascent.github.io/tech.ml.dataset/walkthrough.html#elementwise-operations). A number of the dataset examples in the documentation drop down to using tech.v3.datatype functions. It's a little unclear for a new user what the pitfalls of doing that are.

cnuernber commented 3 years ago

I don't know if that recommendation is documented; it has just been discussed on Zulip. There certainly are pitfalls :-).

Here are some things that may help this situation:

I think we can mitigate the first issue by adding a protocol method to dtype-next, operational-elemwise-datatype, alongside elemwise-datatype. The distinction is that some containers may have to advertise a more general datatype than the specific container type in order to correctly interpret both values that can be represented by the elemwise-datatype and values that cannot be.

For numeric types, the operational datatype would be :float64 if there were missing values; otherwise it would match the actual datatype. We would then update the code in dispatch.clj to respect this, and at least all of the math operations in tech.v3.datatype.functional would work as correctly as possible with missing values.
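
As a rough illustration of the idea, the following plain-Clojure sketch widens a numeric container to :float64 only when it has missing entries. It is not the dtype-next protocol or the dispatch.clj code; the function names `operational-datatype`, `widen`, and `add-scalar`, and the use of nil to mark a missing entry, are assumptions made for this example.

```clojure
;; Hypothetical sketch of the proposal, not the dtype-next implementation.
(defn operational-datatype
  "Returns :float64 when there are missing entries, otherwise the
  container's own datatype."
  [container-datatype missing-count]
  (if (pos? missing-count) :float64 container-datatype))

(defn widen
  "Converts xs to the operational datatype. When that is :float64,
  nil (missing) becomes NaN and every other value becomes a double."
  [container-datatype xs]
  (let [op-dtype (operational-datatype container-datatype
                                       (count (filter nil? xs)))]
    {:datatype op-dtype
     :values   (if (= op-dtype :float64)
                 (mapv #(if (nil? %) Double/NaN (double %)) xs)
                 (vec xs))}))

(defn add-scalar
  "Adds n to every element after widening to the operational datatype."
  [container-datatype xs n]
  (update (widen container-datatype xs) :values
          (fn [vs] (mapv #(+ % n) vs))))

;; With a missing entry the math runs in float64 space and NaN marks the gap.
(println (add-scalar :int64 [1 2 nil 4] 1))
;; Without missing entries the original integer datatype is kept.
(println (add-scalar :int64 [1 2 3] 1))
```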

For the second (sorting of nil values), perhaps we add a new option for sort, {:missing-policy #{:first :last :exception}}, which defaults to whatever pandas does; then at least the result format will be standardized.
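
A minimal sketch of how such a policy could behave, written in plain Clojure over a vector rather than a dataset column. The names `missing-value?` and `sort-with-missing` are hypothetical, and treating both nil and NaN as missing is an assumption of this example. For reference, pandas' `sort_values` places missing values last by default (`na_position='last'`).

```clojure
;; Hypothetical helper: nil and NaN both count as missing in this sketch.
(defn missing-value?
  [x]
  (or (nil? x)
      (and (number? x) (Double/isNaN (double x)))))

(defn sort-with-missing
  "Sorts the present values of xs with cmp (e.g. < or >) and places the
  missing ones according to policy: :first, :last, or :exception."
  [xs cmp policy]
  (let [missing (filter missing-value? xs)
        present (sort cmp (remove missing-value? xs))]
    (case policy
      :first     (vec (concat missing present))
      :last      (vec (concat present missing))
      :exception (if (seq missing)
                   (throw (ex-info "Column contains missing values"
                                   {:missing-count (count missing)}))
                   (vec present)))))

;; Missing entries move to the end regardless of the comparison used.
(println (sort-with-missing [1 ##NaN 2 ##Inf nil 4 ##-Inf] < :last))
(println (sort-with-missing [1 ##NaN 2 ##Inf nil 4 ##-Inf] > :last))
```

Separating the missing entries before calling `sort` keeps NaN out of the comparator entirely, which avoids the case where a NaN stays in place because every comparison against it is false.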

And finally, the documentation could really be improved here, especially for first-time users. I think the tablecloth project is much further along this pathway, and that is the current focus of the scicloj team.

techascent / tech.ml.dataset #260