techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0
679 stars 35 forks source link

replace-missing strategy :mid result is same as :down #184

Closed kimim closed 3 years ago

kimim commented 3 years ago

In below test, when replace-missing with :mid strategy, I expected to get middle value between two valid numbers. But the result shows that :down strategy is used:

  (let [ds (ds/->dataset {:a [nil 2 nil 4 nil 6 nil]
                          :b [3 nil 6 nil 9 nil 12]
                          :c [nil "A" nil "B" nil "C" nil]
                          :d ["A" nil "B" nil "C" nil "D"]
                          })
        ds' (ds/replace-missing ds :mid)]
    (is (= [2 2 3 4 5 6 6] (vec (ds' :a))))
    (is (= [3 5 6 8 9 11 12] (vec (ds' :b))))
    (is (= ["A" "A" "A" "B" "B" "C" "C"] (vec (ds' :c))))
    (is (= ["A" "A" "B" "B" "C" "C" "D"] (vec (ds' :d))))
    ds')
;; => _unnamed [7 4]:

| :a | :b | :c | :d |
|----|----|----|----|
|  2 |  3 |  A |  A |
|  2 |  3 |  A |  A |
|  2 |  6 |  A |  B | 
|  4 |  6 |  B |  B |
|  4 |  9 |  B |  C |
|  6 |  9 |  C |  C |
|  6 | 12 |  C |  D |

  (let [ds (ds/->dataset {:a [nil 2. nil 4. nil 6. nil]
                          :b [3. nil 6. nil 9. nil 12.]
                          :c [nil "A" nil "B" nil "C" nil]
                          :d ["A" nil "B" nil "C" nil "D"]
                          })
        ds' (ds/replace-missing ds :mid)]
    (is (= [2.0 2.0 3.0 4.0 5.0 6.0 6.0] (vec (ds' :a))))
    (is (= [3.0 5.0 6.0 7.5 9.0 10.5 12.0] (vec (ds' :b))))
    (is (= ["A" "A" "A" "B" "B" "C" "C"] (vec (ds' :c))))
    (is (= ["A" "A" "B" "B" "C" "C" "D"] (vec (ds' :d))))
    ds')
;; => _unnamed [7 4]:

|  :a |   :b | :c | :d |
|-----|------|----|----|
| 2.0 |  3.0 |  A |  A |
| 2.0 |  3.0 |  A |  A |
| 2.0 |  6.0 |  A |  B |
| 4.0 |  6.0 |  B |  B |
| 4.0 |  9.0 |  B |  C |
| 6.0 |  9.0 |  C |  C |
| 6.0 | 12.0 |  C |  D |
cnuernber commented 3 years ago

I agree there are are a valid issues here.

Hmm, it seems to me that it isn't exactly down. I am not sure what mid should do with strings but it seems for strings it does default to 'down' which seems OK to me. For the integer and floating point numbers it just looks like it is not exactly right as the number in between 3,6 is 4.5, not 5.

For integer columns I think mid should first convert the column to floating point space and then work in there. This means that replacing missing on integer columns using the :mid pathway will result in columns changing datatypes which in this case I think is more correct than ending with the completely wrong answer.

So I think there are three separate issues that should be addressed:

  1. Integer columns should be coerced to floating point columns if mid is used. This happens elsewhere for instance if you apply tech.v3.datatype.functional/log to anything it will result in a floating point representation.
  2. Floating point midpoints are wrong in some cases.
  3. What is the desired behavior if the column is not numeric? Is down a reasonable default?

Lots of good stuff here; great issue. I really appreciate this careful consideration.

cnuernber commented 3 years ago

I added a test that I think covers the intended effects of :mid across a range of datatypes.

It's up in the latest release.

genmeblog commented 3 years ago

Now I have regression bugs because of this.

Strategy :mid wasn't a "middle value between two valid numbers". It was meant to be combined :up and :down and drag values to meet in the middle.

Example:

;  2 > 2 > 2 > 2 3 < 3 < 3 < 3
[1 2 nil nil nil nil nil nil 3]
;               ^
;               | a mid point

;; result 

[1 2 2 2 2 3 3 3 3]

There was no strategy as described here. I would revert :mid to the original one and introduced new called :midval or something like that.

Another bug is in strategy :lerp. int type results in float.

My failing test cases:

More examples: https://scicloj.github.io/tablecloth/index.html#replace

@kimim replace-missing was originally in tablecloth and was moved to tech.ml.dataset at some point.

cnuernber commented 3 years ago

Ah, sorry, i will revert changes mid changes and go back to original behavior.

I changed lerp to work in float space to be mathematically correct, same as if you call dfn/log or something like that. I think this is reasonable; what do you think?

kimim commented 3 years ago

Now I understand the :mid strategy. Sorry for the mistaken issue.

genmeblog commented 3 years ago

Hmmm... The question is if preserving the datatype is expected or not. But I don't have a strong opinion on that and lerp can change datatype for me.

What about :midval - I think this is also great strategy to consider.

cnuernber commented 3 years ago

I agree abour 'midval'.

kimim commented 3 years ago

how about :center for :mid, and :mid for :midval?

genmeblog commented 3 years ago

:mid was my own invention, it's not any naming standard, so it can be changed.

kimim commented 3 years ago

I do not have strong bias. Both works for me. Thanks.

cnuernber commented 3 years ago

:mid corresponds to :nearest in left-join-asof. :midpoint or :midval as was suggested make the most sense to me.

If no one strongly objects, I will change :mid to :nearest and add :midval.

genmeblog commented 3 years ago

Perfect!

jsa-aerial commented 3 years ago

I'm not going to strongly object, but I tend to favor RH's notion that one should always try to never change things in a non backward compatible way. So in this case, I would add :nearest and note that :mid is an alias. I'm not going to object, because I don't think this will affect me :)

cnuernber commented 3 years ago

That is a good point. I will keep :mid :-). I also agree that we should not change things in non-backwards-compatible ways.

cnuernber commented 3 years ago

Done and documented

kimim commented 3 years ago

That's great. Glad to see that tech.ml.dataset is better and better everyday.

genmeblog commented 3 years ago

Yeah! :) tablecloth is updated too.