techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

Maybe a more generic `replace-missing` interface? #355

Closed kxygk closed 1 year ago

kxygk commented 1 year ago

I'm just getting started with tech.ml.dataset, so it's possible, as Steve Jobs says, I'm "holding it wrong".

I have data with holes and I'd like to fill them somehow. The problem is that the methods in replace-missing are kind of primitive. They seem to work only with very basic sorted data (like a time series, and even then the interpolator doesn't take any time dimension into account). The bigger issue is that the given interface doesn't seem to offer a clear way to extend it with one's own methods. From what I understand, it's just a list of baked-in methods.

My data isn't meaningfully sorted. It's essentially just a list of N dimension vectors. So in my case I'd like to go to each nil, take its row and find the row which both has a non-nil in the same column and is closest in terms of Cartesian distance (across all features/columns) and then copy that value over.
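The procedure described above can be sketched in plain Clojure. This is only an illustration over a vector of row maps, not the tech.ml.dataset API; the names `sq-distance`, `nearest-donor-value` and `impute-nearest` are hypothetical, and skipping nil columns when measuring distance is an assumption, since the description does not say how to handle them.

```clojure
;; Sketch only: plain Clojure over a vector of row maps (hypothetical helpers).

(defn- sq-distance
  "Squared Euclidean distance over the columns in `ks` where both rows have a
  value. Returns nil when the rows share no usable column.
  Assumption: columns that are nil in either row are skipped."
  [ks a b]
  (let [shared (filter #(and (some? (get a %)) (some? (get b %))) ks)]
    (when (seq shared)
      (reduce + (map (fn [k]
                       (let [d (- (double (get a k)) (double (get b k)))]
                         (* d d)))
                     shared)))))

(defn nearest-donor-value
  "Value of column `k` taken from the row closest to `row`, considering only
  rows that have a non-nil value in `k`. Returns nil if there is no such row."
  [rows ks row k]
  (let [others     (remove #{k} ks)
        candidates (for [r rows
                         :when (some? (get r k))
                         :let [d (sq-distance others row r)]
                         :when d]
                     [d (get r k)])]
    (when (seq candidates)
      (second (apply min-key first candidates)))))

(defn impute-nearest
  "Fills every nil cell from its nearest donor row. Donors are always looked up
  in the original `rows`, so the fill order does not matter."
  [rows]
  (let [ks (distinct (mapcat keys rows))]
    (mapv (fn [row]
            (reduce (fn [acc k]
                      (if (nil? (get row k))
                        (assoc acc k (nearest-donor-value rows ks row k))
                        acc))
                    row
                    ks))
          rows)))

;; Usage: the third row is closest to the first one, so it borrows its :z.
(def example-rows
  [{:x 0.0  :y 0.0  :z 1.0}
   {:x 10.0 :y 10.0 :z 5.0}
   {:x 1.0  :y 1.0  :z nil}])

(impute-nearest example-rows)
```

This brute-force version compares every row with a hole against every other row, which is fine for small data but quadratic in the number of rows.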

I was wondering if it'd be possible to expose a more generic interface. It'd just be a function that iterates over the nils in the data set and the user would pass in a functor of the form:

(fn [dataset row nil-row-idx nil-col-idx] ;; user calculates a new value for the nil )

I'd assume the dataset would always be the original one - so that the nil replacing order doesn't matter. Maybe it'd already be filtered to rows that are non-nil in the same columns... I'll admit it's a bit of a half baked idea
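To make the proposal concrete, here is a minimal sketch of what such an interface could look like. It is written in plain Clojure over a vector of row maps and is not part of tech.ml.dataset; `replace-missing-with` and `column-mean` are hypothetical names, and the column is identified by its key rather than a numeric index.

```clojure
;; Hypothetical helper, not part of tech.ml.dataset.
(defn replace-missing-with
  "Calls (f rows row row-idx col-key) for every nil cell and stores the result.
  `f` always receives the original `rows`, so the fill order does not matter."
  [rows f]
  (let [ks (distinct (mapcat keys rows))]
    (vec
     (map-indexed
      (fn [row-idx row]
        (reduce (fn [acc k]
                  (if (nil? (get row k))
                    (assoc acc k (f rows row row-idx k))
                    acc))
                row
                ks))
      rows))))

;; Example user function: fill a hole with the mean of the column's known values.
(defn column-mean
  [rows _row _row-idx k]
  (let [xs (keep #(get % k) rows)]
    (when (seq xs)
      (/ (reduce + xs) (double (count xs))))))

(def data
  [{:a 1.0 :b nil}
   {:a nil :b 4.0}
   {:a 3.0 :b 6.0}])

(replace-missing-with data column-mean)
;; => [{:a 1.0, :b 5.0} {:a 2.0, :b 4.0} {:a 3.0, :b 6.0}]
```

The user function decides everything about the replacement value; the helper only handles the iteration over the holes.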

Or maybe the current interface has some way for me to implement what I'm looking to do and I just don't get it (I think I will try to do it with row-map first).

harold commented 1 year ago

Welcome in! Thanks for the feedback.

> I'd like to go to each nil, take its row and find the row which both has a non-nil in the same column and is closest in terms of Cartesian distance (across all features/columns) and then copy that value over.

Yes, the first way I'd try this is with row-map.

Thinking about it for a few seconds, the nearest queries may be complicated by the presence of the nils, but I suspect it's manageable.

I don't know of any name for the procedure you're describing here. Do you know if it has a name?

Definitely interested to see what you come up with. Report back when you try it.

genmeblog commented 1 year ago

Generally replace-missing works on only one column at a time (even if you select multiple columns, it will go through them one by one). As I described on Reddit, this function provides the functionality found in Pandas/Dplyr, so it should be enough for most cases. And yes, this function is not extensible in the way you've described.

Anyway, here is the possible approach.

  1. To get the indexes of rows with missing values, call tech.v3.dataset.column/missing on a column. Missing values are kept in a RoaringBitmap structure, a compressed set of integer indexes with some nice features. For example, .previousAbsentValue and .nextAbsentValue will give you the index of the nearest non-missing row in either direction.
  2. With the above you can find the value matching your criteria.
  3. [undocumented] replace-missing with the :value strategy accepts a map of index-to-value pairs, so you can replace any set of missing values in one call.
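;; aliases used below (assumed: tc is tablecloth.api, col is tech.v3.dataset.column)
(require '[tablecloth.api :as tc]
         '[tech.v3.dataset.column :as col])

(def DSm2 (tc/dataset {:a [nil nil nil 1.0 2  nil nil nil nil  nil 4   nil  11 nil nil]
                       :b [2   2   2 nil nil nil nil nil nil 13   nil   3  4  5 5]}))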

DSm2
;; => _unnamed [15 2]:
;;    |   :a | :b |
;;    |-----:|---:|
;;    |      |  2 |
;;    |      |  2 |
;;    |      |  2 |
;;    |  1.0 |    |
;;    |  2.0 |    |
;;    |      |    |
;;    |      |    |
;;    |      |    |
;;    |      |    |
;;    |      | 13 |
;;    |  4.0 |    |
;;    |      |  3 |
;;    | 11.0 |  4 |
;;    |      |  5 |
;;    |      |  5 |

;; indexes of missing values
(col/missing (DSm2 :a)) ;; => {0,1,2,5,6,7,8,9,11,13,14}
(col/missing (DSm2 :b)) ;; => {3,4,5,6,7,8,10}

(class (col/missing (DSm2 :a))) ;; => org.roaringbitmap.RoaringBitmap

;; index of the nearest non-missing value in column `:a` starting from 0
(.nextAbsentValue (col/missing (DSm2 :a)) 0) ;; => 3
;; there is no previous non-missing
(.previousAbsentValue (col/missing (DSm2 :a)) 0) ;; => -1

;; replace some missing values by hand
(tc/replace-missing DSm2 :a :value {0 100 1 -100 14 -1000})
;; => _unnamed [15 2]:
;;    |      :a | :b |
;;    |--------:|---:|
;;    |   100.0 |  2 |
;;    |  -100.0 |  2 |
;;    |         |  2 |
;;    |     1.0 |    |
;;    |     2.0 |    |
;;    |         |    |
;;    |         |    |
;;    |         |    |
;;    |         |    |
;;    |         | 13 |
;;    |     4.0 |    |
;;    |         |  3 |
;;    |    11.0 |  4 |
;;    |         |  5 |
;;    | -1000.0 |  5 |
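
The three steps can be combined into one small function. The following is a sketch under assumptions, not a tested recipe: `fill-from-next` is a hypothetical name, the aliases `tc` (tablecloth.api) and `col` (tech.v3.dataset.column) are assumed as in the snippet above, and the criterion used here (copy the next non-missing value in the same column) is only a stand-in for whatever criterion you actually need.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.dataset.column :as col])
(import '[org.roaringbitmap RoaringBitmap])

(defn fill-from-next
  "Replaces each missing value in `colname` with the next non-missing value
  further down the same column. Holes with nothing after them stay missing."
  [ds colname]
  (let [column (ds colname)
        ^RoaringBitmap missing (col/missing column)
        n      (tc/row-count ds)
        ;; step 1 + 2: for every missing index, find the donor index
        fills  (into {}
                     (keep (fn [idx]
                             (let [nxt (.nextAbsentValue missing (int idx))]
                               (when (< nxt n)
                                 [(long idx) (nth column nxt)]))))
                     missing)]
    ;; step 3: hand the index->value map to the :value strategy
    (tc/replace-missing ds colname :value fills)))

(def DSm2 (tc/dataset {:a [nil nil nil 1.0 2  nil nil nil nil  nil 4   nil  11 nil nil]
                       :b [2   2   2 nil nil nil nil nil nil 13   nil   3  4  5 5]}))

(fill-from-next DSm2 :a)
```

For column `:a` of DSm2 the computed map should pair indexes 0 to 2 with 1.0, 5 to 9 with 4.0 and 11 with 11.0, while rows 13 and 14 have no later value and are left alone. Swapping the body of the `keep` function is where a custom criterion, such as a nearest-row search, would go.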

cnuernber commented 1 year ago

@kxygk - have you solved your issues with the dataset library? Another approach would be to transform the dataset into a tensor and then create a compute tensor that does the right thing when the x,y location has a missing value.

kxygk commented 1 year ago

Thanks for checking in!

I did manage to accomplish it with row-map, though from a purely ML perspective it unfortunately didn't improve the validation error (this could be a peculiarity of my dataset, though; it was for an ML class).

Sorry for the radio silence. I'm going to try to clean up the code a bit and put it up in a repo in the next few days. I also want to take a closer look at the other suggestions here :)

kxygk commented 1 year ago

I put up a very simple stripped down demo here: https://github.com/kxygk/caulk

This is just in case anyone is curious how I did it.. Sorry, I haven't been able to refactor this into a lil library or something directly useable :(

If I have a bit more free time towards the end of the summer I may revisit this

I think the core issue is still kinda unaddressed, so it's a bummer this issue has been closed. The replace-missing interface only provides a fixed set of cookie-cutter methods instead of a generic interface for replacing missing values. That said, a preset list of functions for hole-filling could also be useful, but then I'd expect something more akin to the tech.v3.dataset.column-filters namespace. A function with predefined key flags feels a lil oldschool.

harold commented 1 year ago

Looks like a fun musical space you're working in. It's very cool that you were able to test the method you developed using the library.

I still suspect this method has a name and has been thought through quite thoroughly. Chris' original intuition that this sort of operation might best be done in a dense tensor space still strikes me as correct.

> Sorry, I haven't been able to refactor this into a lil library or something directly useable :(

There's no need to apologize, you're of course right that this is the next step in the real work to be done here. What gets done in that step might suggest a way to generalize the api to be more conducive to these kinds of experiments (especially if that work is done with a little bit of an eye toward that potential generalization).

I think the program you linked sort of proves that the initial idea proposed here (changing the way replace-missing works) isn't that important; the experiment was run relatively easily with the existing API. And if it were someone's job at some point to run, say, a hundred such experiments, then the generalization would become super-obvious essentially immediately.

Your contribution is valued.