Closed kxygk closed 1 year ago
Welcome in! Thanks for the feedback.
> I'd like to go to each `nil`, take its row and find the row which both has a non-nil in the same column and is closest in terms of Cartesian distance (across all features/columns) and then copy that value over.
Yes, the first way I'd try this is with `row-map`.
Thinking about it for a few seconds, the nearest queries may be complicated by the presence of the `nil`s, but I suspect it's manageable.
I don't know of any name for the procedure you're describing here. Do you know if it has a name?
Definitely interested to see what you come up with. Report back when you try it.
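To make the nil complication concrete, here is a minimal sketch of the lookup described above, written against plain row maps rather than a dataset. The helper names `nil-aware-distance` and `nearest-donor-value` are hypothetical and are not part of tablecloth or tech.ml.dataset. The sketch assumes the distance is computed only over columns where both rows have a value, which is one possible way to handle the nils, not necessarily the right one.

```clojure
;; Sketch only: rows are plain maps of column keyword -> number or nil.
;; Both helpers below are hypothetical, not library functions.

(defn nil-aware-distance
  "Euclidean distance over the columns where both rows have a value.
  Returns nil when the rows share no non-nil column."
  [cols row-a row-b]
  (let [shared (filter #(and (some? (get row-a %)) (some? (get row-b %))) cols)]
    (when (seq shared)
      (Math/sqrt
       (reduce + (map (fn [c]
                        (let [d (- (double (get row-a c)) (double (get row-b c)))]
                          (* d d)))
                      shared))))))

(defn nearest-donor-value
  "Value of `col` taken from the row closest to `row` that has `col` filled in.
  Returns nil when no comparable donor row exists."
  [rows cols row col]
  (let [other-cols (remove #{col} cols)
        candidates (for [donor rows
                         :when (some? (get donor col))
                         :let [d (nil-aware-distance other-cols row donor)]
                         :when d]
                     [d (get donor col)])]
    (when (seq candidates)
      (second (apply min-key first candidates)))))

(def rows
  [{:a 1.0 :b 2.0 :c 10.0}
   {:a 5.0 :b 6.0 :c 20.0}
   {:a 1.5 :b 2.5 :c nil}])

(nearest-donor-value rows [:a :b :c] (nth rows 2) :c)
;; => 10.0
```

One caveat with this choice: distances computed over different numbers of shared columns are not strictly comparable, so some normalisation may be needed on real data.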
Generally `replace-missing` works on only one column at a time (even if you select multiple columns, it goes through them one by one). As I described on Reddit, this function provides the functionality found in Pandas/dplyr, so it should be enough for most cases. And yes, this function is not extensible in the way you've described.
Anyway, here is a possible approach.
- Call `tech.v3.dataset.column/missing` on a column. Missing values are kept in a RoaringBitmap structure. It's a sequence with some nice features. For example, `.previousAbsentValue` or `.nextAbsentValue` will give you the index of a non-missing row.
- `replace-missing` with the `:value` strategy accepts a map of index-value pairs to replace the full set of values.

(require '[tablecloth.api :as tc]
         '[tech.v3.dataset.column :as col])

(def DSm2 (tc/dataset {:a [nil nil nil 1.0 2 nil nil nil nil nil 4 nil 11 nil nil]
:b [2 2 2 nil nil nil nil nil nil 13 nil 3 4 5 5]}))
DSm2
;; => _unnamed [15 2]:
;; | :a | :b |
;; |-----:|---:|
;; | | 2 |
;; | | 2 |
;; | | 2 |
;; | 1.0 | |
;; | 2.0 | |
;; | | |
;; | | |
;; | | |
;; | | |
;; | | 13 |
;; | 4.0 | |
;; | | 3 |
;; | 11.0 | 4 |
;; | | 5 |
;; | | 5 |
;; indexes of missing values
(col/missing (DSm2 :a)) ;; => {0,1,2,5,6,7,8,9,11,13,14}
(col/missing (DSm2 :b)) ;; => {3,4,5,6,7,8,10}
(class (col/missing (DSm2 :a))) ;; => org.roaringbitmap.RoaringBitmap
;; index of the nearest non-missing value in column `:a` starting from 0
(.nextAbsentValue (col/missing (DSm2 :a)) 0) ;; => 3
;; there is no previous non-missing
(.previousAbsentValue (col/missing (DSm2 :a)) 0) ;; => -1
;; replace some missing values by hand
(tc/replace-missing DSm2 :a :value {0 100 1 -100 14 -1000})
;; => _unnamed [15 2]:
;; | :a | :b |
;; |--------:|---:|
;; | 100.0 | 2 |
;; | -100.0 | 2 |
;; | | 2 |
;; | 1.0 | |
;; | 2.0 | |
;; | | |
;; | | |
;; | | |
;; | | |
;; | | 13 |
;; | 4.0 | |
;; | | 3 |
;; | 11.0 | 4 |
;; | | 5 |
;; | -1000.0 | 5 |
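
The two pieces above can be combined: compute a value for each missing index with your own rule, collect the results into an index-to-value map, and hand that map to the `:value` strategy. The sketch below shows only the map-building step in plain Clojure. `fill-map` and `nearest-present` are hypothetical helpers, and the rule used here (take the value at the closest non-missing index) is a deliberately simple stand-in for whatever custom logic is needed.

```clojure
;; Sketch only: `fill-map` and `nearest-present` are hypothetical helpers.

(defn fill-map
  "Builds {index value} for every index in `missing-idxs` by calling
  (f values idx), where `values` is the column as a vector containing nils."
  [values missing-idxs f]
  (into {} (map (fn [idx] [idx (f values idx)])) missing-idxs))

(defn nearest-present
  "Example rule: the value stored at the closest non-nil index."
  [values idx]
  (let [present (keep-indexed (fn [i v] (when (some? v) i)) values)]
    (nth values (apply min-key #(Math/abs (long (- % idx))) present))))

;; The same data as column :a above, written out by hand.
(def a-values  [nil nil nil 1.0 2 nil nil nil nil nil 4 nil 11 nil nil])
(def a-missing [0 1 2 5 6 7 8 9 11 13 14])

(def a-fill (fill-map a-values a-missing nearest-present))

(select-keys a-fill [0 5 14])
;; => {0 1.0, 5 2, 14 11}

;; With a real dataset the result would be passed on as shown earlier:
;; (tc/replace-missing DSm2 :a :value a-fill)
```

On a real dataset the inputs would presumably come from `(seq (col/missing (DSm2 :a)))` and `(vec (DSm2 :a))`; that the column reads back with `nil` for missing entries is an assumption here, so check it before relying on it.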
@kxygk - have you solved your issues with the dataset library? Another approach would be to transform the dataset into a tensor and then create a compute tensor that does the right thing when the x,y location has a missing value.
Thanks for checking in!
I did manage to accomplish it with `row-map`, though from a purely ML perspective it unfortunately didn't improve the validation error (that could be a peculiarity of my dataset, though; it was for an ML class).
Sorry for the radio silence. I'm going to try to clean up the code a bit and put it up in a repo in the next few days. I also want to take a closer look at the other suggestions here :)
I put up a very simple stripped down demo here: https://github.com/kxygk/caulk
This is just in case anyone is curious how I did it. Sorry, I haven't been able to refactor this into a lil library or something directly useable :(
If I have a bit more free time towards the end of the summer I may revisit this
I think the core issue is still kinda unaddressed, so it's a bummer this issue has been closed. The `replace-missing` interface only provides a fixed set of cookie-cutter methods instead of a generic interface for replacing missing values. That said, a preset list of functions for hole-filling could also be useful, but then I'd expect something more akin to the `tech.v3.dataset.column-filters` namespace. A function with predefined key flags feels a lil oldschool.
Looks like a fun musical space you're working in. It's very cool that you were able to test the method you developed using the library.
I still suspect this method probably has a name and has been thought through quite thoroughly. Chris' original intuition that this sort of operation might best be done in a dense tensor space still strikes me as correct.
> Sorry, I haven't been able to refactor this into a lil library or something directly useable :(
There's no need to apologize, you're of course right that this is the next step in the real work to be done here. What gets done in that step might suggest a way to generalize the api to be more conducive to these kinds of experiments (especially if that work is done with a little bit of an eye toward that potential generalization).
I think the program you linked sort of proves that the initial idea proposed here (changing the way `replace-missing` works) isn't that important; the experiment was run relatively easily with the existing API. And if it were someone's job at some point to run, say, a hundred such experiments, then the generalization would become super-obvious essentially immediately.
Your contribution is valued.
I'm just getting started with using `tech.ml.dataset`, so it's possible, as Steve Jobs says, I'm "holding it wrong".

I have data with holes and I'd like to fill them somehow. The problem is that the methods in `replace-missing` are... kinda primitive. They seem to only work with extremely basic sorted data (like a time series, and even then the interpolator doesn't take into account any time dimension). The issue is that the given interface doesn't seem to allow a clear way to extend it with one's own methods. It's just a list of baked-in methods, from what I understand.

My data isn't meaningfully sorted. It's essentially just a list of N-dimensional vectors. So in my case I'd like to go to each `nil`, take its row and find the row which both has a non-nil in the same column and is closest in terms of Cartesian distance (across all features/columns) and then copy that value over.

I was wondering if it'd be possible to expose a more generic interface. It'd just be a function that iterates over the nils in the dataset, and the user would pass in a functor of the form:

(fn [dataset row nil-row-idx nil-col-idx]
  ;; user calculates a new value for the nil
  )

I'd assume the `dataset` would always be the original one, so that the nil-replacing order doesn't matter. Maybe it'd already be filtered to rows that are non-nil in the same columns... I'll admit it's a bit of a half-baked idea.

Or maybe the current interface has some way for me to implement what I'm looking to do and I just don't get it (I think I will try to do it with `row-map` first).
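
For illustration, the proposed interface could look roughly like the sketch below. It operates on a vector of plain row maps, not on a real dataset, and `replace-nils-with` is a hypothetical name, not an existing function in tech.ml.dataset or tablecloth. The user function follows the shape given above, except that the column is identified by its key instead of an index, and it always receives the original rows so the visiting order does not affect the result.

```clojure
;; Sketch only: `replace-nils-with` and `column-mean` are hypothetical.

(defn replace-nils-with
  "Calls (f dataset row nil-row-idx nil-col) for every nil cell and returns
  the rows with the results filled in. `f` always sees the original rows."
  [rows cols f]
  (vec
   (map-indexed
    (fn [row-idx row]
      (reduce (fn [filled col]
                (if (nil? (get row col))
                  (assoc filled col (f rows row row-idx col))
                  filled))
              row
              cols))
    rows)))

(defn column-mean
  "Example user function: the mean of the non-nil values in the same column."
  [dataset _row _row-idx col]
  (let [present (keep #(get % col) dataset)]
    (when (seq present)
      (/ (reduce + present) (double (count present))))))

(replace-nils-with
 [{:a 1.0 :b nil}
  {:a nil :b 4.0}
  {:a 3.0 :b 6.0}]
 [:a :b]
 column-mean)
;; => [{:a 1.0, :b 5.0} {:a 2.0, :b 4.0} {:a 3.0, :b 6.0}]
```

Any other rule, such as a nearest-row lookup, would plug in the same way by swapping out `column-mean`.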