lccambiaghi closed this issue 4 years ago
I am interested in anything :-).
Do you have a pointer to an existing pandas method?
Is this a vector->vector translation or is this a dataset->dataset translation?
I have needed this before to get smoother graphs of response times, for example when you have a set of responses and then no requests for 5 minutes. In that case it would be useful to duplicate the latest response time at additional points (say, one every second).
API:
Low level calls:
Wow, pandas code is a mess. I wouldn't know where to begin even with these pointers! It is definitely dataset to dataset. Yes, your example is a fairly common use: propagating the last observation to fill gaps in the index (in pandas you would call reindex(method='ffill')).
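To make "propagating the last observation" concrete, here is a minimal sketch in plain Clojure. The name forward-fill is hypothetical and is not part of pandas, tablecloth, or tech.ml.dataset; it only illustrates the idea on a sequence where nil marks a missing value.

```clojure
;; Hypothetical helper for illustration only; not a library function.
(defn forward-fill
  "Replaces each nil in xs with the most recent non-nil value before it.
  Leading nils stay nil because there is nothing to carry forward."
  [xs]
  ;; reductions threads the last seen value through the sequence;
  ;; rest drops the initial nil seed.
  (rest (reductions (fn [prev x] (if (nil? x) prev x)) nil xs)))

(forward-fill [120 nil nil 95 nil])
;; => (120 120 120 95 95)
```

A real reindex would first add the missing index rows (as empty) and then apply a fill like this to the other columns.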
Hi! I've made it possible in tablecloth to fill missing values from the previous (or next) available value: https://scicloj.github.io/tablecloth/index.html#replace - it's also a candidate to be moved to the tech.ml stack.
Great timing :-). Yes, then we could split this up into 'reindex' that sets new rows to empty and then a replace function. Very nice.
Implementation hint: RoaringBitmap has two nice functions, previousAbsentValue and nextAbsentValue, for finding a range of missing indexes as a continuous range.
Low level attempt at this (operation works in double space): https://github.com/techascent/tech.datatype/blob/fill-range/src/tech/v2/datatype/functional.clj#L244
Chris, that looks great, thanks a lot! Waiting with patience for the support of datetime dtype!
They are sort of implicitly supported: https://github.com/techascent/tech.datatype/blob/master/src/tech/v2/datatype/datetime/operations.clj#L945
All datetime objects have a conversion to 64-bit long milliseconds. When I implement this in dataset I will take care of that conversion (and back) automatically.
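As a rough illustration of that round trip, the sketch below converts a java.time.LocalDateTime to epoch milliseconds and back using only java.time. The helper names are hypothetical and the use of UTC is an assumption made for simplicity; this is not the code tech.datatype uses internally.

```clojure
(import '[java.time Instant LocalDateTime ZoneOffset])

;; Hypothetical helpers for illustration; UTC is an assumption.
(defn local-date-time->millis
  "Converts a LocalDateTime to milliseconds since the epoch, treating it as UTC."
  ^long [^LocalDateTime ldt]
  (.toEpochMilli (.toInstant ldt ZoneOffset/UTC)))

(defn millis->local-date-time
  "Converts epoch milliseconds back to a LocalDateTime in UTC."
  [^long ms]
  (LocalDateTime/ofInstant (Instant/ofEpochMilli ms) ZoneOffset/UTC))

(def start (LocalDateTime/of 2020 1 1 0 0 0))

(local-date-time->millis start)
;; => 1577836800000

(= start (millis->local-date-time (local-date-time->millis start)))
;; => true
```

Once the column is a sequence of longs, any numeric gap-filling operation can run on it and the result can be converted back to the original datetime type.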
Marking this as fixed here and filing two new bugs in dataset: https://github.com/techascent/tech.ml.dataset/issues/116 https://github.com/techascent/tech.ml.dataset/issues/115
@lccambiaghi - tech.ml.dataset version 3.04 - When you have time, it would be great to hear if tech.ml.dataset/fill-range-replace works for you.
@genmeblog - Copied your file over and exported the various functions into the tech.ml.dataset namespace. The difference here is a weaker column selection mechanism and no support for grouping. But the implementation has a public function that was previously private: replace-missing-with-strategy. The grouping and such, I believe, happens outside of that function. For now I wouldn't refactor; it may be better to have tablecloth work with more versions of tech.ml.dataset than just the absolute most current, but I did copy the code and it worked great. I may move your more sophisticated column selection criteria into tech.ml.dataset, as I believe that is a solid and unambiguous upgrade to select and select-columns.
Stupid question: how can I specify the 'span' when I want a day between each entry?
BTW interestingly this code
(require '[tech.ml.dataset :as ds])
(-> (ds/->dataset {:dt [(java.time.LocalDateTime/of 2020 1 1 0 0 0)
                        (java.time.LocalDateTime/of 2020 1 5 0 0 0)]})
    (ds/fill-range-replace :dt 1))
results in
1. Unhandled java.lang.OutOfMemoryError
Java heap space
Not a stupid question at all. You can pass a java.time.Duration; if you pass an integer, it is interpreted as milliseconds. The datetime interpolation also happens in millisecond space... potentially that could happen in microsecond space or be configurable.
Some minimal tests.
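The arithmetic behind the span choice can be sketched as follows. This is only an illustration in plain Clojure with hypothetical names, not how fill-range is implemented; it assumes a filled range simply holds one value per span step. It suggests why an integer span of 1 (read as 1 millisecond) over a four-day gap is a plausible cause of the heap exhaustion reported above.

```clojure
(import '[java.time Duration])

;; Milliseconds in one day, derived from java.time.Duration.
(def ms-per-day (.toMillis (Duration/ofDays 1)))

;; Hypothetical helper: how many values a gap needs if one value is
;; emitted per span step, both endpoints included.
(defn points-needed
  [gap-ms span-ms]
  (inc (quot gap-ms span-ms)))

(points-needed (* 4 ms-per-day) ms-per-day)
;; => 5

(points-needed (* 4 ms-per-day) 1)
;; => 345600001
```

With a one-day span the four-day gap needs only five rows, while a one-millisecond span needs hundreds of millions, so a Duration of one day (or the equivalent number of milliseconds) is the intended value for daily data.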
Amazing, thank you for the pointer, I used the handy (dtype-dt/milliseconds-in-day)! I am extremely happy with the solution, thank you so much!!
That is great and you are very welcome! Keep em coming :-)
Great! Thanks @cnuernber, I will switch to the migrated code asap.
Regarding grouping and column selection - we can leave it in tablecloth for a while (or forever).
Hey Chris, why fill-range-replace and not just fill-range?
Because it does the replace-missing operation just after the fill range operation. Honestly I would love a better name in general for fill-range. interpolate-spans also didn't seem very good. All these names seem bad to me or at least extremely obtuse.
Ah, you're right. I don't know a better name either. Pandas' reindex isn't a good name either.
Hi, me again, I managed to end up in this repo too, haha!
I am playing with a datetime column in my dataset and I was wondering if you already plan to support "reindexing", by detecting the missing dates in a "date range", and possibly interpolating/filling them?
It sounds complicated so I would understand if it is something you are not interested in supporting!
Thank you once again for your amazing work!!