techascent / tech.datatype

Efficient numerics for the jvm
Eclipse Public License 1.0

Support for "reindexing" datetime column? #30

Closed (lccambiaghi closed 4 years ago)

lccambiaghi commented 4 years ago

Hi, me again, I managed to hit this repo too ahah!

I am playing with a datetime column in my dataset and I was wondering if you already plan to support "reindexing", by detecting the missing dates in a "date range", and possibly interpolating/filling them?

It sounds complicated so I would understand if it is something you are not interested in supporting!

Thank you once again for your amazing work!!

cnuernber commented 4 years ago

I am interested in anything :-).

Do you have a pointer to an existing pandas method?

Is this a vector->vector translation or is this a dataset->dataset translation?

I have needed this before in order to get smoother graphs for response times, for example when you have a set of responses and then no request for 5 minutes. In that case it would be useful to just duplicate the latest response time but add more values (one every second or so).

lccambiaghi commented 4 years ago

API:

Low level calls:

Wow, pandas code is a mess. I wouldn't know where to begin even with these pointers! It is definitely dataset to dataset. Yes, your example is a fairly common use: propagating the last observation to fill gaps in the index (in pandas you would call reindex(method='ffill')).

genmeblog commented 4 years ago

Hi! I've made it possible to fill missing values with the previous (or next) available value in tablecloth here: https://scicloj.github.io/tablecloth/index.html#replace - it's also a candidate to be moved to the tech.ml stack.

cnuernber commented 4 years ago

Great timing :-). Yes, then we could split this up into 'reindex' that sets new rows to empty and then a replace function. Very nice.
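The split described above can be sketched in plain Clojure, without any dataset library. This is only an illustration under stated assumptions: rows are maps with hypothetical :date and :value keys, the step is fixed at one day, and the input is non-empty. The names reindex and fill-forward are made up for this sketch and are not the tech.ml.dataset or tablecloth API.

```clojure
(import '(java.time LocalDate))

(defn reindex
  "Returns one row per day from the earliest to the latest :date in `rows`.
  Days absent from the input get a nil :value. Hypothetical helper; assumes
  `rows` is non-empty."
  [rows]
  (let [by-date (into {} (map (juxt :date :value)) rows)
        dates   (sort (keys by-date))
        start   (first dates)
        end     (last dates)]
    (->> (iterate #(.plusDays ^LocalDate % 1) start)
         (take-while #(not (.isAfter ^LocalDate % end)))
         (mapv (fn [d] {:date d :value (get by-date d)})))))

(defn fill-forward
  "Replaces each nil :value with the most recent non-nil :value before it.
  Hypothetical helper; assumes `rows` is non-empty."
  [rows]
  (->> rows
       (reductions (fn [prev row]
                     (if (nil? (:value row))
                       (assoc row :value (:value prev))
                       row)))
       vec))

;; Usage: two observations three days apart become four daily rows.
(def observed [{:date (LocalDate/of 2020 1 1) :value 10}
               {:date (LocalDate/of 2020 1 4) :value 40}])

(mapv :value (reindex observed))                ;; => [10 nil nil 40]
(mapv :value (fill-forward (reindex observed))) ;; => [10 10 10 40]
```

Keeping the two steps separate means the fill strategy (previous value, next value, a constant) can be swapped without touching the code that builds the new index.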

genmeblog commented 4 years ago

Implementation hint: RoaringBitmap has two nice functions, previousAbsentValue and nextAbsentValue, to find a run of missing indexes as a contiguous range.
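To illustrate the idea of walking a bitmap to find contiguous runs of missing indexes, here is a rough sketch that swaps in java.util.BitSet from the JDK (nextClearBit and nextSetBit) instead of RoaringBitmap, so that it runs without extra dependencies. The function name missing-ranges is hypothetical, and this is not the tablecloth implementation.

```clojure
(import '(java.util BitSet))

(defn missing-ranges
  "Returns [start end] pairs (both inclusive) covering every run of indexes
  in [0, n) that does not appear in `present`. Hypothetical helper."
  [present n]
  (let [bits (BitSet. (int n))]
    ;; Mark every index that has a value.
    (doseq [i present] (.set bits (int i)))
    (loop [from 0, acc []]
      ;; First missing index at or after `from`.
      (let [start (.nextClearBit bits (int from))]
        (if (>= start n)
          acc
          ;; The run ends just before the next present index, or at n - 1
          ;; when no present index follows (nextSetBit returns -1).
          (let [next-set (.nextSetBit bits (int start))
                end      (if (neg? next-set)
                           (dec n)
                           (dec (min next-set n)))]
            (recur (inc end) (conj acc [start end]))))))))

;; Usage: indexes 2-4 and 7-8 are missing out of 0..9.
(missing-ranges [0 1 5 6 9] 10) ;; => [[2 4] [7 8]]
```

Working on whole runs rather than single indexes lets a fill strategy look up the neighbouring values once per gap instead of once per missing row.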

genmeblog commented 4 years ago

https://github.com/scicloj/tablecloth/blob/master/src/tablecloth/api/missing.clj#L55

cnuernber commented 4 years ago

Low level attempt at this (operation works in double space): https://github.com/techascent/tech.datatype/blob/fill-range/src/tech/v2/datatype/functional.clj#L244

lccambiaghi commented 4 years ago

Chris, that looks great, thanks a lot! Waiting with patience for the support of datetime dtype!

cnuernber commented 4 years ago

They are sort of implicitly supported: https://github.com/techascent/tech.datatype/blob/master/src/tech/v2/datatype/datetime/operations.clj#L945

All datetime objects have a conversion to 64bit long milliseconds. When I implement this in dataset I will take care of that conversion (and back) automatically.
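As a rough picture of what that conversion looks like, the sketch below round-trips a java.time.LocalDateTime through epoch milliseconds using only java.time. It assumes UTC for the conversion, which is a choice made for this example; the helper names are hypothetical and are not the tech.datatype functions.

```clojure
(import '(java.time Instant LocalDateTime ZoneOffset))

(defn ->epoch-millis
  "LocalDateTime -> milliseconds since the epoch, treating the value as UTC.
  Hypothetical helper."
  ^long [^LocalDateTime ldt]
  (.toEpochMilli (.toInstant ldt ZoneOffset/UTC)))

(defn epoch-millis->
  "Milliseconds since the epoch -> LocalDateTime in UTC. Hypothetical helper."
  [^long ms]
  (LocalDateTime/ofInstant (Instant/ofEpochMilli ms) ZoneOffset/UTC))

;; Once in long space, interpolation is ordinary integer arithmetic.
(def a (LocalDateTime/of 2020 1 1 0 0 0))
(def b (LocalDateTime/of 2020 1 3 0 0 0))

(def midpoint
  (epoch-millis-> (quot (+ (->epoch-millis a) (->epoch-millis b)) 2)))
;; midpoint is 2020-01-02T00:00
```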

cnuernber commented 4 years ago

Marking this as fixed here and filing two new bugs in dataset: https://github.com/techascent/tech.ml.dataset/issues/116 https://github.com/techascent/tech.ml.dataset/issues/115

cnuernber commented 4 years ago

@lccambiaghi - tech.ml.dataset version 3.04 - When you have time it would be great to hear if tech.ml.dataset/fill-range-replace works for you.

@genmeblog - Copied your file over and exported the various functions into the tech.ml.dataset namespace. The difference here is a weaker column selection mechanism and no support for grouping. But the implementation has a public function that was private previously - replace-missing-with-strategy. The grouping and such I believe happens outside of that function. For now I wouldn't refactor; it may be better to have tablecloth work on more versions of tech.ml.dataset than just the absolute most current, but I did copy the code and it worked great. I may move your more sophisticated column selection criteria into tech.ml.dataset as I believe that is a solid and unambiguous upgrade to select and select-columns.

lccambiaghi commented 4 years ago

Stupid question.. how can I specify the 'span' when I want a "day" between each entry?

BTW interestingly this code

(->  (ds/->dataset {:dt [(java.time.LocalDateTime/of 2020 01 01 0 0 0)
                         (java.time.LocalDateTime/of 2020 01 05 0 0 0)]})
     (ds/fill-range-replace :dt 1))

results in

1. Unhandled java.lang.OutOfMemoryError
   Java heap space

cnuernber commented 4 years ago

Not a stupid question at all. You can pass a java.time.Duration; if you pass an integer it will be interpreted as milliseconds. The datetime interpolation also happens in millisecond space. Potentially that could happen in microsecond space or be configurable.
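Some back-of-the-envelope arithmetic shows why a span of 1 blows up on the example above. The sketch uses only java.time; the row estimate assumes one row per span step with both endpoints included, which is an assumption about how the fill behaves and not a statement about the library internals. rows-for-span is a hypothetical helper.

```clojure
(import '(java.time Duration LocalDateTime))

(def start (LocalDateTime/of 2020 1 1 0 0 0))
(def end   (LocalDateTime/of 2020 1 5 0 0 0))

;; Width of the gap in milliseconds: 4 days.
(def gap-ms (.toMillis (Duration/between start end)))
;; => 345600000

;; One day expressed in milliseconds.
(def day-ms (.toMillis (Duration/ofDays 1)))
;; => 86400000

(defn rows-for-span
  "Estimated row count when stepping through the gap every `span-ms`
  milliseconds, endpoints included. Hypothetical helper."
  [span-ms]
  (inc (quot gap-ms span-ms)))

(rows-for-span 1)      ;; => 345600001, one row per millisecond
(rows-for-span day-ms) ;; => 5, one row per day
```

With a one-day span, either an integer such as day-ms or (Duration/ofDays 1), the same four-day gap needs only five rows.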

Some minimal tests.

lccambiaghi commented 4 years ago

Amazing, thank you for the pointer, I used the handy (dtype-dt/milliseconds-in-day)! I am extremely happy with the solution, thank you so much!!

cnuernber commented 4 years ago

That is great and you are very welcome! Keep em coming :-)

genmeblog commented 4 years ago

Great! Thanks @cnuernber I will switch to migrated code asap. Regarding grouping and column selection - we can leave it in tablecloth for a while (or forever).

genmeblog commented 4 years ago

Hey Chris, why fill-range-replace and not just fill-range?

cnuernber commented 4 years ago

Because it does the replace-missing operation just after the fill range operation. Honestly I would love a better name in general for fill-range. interpolate-spans also didn't seem very good. All these names seem bad to me or at least extremely obtuse.

genmeblog commented 4 years ago

Ah, you're right. I don't know a better name either. Pandas' reindex is also not good.