Add an initial attempt at an index

ezmiller commented 3 years ago

Goal / Problem

We want to experiment with building up time series data manipulations as a way to build this time series library. For that, we think we need an "index", by which we probably mean something like this:

a data structure that allows us to refer to certain rows or ranges of rows in a dataset;
a data structure that supports methods for optimizing certain types of operations on a set of data (e.g. subsetting, resampling etc);
a data structure that includes an awareness of some kind of order (e.g. temporal, alphabetical, etc)

This definition is very likely incomplete or off-kilter. We are still trying to build consensus on what we mean here. For a discussion about this, see here.

Solution

For this PR, we are just adding a very naive/simple approach to an index using java.util.TreeMap. We are aware that this may not be the most efficient implementation, and that the index may itself need to be moved or merged into libraries farther "up" the stack. That is, if we imagine a stack that looks like tech.ml.dataset -> tablecloth -> tablecloth.time, then the index that this PR starts may ultimately live in tablecloth or even tech.ml.dataset.

This PR supports the following kind of usage:

(-> data
    (index-by :date)
    (slice "1949-01-01" "1949-07-01"))

What's going on here? The index-by function calls a function make-index that builds a TreeMap where the keys are the date values in the column :date and the values are rows in the dataset. Then slice uses that index to extract the window of rows between the two given dates. It is comparable to the pandas expression data['2014-07-04':'2015-07-04'].
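To make the mechanics concrete, here is a minimal sketch of the idea on plain Clojure data. It is not the code in this PR: the dataset is stood in for by a vector of row maps with made-up values, the index maps each date to a row position rather than to the row itself, and both ends of the slice are assumed to be inclusive (as in the pandas example). Duplicate dates would overwrite each other in this toy version.

```clojure
(import '(java.util TreeMap))

;; Hypothetical stand-in for a dataset: a vector of row maps (toy values).
(def data
  [{:date "1949-01-01" :value 1}
   {:date "1949-04-01" :value 2}
   {:date "1949-07-01" :value 3}
   {:date "1949-10-01" :value 4}])

(defn make-index
  "Builds a TreeMap from each row's value under k to the row's position."
  [rows k]
  (let [index (TreeMap.)]
    (doseq [[i row] (map-indexed vector rows)]
      (.put index (get row k) i))
    index))

(defn index-by
  "Attaches an index on k to the rows as metadata, keeping existing metadata."
  [rows k]
  (vary-meta rows assoc :index (make-index rows k)))

(defn slice
  "Returns the rows whose indexed key lies between from and to, inclusive."
  [rows from to]
  (let [^TreeMap index (:index (meta rows))]
    ;; subMap returns a view of the keys in the requested range, in key order.
    (mapv #(nth rows %) (.values (.subMap index from true to true)))))

(-> data
    (index-by :date)
    (slice "1949-01-01" "1949-07-01"))
```

ISO-8601 date strings sort lexicographically in chronological order, which is why plain strings work as keys here; real date types would need to be Comparable as well.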

Open Questions

None of the syntax here is meant to be anything more than an initial suggestion. index-by comes from R's tsibble package; slice, as noted above, emerged more from the Python usage. It's fair to say from the discussion linked above that there's a lack of consensus on exactly how we should think about indexing. This PR doesn't imply an opinion about that yet, but just gets us started.

Still, what we mean by an index is a good question. The R tsibble library describes their index_by function as the "counterpart of group_by() in temporal context". In the discussions we've had there has been a question as to whether an index is just a group_by operation (see here). That is, maybe an index is not necessary? @genmeblog summarized it this way:

The question is: is it necessary? Implicit index_by(Year_Month = ~ yearmonth(.)) vs explicit (group-by ds (fn [row] (yearmonth (:index-column row))))

comment link

If we put the question like this, we are saying (I think) an index is necessary (or not) because it is a useful abstraction and a concept that just makes sense in certain contexts.
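For comparison, the explicit version of that grouping can be sketched with clojure.core/group-by on plain row maps. This is only an illustration under assumptions: the rows are made-up toy data, and year-month is a hypothetical helper built on java.time, not a function from this PR or from tablecloth.

```clojure
(import '(java.time LocalDate YearMonth))

;; Toy rows standing in for a dataset.
(def rows
  [{:date "1949-01-15" :value 1}
   {:date "1949-01-28" :value 2}
   {:date "1949-02-03" :value 3}])

(defn year-month
  "Hypothetical helper: parses an ISO-8601 date string and returns its
  year and month as a string such as \"1949-01\"."
  [date-str]
  (str (YearMonth/from (LocalDate/parse date-str))))

;; Explicit grouping: the caller spells out how the grouping key is derived.
(def by-month
  (group-by (comp year-month :date) rows))
```

An implicit index_by would hide the `(comp year-month :date)` step behind knowledge of which column is the index, which is the abstraction being debated.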

The other idea we have, though, is that indexes are important as a tool for optimization. I.e. we used a TreeMap here, which should provide logarithmic time for most lookups we'd do in various operations. Using a TreeMap, however, might not be the fastest way to go. We might have speedier abstractions/operations available to us in tech.datatype (e.g. here).
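As one illustration of an alternative to a TreeMap, a range lookup can also be done with binary search over an already-sorted column. The sketch below uses only java.util.Collections/binarySearch on a plain Clojure vector; it is not the tech.datatype approach referred to above, and the helper names are hypothetical. It assumes the values are sorted and free of duplicates.

```clojure
(import '(java.util Collections))

;; Toy sorted column of ISO-8601 date strings.
(def sorted-dates ["1949-01-01" "1949-04-01" "1949-07-01" "1949-10-01"])

(defn lower-bound
  "Position of the first element >= x in the sorted, duplicate-free vector xs."
  [xs x]
  (let [i (Collections/binarySearch xs x)]
    ;; A negative result encodes the insertion point as (-(insertion point) - 1).
    (if (neg? i) (- (inc i)) i)))

(defn slice-positions
  "Half-open range [start end) of positions whose values lie in [from, to]."
  [xs from to]
  (let [start (lower-bound xs from)
        i     (Collections/binarySearch xs to)
        end   (if (neg? i) (- (inc i)) (inc i))]
    [start end]))

(let [[start end] (slice-positions sorted-dates "1949-01-01" "1949-07-01")]
  (subvec sorted-dates start end))
```

Both lookups are logarithmic, like the TreeMap, but the sorted column needs no separate tree structure; whether that is actually faster in practice would need measuring.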

This is all just to say that we will probably want to consider how we define an index independently in these two domains: 1) how we optimize the kinds of queries that indexes typically support, and 2) what kind of api/syntax we offer to make the operations that indexes support accessible.

I hope that made some sense :).

daslu commented 3 years ago

Looks great.

AFAIK, the has-meta? check is not necessary, and you could always use with-meta.

https://github.com/scicloj/tablecloth.time/pull/1/files#diff-48171233bb7ab91b23c58e404b1a4377e0792063c93b0935c26c2e7bf2ddd5aeR20

ezmiller commented 3 years ago

@daslu thanks. Curious why you say the 'has-meta?' check is not necessary? If I didn't add it, it overrode existing metadata already on the dataset, e.g. :name.

Looking forward to pushing onward with this today!

daslu commented 3 years ago

@ezmiller I meant to say that we could use vary-meta in all cases. The use of with-meta where the metadata is nil was not necessary, as far as I understand.
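A small sketch of the difference, using a plain vector in place of a dataset and a placeholder keyword in place of a real index:

```clojure
;; A value that already carries metadata, e.g. a :name.
(def ds (with-meta [{:a 1}] {:name "my-dataset"}))

;; with-meta replaces the whole metadata map, so :name is lost.
(meta (with-meta ds {:index :some-index}))
;; => {:index :some-index}

;; vary-meta applies a function to the existing metadata, so :name survives.
(meta (vary-meta ds assoc :index :some-index))
;; => {:name "my-dataset", :index :some-index}

;; vary-meta also works when there is no metadata yet,
;; because (assoc nil k v) returns a new map.
(meta (vary-meta [{:a 1}] assoc :index :some-index))
;; => {:index :some-index}
```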

ezmiller commented 3 years ago

Oh that makes sense! Good point. Reviews are good :)
