Closed ezmiller closed 3 years ago
Looks great.
AFAIK, the has-meta? check is not necessary, and you could always use with-meta.
@daslu thanks. Curious why you say the 'has-meta?' check is not necessary? If I didn't add it, it overrode existing metadata already on the dataset, e.g. :name.
Looking forward to pushing onward with this today!
@ezmiller I meant to say that we could use vary-meta in all cases. The use of with-meta where the metadata is nil was not necessary, as far as I understand.
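To make the difference concrete, here is a minimal sketch using a plain map as a hypothetical stand-in for a dataset (the :index key and its :placeholder value are illustrative only, not what the PR stores). It shows that with-meta replaces the whole metadata map, while vary-meta updates whatever is already there and also works when the metadata is nil:

```clojure
;; Hypothetical stand-in for a dataset: a plain map that already carries :name metadata.
(def ds (with-meta {:a [1 2 3]} {:name "my-dataset"}))

;; with-meta replaces the entire metadata map, so :name is lost.
(def replaced (with-meta ds {:index :placeholder}))
(meta replaced)
;; => {:index :placeholder}

;; vary-meta applies a function to the existing metadata, so :name survives.
(def varied (vary-meta ds assoc :index :placeholder))
(meta varied)
;; => {:name "my-dataset", :index :placeholder}

;; vary-meta also works when there is no metadata yet: (assoc nil k v) yields a fresh map.
(def bare {:a [1 2 3]})
(def bare-varied (vary-meta bare assoc :index :placeholder))
(meta bare-varied)
;; => {:index :placeholder}
```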
Oh that makes sense! Good point. Reviews are good :)
Goal / Problem
We want to experiment with building up time series data manipulations as a way to build this time series library. For that, we think we need an "index", where what we mean by an index is probably something like this:
This definition is very likely incomplete or off-kilter. We are still trying to build consensus on what we mean here. For a discussion about this, see here.
Solution
For this PR, we are just adding a very naive/simple approach to an index using java.util.TreeMap. We are aware that this may not be the most efficient implementation, and that the index may itself need to be moved/merged into libraries farther "up" the stack. That is, if we imagine a stack that looks like tech.ml.dataset -> tablecloth -> tablecloth.time, then the index, for which this PR is a start, may ultimately live in tablecloth or even tech.ml.dataset.
This PR supports the following kind of usage:
What's going on here? The index-by function calls a function make-index that builds a TreeMap where the keys are the :date values of the rows and the values are the rows of the dataset. Then slice uses those dates to extract a window of that data, e.g. it is comparable to the pythonic/pandas data['2014-07-04':'2015-07-04'].
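The mechanism described above can be sketched in a few lines. This is not the PR's code: make-index-sketch, slice-sketch, and the sample rows are hypothetical names and data, rows are plain maps rather than a real dataset, and keys are assumed to be unique. Both ends of the range are treated as inclusive, which matches how pandas handles label-based slices:

```clojure
(import '[java.util TreeMap]
        '[java.time LocalDate])

;; Hypothetical rows standing in for a dataset with a :date column.
(def rows
  [{:date (LocalDate/parse "2014-01-01") :value 1}
   {:date (LocalDate/parse "2014-07-04") :value 2}
   {:date (LocalDate/parse "2015-01-01") :value 3}
   {:date (LocalDate/parse "2015-07-04") :value 4}
   {:date (LocalDate/parse "2016-01-01") :value 5}])

(defn make-index-sketch
  "Builds a TreeMap from each row's value under key k to the row itself.
  Assumes the keys are unique and mutually Comparable."
  [k rows]
  (let [index (TreeMap.)]
    (doseq [row rows]
      (.put index (get row k) row))
    index))

(defn slice-sketch
  "Returns the rows whose key lies in [from, to], both ends inclusive."
  [^TreeMap index from to]
  (vec (.values (.subMap index from true to true))))

(def date-index (make-index-sketch :date rows))

(def window
  (slice-sketch date-index
                (LocalDate/parse "2014-07-04")
                (LocalDate/parse "2015-07-04")))

(mapv :value window)
;; => [2 3 4]
```

Because TreeMap keeps its keys sorted, the subMap call returns a view of the range without scanning the rows outside it.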
Open Questions
None of the syntax here is meant to be anything more than an initial suggestion. index-by comes from R's tsibble package; slice, as noted above, emerged more from the Python usage. It's fair to say from the discussion linked above that there's a lack of consensus on exactly how we should think about indexing. This PR doesn't imply an opinion about that yet, but just gets us started.
Still, what we mean by an index is a good question. The R tsibble library describes its index_by function as the "counterpart of group_by() in temporal context". In the discussions we've had, there has been a question as to whether an index is just a group_by operation (see here). That is, maybe an index is not necessary? @genmeblog summarized it this way (comment link):
If we put the question like this, we are saying (I think) an index is necessary (or not) because it is a useful abstraction and a concept that just makes sense in certain contexts.
The other idea we have, though, is that indexes are important as a tool for optimization. I.e. we used a TreeMap here, which should provide logarithmic time for most lookups we'd do in various operations. Using a TreeMap, however, might not be the fastest way to go. We might have speedier abstractions/operations available to us in tech.datatype (e.g. here).
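As one illustration of what an alternative to a TreeMap could look like, the sketch below answers the same kind of range query with two binary searches over an already-sorted key vector. It does not use tech.datatype and is not a proposal for the implementation; sorted-keys, row-values, lower-bound, and slice-sorted are hypothetical names, the keys are assumed to be sorted and duplicate-free, and from is assumed to be no greater than to:

```clojure
(import '[java.util Collections])

;; Hypothetical sorted key column and a parallel vector of row values.
(def sorted-keys [10 20 30 40 50])
(def row-values  [:a :b :c :d :e])

(defn lower-bound
  "Index of the first key >= x in the sorted, duplicate-free vector ks."
  [ks x]
  (let [i (Collections/binarySearch ks x)]
    ;; A negative result encodes (-(insertion point) - 1).
    (if (neg? i) (- (inc i)) i)))

(defn slice-sorted
  "Values whose key lies in [from, to], both ends inclusive.
  Assumes from <= to. Uses two O(log n) binary searches."
  [ks vs from to]
  (let [start (lower-bound ks from)
        i     (Collections/binarySearch ks to)
        ;; Exclusive end: one past `to` if present, else its insertion point.
        end   (if (neg? i) (- (inc i)) (inc i))]
    (subvec vs start end)))

(slice-sorted sorted-keys row-values 20 45)
;; => [:b :c :d]
```

Both approaches locate the range boundaries in logarithmic time. The sorted-vector version avoids per-entry tree nodes but requires the keys to be kept in sorted order.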
This is all just to say that we will probably want to consider how we define an index independently in these two domains: 1) how we optimize the kinds of queries that indexes typically support, and 2) what kind of API/syntax we provide to make the operations that indexes support accessible.
I hope that made some sense :).