scicloj / tablecloth.time

Tools for the processing and manipulation of time-series data in Clojure.
Other
18 stars 1 forks source link

Add simplification for slice api #4

Closed ezmiller closed 3 years ago

ezmiller commented 3 years ago

Goal / Purpose

Simplify the slicing API so that a user does not have to call a specific method (e.g. slice-by-date) that matches the time unit they are working with on the index. This PR builds on #2.

Solution

The approach I took here works with the constraint that we've so far been sticking to, namely: that the user must be responsible for knowing that the unit that they wish to slice by matches the unit of the indexed column(s).

This constraint comes into play in this PR in relation to how we parse the datetime strings that the users provide to the slice method. In order to parse these strings we assume that their format should conform to that expected by the parse method of the time unit class of the index's keys. Specifically, we grab the class type of the first key of the index, and then use that type along with Clojure's multimethods to parse the string using the right java.time unit's parse method.

So now we can do something like this:

(-> data-by-year-month
    (index-by :year-month) ;; assuming the :year-month is a column with java.time.YearMonth date items
    (ops/slice "1949-01" "1949-03"))

While the following would fail:

(-> data-by-year-month
    (index-by :year-month)
    (ops/slice "1949-01-01" "1949-03-01"))

The PR also adds a few other things:

Work Remaining / Open Questions

Some pieces that may need work in the future:

daslu commented 3 years ago

Nice!!

daslu commented 3 years ago

One comment:

I think there is something misleading about the tests and code examples such as

(-> data-by-year-month
    (index-by :year-month)
    (ops/slice "1949-01-01" "1949-03-01"))

It would be inefficient to create an index just for one slicing operation. An index is useful under the assumption of multiple slicing operations.

Maybe we should eventually support a version of slicing that does not require creating an index.

ezmiller commented 3 years ago

One comment:

I think there is something misleading about the tests and code examples such as

(-> data-by-year-month
    (index-by :year-month)
    (ops/slice "1949-01-01" "1949-03-01"))

It would be inefficient to create an index just for one slicing operation. An index is useful under the assumption of multiple slicing operations.

Maybe we should eventually support a version of slicing that does not require creating an index.

That makes sense. You mean a version of slicing that is able to achieve the slice without an index at all, even under the hood? Is this purely a performance/optimization concern, or also syntactical in the sense that it is weird to require the user to first call index on a dataset before slicing?

daslu commented 3 years ago

You mean a version of slicing that is able to achieve the slice without an index at all, even under the hood?

Yes.

Is this purely a performance/optimization concern, or also syntactical in the sense that it is weird to require the user to first call index on a dataset before slicing?

The concern is mainly about performance, so I think there is no rush with this option.

ezmiller commented 3 years ago

@daslu okay understood. I don't want to get too tied up here, and I think in some sense we can afford to merge this and move on. There's nothing stopping us from adding to it/modifying it in the future. Do you agree?

That said, your question inadvertently ended up raising a few more questions for me, but I'll discuss in Zulip (and add link here) since I think there's more visibility there at the moment...