scicloj / tablecloth.time

Tools for the processing and manipulation of time-series data in Clojure.

Decide how to handle gaps in java.time support in tech.v3.datatype.datetime #38

Open ezmiller opened 3 years ago

ezmiller commented 3 years ago

tech.v3.datatype.datetime supports only a narrow list of java.time classes. My understanding is that the reasons for this are a mixture of principle and practicality. There's a general tendency in tech.datatype's datetime support to avoid awareness of the distinct time classes (unlike e.g. the tick time stack). There are also java.time classes, such as java.time.Year, that one might arguably better treat as just a number. Adding support for all these different classes in functions like descriptive-statistics and in tech.v3.datatype.functional would also be a lot of work.

That said, it can feel as though there are gaps in tech.v3.datatype's datetime support. For example, you cannot do this:

(tech.v3.datatype.datetime/plus-temporal-amount #time/year "1970" (range 10) :years)

although it might feel natural to do so. You get an error that reads:

Data datatype (:year) is not a date time datatype.

which could be confusing, since intuitively someone might think it should be, even if they don't know that java.time.Year is a class.
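For comparison, here is a minimal sketch of what the call above is reaching for, written with plain java.time interop rather than tech.v3.datatype.datetime. The var names are made up for illustration. It shows both readings of the problem: keeping java.time.Year as a distinct class, and treating a year as just a number.

```clojure
(import '(java.time Year))

;; Option A: keep java.time.Year objects and add years through interop.
;; This bypasses tech.v3.datatype.datetime entirely.
(def years-as-objects
  (mapv #(.plusYears (Year/of 1970) (long %)) (range 10)))

;; Option B: treat a year as a plain integer, so ordinary arithmetic applies.
(def years-as-numbers
  (mapv #(+ 1970 %) (range 10)))
```

Both produce the ten years starting at 1970; the second needs no time-specific support at all, which is the argument for treating Year as a number.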

In tablecloth.time, we want to make a very smooth experience for beginners, so this is a problem for us. Right now, a clear solution doesn't present itself and more experimentation is probably needed. Each of the classes that tech.datatype does not support may pose its own unique problems as well.

Some of those are:

And then also extra classes provided by org.threeten such as YearQuarter.

ezmiller commented 3 years ago

After a chat with @cnuernber, I think we can think of it this way for now:

The key idea: Because we have chosen to follow the philosophy of tech.v3.datatype.datetime, which tries to minimize the use and awareness of distinct types (read: classes) of time, we should try not to use distinct types for these classes.

From there we can think of this in a prioritized way:

  1. In the short term, we can encourage the use of two rows, both numbers, to manage year-months or year-quarters.
  2. Eventually, we may be able to extend tech.ml.dataset to support these types. @cnuernber described an approach worth exploring that would manage year-months in terms of epoch-months:

Perhaps the datetime system in datatype needs to be extensible to new temporal types and then I think working through making year-month or something like that work would be good but you would be in pure datetime land. For year, month you could just use epoch-months and have a single integer that you then built more operations (such as get-year, etc) from. Then you have +,-,<,>, etc basic operations working, serialization to arrow or parquet, etc.

In an ideal system you could get all of that by defining a conversion from year-month to epoch-month and lots of things would 'just work', but that is serious type engineering, of the kind where you need a better type resolution system than anything I wrote in dtype-next.
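To make the epoch-months idea concrete, here is a rough sketch using only java.time interop. The function names `year-month->epoch-month` and `epoch-month->year-month` are hypothetical helpers invented for this illustration; they are not part of tech.v3.datatype or tablecloth.time, and the choice of 1970-01 as month zero is an assumption made by analogy with epoch days.

```clojure
(import '(java.time YearMonth))

;; Hypothetical helper: months elapsed since 1970-01, as a single long.
(defn year-month->epoch-month
  ^long [^YearMonth ym]
  (+ (* 12 (- (.getYear ym) 1970))
     (dec (.getMonthValue ym))))

;; Hypothetical inverse: rebuild the YearMonth by offsetting from 1970-01.
;; plusMonths accepts negative offsets, so months before 1970 also work.
(defn epoch-month->year-month
  ^YearMonth [^long em]
  (.plusMonths (YearMonth/of 1970 1) em))

;; Operations such as get-year can then be built on top of the integer form.
(defn epoch-month->year
  [^long em]
  (.getYear (epoch-month->year-month em)))
```

Because the epoch-month is a plain integer, `+`, `-`, `<` and `>` work on it directly, for example `(< (year-month->epoch-month (YearMonth/of 2020 12)) (year-month->epoch-month (YearMonth/of 2021 1)))`.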

Re #1, the question may be: