holgerbrandl / krangl

krangl is a {K}otlin DSL for data w{rangl}ing
MIT License

Support ZonedDateTime and LocalDateTime DataCol #87

Open alphaho opened 4 years ago

alphaho commented 4 years ago

I've been using pandas for my data manipulation for quite some time and would very much like to switch to Kotlin with krangl, as it has much better type system support.

When I tried to port one of my pandas scripts to krangl, I found that it lacks support for ZonedDateTime, LocalDateTime and Duration as a DataCol, so I need a lot of mapping and casting to work around it, which is not very straightforward.

For example, pandas has support for:

It would be much better if such capabilities were included in the library.

holgerbrandl commented 4 years ago

Great suggestion which is on the roadmap already.

The trickiest question is which type to use internally in a future DateCol. To keep it simple, I think only a single format should be supported (if possible). https://stackoverflow.com/questions/32437550/whats-the-difference-between-instant-and-localdatetime gives a great overview, but I'm still not sure which one would be generic enough to support all use cases.

To me, Instant seems the most versatile, and it can be mapped to a time zone as detailed in https://mkyong.com/java8/java-convert-instant-to-localdatetime/
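A minimal sketch of that mapping, using only `java.time` and no krangl API. The helper names `toZoned` and `toLocal` are hypothetical and exist only for illustration:

```kotlin
import java.time.Instant
import java.time.LocalDateTime
import java.time.ZoneId
import java.time.ZonedDateTime

// An Instant is a point on the UTC timeline and carries no zone of its own.
// Attaching a zone yields a ZonedDateTime that still knows its offset.
fun toZoned(instant: Instant, zone: ZoneId): ZonedDateTime = instant.atZone(zone)

// A LocalDateTime is the wall-clock reading in the given zone; the zone itself is dropped.
fun toLocal(instant: Instant, zone: ZoneId): LocalDateTime = LocalDateTime.ofInstant(instant, zone)

// Going back from a ZonedDateTime recovers the original Instant.
fun roundTrip(instant: Instant, zone: ZoneId): Instant = toZoned(instant, zone).toInstant()
```

Storing Instant internally would keep one unambiguous value per cell, with the zone supplied only when a column is displayed or converted.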

Concerning your use cases: most of them make total sense to me. However, I struggle with the multiplication (the last one in your list).

By convention, the difference of two DateCols could be either a typed representation (such as Period/Duration) or simply an int/long (the difference in milliseconds). I'm not sure which one is more intuitive.
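To make the two conventions concrete, here is a rough sketch that models a date column as a plain `List<Instant>`. This is an assumption for illustration, not how krangl stores columns, and `typedDiff` and `millisDiff` are hypothetical names:

```kotlin
import java.time.Duration
import java.time.Instant

// Typed convention: each difference is a Duration, which keeps unit information
// and supports further arithmetic such as plus, minus and multipliedBy.
fun typedDiff(start: List<Instant>, end: List<Instant>): List<Duration> =
    start.zip(end) { s, e -> Duration.between(s, e) }

// Plain convention: each difference is a Long holding milliseconds,
// which plugs directly into ordinary numeric column operations.
fun millisDiff(start: List<Instant>, end: List<Instant>): List<Long> =
    start.zip(end) { s, e -> Duration.between(s, e).toMillis() }
```

For a start of `2020-01-01T00:00:00Z` and an end of `2020-01-01T01:30:00Z`, the typed variant gives `PT1H30M` and the plain variant gives `5400000`. The typed result can always be reduced to a number later, while the reverse loses the unit.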

What do you think?

alphaho commented 4 years ago

Great to know it's already on the roadmap!

I agree that Instant would be the best choice to use internally in a DataCol for representing time in general. But we may also need a few more typed DataCols to support Duration, LocalTime, etc., so that we can provide a better out-of-the-box experience.

After using krangl for over a month, I've found that I often need to do quite a lot of type conversion myself before manipulating the data. So for the last use case (the multiplication), I think it would be much better to favor a typed representation and require less work from the user to get the job done. But I do agree that it may not scale well, as we may need to support many types and many different operations on each type.

holgerbrandl commented 4 years ago

I guess that with more types being added, the API could use an overhaul to support a more generic column type provider that implements all basic operations. This would, for example, allow users to register their own types for improved convenience. However, I'm not yet sure how to implement such a feature.
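One possible shape for such a provider, purely as a sketch: `ColumnTypeProvider`, `ColumnTypes` and `InstantProvider` are hypothetical names that do not exist in krangl, and the set of "basic operations" is reduced here to parsing and comparison.

```kotlin
import java.time.Instant

// Hypothetical interface (not part of krangl): bundles the basic operations for one value type.
interface ColumnTypeProvider<T : Any> {
    val typeName: String
    fun parse(raw: String): T?      // null when the raw value cannot be parsed
    fun compare(a: T, b: T): Int    // ordering used for sorting
}

// Hypothetical registry in which users could register providers for their own types.
object ColumnTypes {
    private val providers = mutableMapOf<String, ColumnTypeProvider<*>>()

    fun register(provider: ColumnTypeProvider<*>) {
        providers[provider.typeName] = provider
    }

    fun lookup(typeName: String): ColumnTypeProvider<*>? = providers[typeName]
}

// Example provider for Instant, backed entirely by java.time.
object InstantProvider : ColumnTypeProvider<Instant> {
    override val typeName = "Instant"
    override fun parse(raw: String): Instant? = runCatching { Instant.parse(raw) }.getOrNull()
    override fun compare(a: Instant, b: Instant): Int = a.compareTo(b)
}
```

After `ColumnTypes.register(InstantProvider)`, a reader or sorter could call `ColumnTypes.lookup("Instant")` instead of hard-coding each supported type. How this would integrate with the existing column classes remains open.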

Regarding the type conversions: I agree it's not as straightforward as what I'm used to from R, for example. On the one hand, the API should be typed enough to provide sensible completion; on the other hand, too much typing requires casting in many situations. Ideas about how to solve this more elegantly are welcome.