koaning / clumper

A small Python library that clumps lists of data together.
https://koaning.github.io/clumper/
MIT License
147 stars 15 forks

Datetime utilities #82

Open koaning opened 3 years ago

koaning commented 3 years ago

It would be nice if we could group-by day/week/hour given a timestamp.

We should first discuss a proper API before making an implementation but this would really be a nice feature.

koaning commented 3 years ago

We could add an optional dependency and an optional helper for datetime parsing using datefinder. Given that we have dates in a proper ISO format, it would be nice to have some other helpers to aid in rounding. Something like:

clump.mutate(round_s=round_dt(key='dt', by='second'))

We could have variants of it too.

round_dt(key='dt', by='minute')
round_dt(key='dt', by='hour')
round_dt(key='dt', by='day')
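A minimal sketch of what such a `round_dt` helper could look like using only the standard library. The name, signature, and flooring behaviour are assumptions taken from the proposal above, not an existing clumper API:

```python
from datetime import datetime

def round_dt(key, by):
    """Hypothetical helper: return a function for .mutate() that floors
    the ISO-formatted datetime string stored under `key` to the given unit."""
    units = ["year", "month", "day", "hour", "minute", "second"]
    defaults = [1, 1, 1, 0, 0, 0]  # month/day floor to 1, time parts to 0

    def _round(d):
        dt = datetime.fromisoformat(d[key])
        parts = [dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second]
        keep = units.index(by) + 1  # keep everything up to and including `by`
        return datetime(*(parts[:keep] + defaults[keep:]))

    return _round
```

With a sketch like this, `clump.mutate(round_s=round_dt(key='dt', by='second'))` would add a floored datetime per row.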

samarpan-rai commented 3 years ago

This sounds like an interesting feature to me as well! How about we take inspiration from the Grouper object in pandas?

The responsibility of Grouper would be to find and mark the rows in the data that match the given frequency. This marker information can then be used by .agg(...) to compute the aggregations.

I can imagine its use like this:

from clumper import Clumper, Grouper

sensor_data = [
    {'date_time':'2020-01-01 19:01','temperature':30, 'humidity':21},
    {'date_time':'2020-01-01 19:59','temperature':25, 'humidity':22},
    {'date_time':'2020-01-01 20:01','temperature':25, 'humidity':40},
    {'date_time':'2020-01-01 20:59','temperature':20, 'humidity':22},
]

clump = Clumper(sensor_data)

hourly_average = (
    clump
    .group_by(Grouper(key='date_time',freq='H')) # To start, frequency could be hourly, daily, or monthly
    .agg(
        avg_temperature = ('temperature','mean'),
        avg_humidity = ('humidity','mean')
        )

)

hourly_average.collect() 
"""
{'date_time':'2020-01-01 19:00','avg_temperature':27.5, 'avg_humidity':21.5},
{'date_time':'2020-01-01 20:00','avg_temperature':20, 'avg_humidity':31}
""
koaning commented 3 years ago

What I'm proposing here is certainly similar to the Grouper, but I think I'm going for something more minimal. I'm also not 100% sure that I like their abbreviation scheme. For example, freq="M": is that month or minute?

samarpan-rai commented 3 years ago

Do you have an example in mind? Indeed, the abbreviations can be confusing on the surface, but shouldn't the documentation make them clear? The pandas docs explain what each abbreviation means, so I am not against using the same scheme. People may actually already be familiar with the abbreviations, so it would be user-friendly, don't you think?

koaning commented 3 years ago

The doc makes it clear, but a user would need to leave the notebook and find that page in order to move on. I'd prefer to just use plain words. I also wouldn't want to assume that users of this package are familiar with pandas. There are no dependencies on the pydata stack and the API is more aligned with dplyr than pandas. It's my understanding that there are folks in infosec who like this tool because it is so lightweight. It seems that because pandas is so dependency-heavy, it's not universally appreciated.

samarpan-rai commented 3 years ago

Meta: maybe we should chat about this instead of discussing it in the issue, to keep things clear?

I agree that it is lightweight compared to pandas. However, in this case I think using abbreviations makes more sense, because we might be limited in how many frequencies we can force users to spell out. For example, there is not just a monthly frequency but also derivatives like month-start and month-end frequency. The same holds for quarterly and yearly. It would be difficult to write them all down in words. Therefore, my question is: which frequencies exactly are we going to support? If it's just second, minute, month, and year, then your approach is better, but as soon as it gets more complicated it might become difficult to use.

koaning commented 3 years ago

My preferred attitude here is only to start supporting extra features if there is a strong need for it. I added this issue because I had a need for basic rounding. It's a slippery slope to start supporting elaborate features here because it's going to end up in a lot of maintenance.

In my mind, second, minute, hour, day, month, year should just about cover 80% of use-cases. For the other 20%, I'd be fine with requiring the user to customise something themselves using a tool like arrow. This is mainly to keep any technical debt at bay and to prevent this library from requiring dependencies.

That is, unless I've missed a cool trick that is implemented in the standard datetime submodule.

koaning commented 3 years ago

Now that I am thinking about it, maybe this is a more consistent API:

clump.mutate(rounded_date = lambda d: round_dt_str(d['dt'], by='hour'))

The benefits:

  1. This keeps the API functional/less magic.
  2. By calling it round_dt_str we don't imply that the object going in needs to be a datetime object; it can be a string instead.
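A hedged sketch of what round_dt_str might look like with the standard library. The name and flooring behaviour follow the proposal above; the string-in/string-out contract and output format are assumptions:

```python
from datetime import datetime

def round_dt_str(dt_str, by):
    """Hypothetical helper: floor an ISO-formatted datetime string to the
    given unit and return it as a string again."""
    units = ["year", "month", "day", "hour", "minute", "second"]
    defaults = [1, 1, 1, 0, 0, 0]  # month/day floor to 1, time parts to 0
    dt = datetime.fromisoformat(dt_str)
    parts = [dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second]
    keep = units.index(by) + 1  # keep everything up to and including `by`
    return datetime(*(parts[:keep] + defaults[keep:])).isoformat(sep=" ")
```

Because it operates on plain strings, it slots into a lambda passed to mutate without the library ever needing a datetime dependency.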

samarpan-rai commented 3 years ago

> My preferred attitude here is only to start supporting extra features if there is a strong need for it. I added this issue because I had a need for basic rounding. It's a slippery slope to start supporting elaborate features here because it's going to end up in a lot of maintenance.

Fair point!

> In my mind, second, minute, hour, day, month, year should just about cover 80% of use-cases. For the other 20%, I'd be fine with requiring the user to customise something themselves using a tool like arrow. This is mainly to keep any technical debt at bay and to prevent this library from requiring dependencies.

> That is, unless I've missed a cool trick that is implemented in the standard datetime submodule.

As far as I know, datetime doesn't have automatic format detection like datefinder. So does this mean you agree to have datefinder as a dependency?

samarpan-rai commented 3 years ago

> Now that I am thinking about it, maybe this is a more consistent API:
>
> clump.mutate(rounded_date = lambda d: round_dt_str(d['dt'], by='hour'))
>
> The benefits:
>
> 1. This keeps the API functional/less magic.
>
> 2. By calling it `round_dt_str` we don't imply that the object going in needs to be a datetime object; it can be a string instead.

Thanks for the example. It helps me better visualize what you want. However, having this function is not enough, right? We also have to create groups based on these "rounded" data points, which can then be supplied to the agg function. Is this a correct assumption?

koaning commented 3 years ago

> As far as I know, datetime doesn't have automatic format detection like datefinder. So does this mean you agree to have datefinder as a dependency?

@samarpan-rai if we add such a utility I'd only consider it as an optional dependency, but for the simple case of datetime-string rounding, I think we won't need it. Again, I really prefer omitting any dependencies where possible. The GitHub Actions currently run in 8s, which is something we can be proud of :)

> Thanks for the example. It helps me better visualize what you want. However, having this function is not enough, right? We also have to create groups based on these "rounded" data points, which can then be supplied to the agg function. Is this a correct assumption?

What's wrong with this pattern:

(clump
  .mutate(grp = lambda d: round_dt_str(d['date'], 'hour'))
  .group_by("grp"))
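To illustrate the semantics of this pattern without clumper itself, here is a rough stdlib-only equivalent. The round_dt_str helper is the hypothetical function discussed above, sketched here as a floor-to-hour; the dict comprehensions mimic what mutate, group_by, and agg would do:

```python
from datetime import datetime
from itertools import groupby

def round_dt_str(dt_str, by="hour"):
    # Hypothetical helper: floor an ISO datetime string to the hour.
    dt = datetime.fromisoformat(dt_str)
    return dt.replace(minute=0, second=0, microsecond=0).isoformat(sep=" ")

sensor_data = [
    {"date": "2020-01-01 19:01", "temperature": 30},
    {"date": "2020-01-01 19:59", "temperature": 25},
    {"date": "2020-01-01 20:01", "temperature": 25},
    {"date": "2020-01-01 20:59", "temperature": 20},
]

# "mutate": add a grp key per row
rows = [{**d, "grp": round_dt_str(d["date"])} for d in sensor_data]

# "group_by" + "agg": average temperature per hourly group
# (itertools.groupby needs equal keys to be consecutive, as they are here)
hourly = {
    grp: sum(r["temperature"] for r in grp_rows) / len(grp_rows)
    for grp, grp_rows in ((g, list(rs)) for g, rs in groupby(rows, key=lambda r: r["grp"]))
}
```

The point is that a plain string-rounding function is enough: grouping falls out of the existing group_by verb once the rounded value exists as a key.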

samarpan-rai commented 3 years ago

> As far as I know, datetime doesn't have automatic format detection like datefinder. So does this mean you agree to have datefinder as a dependency?

> @samarpan-rai if we add such a utility I'd only consider it as an optional dependency, but for the simple case of datetime-string rounding, I think we won't need it. Again, I really prefer omitting any dependencies where possible. The GitHub Actions currently run in 8s, which is something we can be proud of :)

Indeed, it's something to be proud of. We will have to force the user to tell us the datetime format, though. Are you okay with that?

> Thanks for the example. It helps me better visualize what you want. However, having this function is not enough, right? We also have to create groups based on these "rounded" data points, which can then be supplied to the agg function. Is this a correct assumption?

> What's wrong with this pattern:
>
> (clump
>   .mutate(grp = lambda d: round_dt_str(d['date'], 'hour'))
>   .group_by("grp"))

Well, in your case, making the group requires the additional step of first rounding it. Should we abstract this away by detecting a datetime key, or should we let the user explicitly spell out this two-step process? The former might be easier to use and the latter is easier to implement.

koaning commented 3 years ago

> The former might be easier to use and the latter is easier to implement.

Suppose we would like to implement something like a grouper; we still need a name for the groups we create. Pandas has an index that allows for unnamed columns, but that won't work here. Instead, we could do something like this:

(clump
  .group_by(grp = lambda d: round_dt_str(d['date'], 'hour')))

To me, this feels like the "cleanest" way to implement a grouper-kind-of-thing.

But this also feels like an anti-pattern. Currently we have mutate to add columns to a clumper. Whenever a user thinks "gotta add a column", we'd like them to think of mutate immediately. By allowing group_by to also add columns, we would be adding exceptions to this general pattern. I'm less worried about difficulty of use; I'm more worried about difficulty of conceptualisation.

We might still explore this option, but I would need some convincing from a group of users who can give me clear examples where this is a major pain. This library isn't meant for performance; it's meant to help clump JSON data in a functional way.

So, to keep things simple, I'd prefer to leave the group_by verb unchanged for now.

samarpan-rai commented 3 years ago

> The former might be easier to use and the latter is easier to implement.

> Suppose we would like to implement something like a grouper; we still need a name for the groups we create. Pandas has an index that allows for unnamed columns, but that won't work here. Instead, we could do something like this:
>
> (clump
>   .group_by(grp = lambda d: round_dt_str(d['date'], 'hour')))
>
> To me, this feels like the "cleanest" way to implement a grouper-kind-of-thing.
>
> But this also feels like an anti-pattern. Currently we have mutate to add columns to a clumper. Whenever a user thinks "gotta add a column", we'd like them to think of mutate immediately. By allowing group_by to also add columns, we would be adding exceptions to this general pattern. I'm less worried about difficulty of use; I'm more worried about difficulty of conceptualisation.

That's a fair point. Being able to pair a concept with a single function is quite important.

> We might still explore this option, but I would need some convincing from a group of users who can give me clear examples where this is a major pain. This library isn't meant for performance; it's meant to help clump JSON data in a functional way.

> So, to keep things simple, I'd prefer to leave the group_by verb unchanged for now.

Agreed. So shall I start implementing the round_dt_str function then?

koaning commented 3 years ago

Sure thing! Go for it :)