Open yohplala opened 2 years ago
@JovanVeljanoski what do you think?
The request makes sense to me. I don't know what the API should look like yet, but having a reference point does make sense. Especially given this: https://github.com/vaexio/vaex/issues/408#issuecomment-694674972 :)
Dunno how we are with time to take this one, but that is a separate story.
Here's a quick modification that seems to work in the limited testing I've done...
import numpy as np
import vaex


class AnchoredBinnerTime(vaex.BinnerTime):
    def __init__(self, expression, resolution='W', df=None, every=1, start_anchor=None):
        self.resolution = resolution
        self.expression = expression
        self.df = df or expression.ds
        self.every = every
        self.sort_indices = None
        # make sure it's an expression
        self.expression = self.df[str(self.expression)]
        self.label = self.expression._label
        self.tmin, self.tmax = self.df[str(self.expression)].minmax()
        self.resolution_type = 'M8[%s]' % self.resolution
        # if an anchor is given, the first bin starts there instead of at the first timestamp
        if start_anchor is not None:
            self.tmin = np.datetime64(start_anchor).astype(self.resolution_type)
        dt = (self.tmax.astype(self.resolution_type) - self.tmin.astype(self.resolution_type))
        self.N = (dt.astype(int).item() + 1)
        # divide by every, and round up
        self.N = (self.N + every - 1) // every
        self.bin_values = np.arange(self.tmin.astype(self.resolution_type), self.tmax.astype(self.resolution_type)+1, every)
        self._promise = vaex.promise.Promise.fulfilled(None)

# Testing resampling with '4h' binning with a chosen start anchor.
vdf.groupby(AnchoredBinnerTime(vdf.ts, resolution='h', every=4, start_anchor='2022-03-01 00:00'), agg={'sum': vaex.agg.sum("val")})
Out[101]:
  #  ts               sum
  0  2022-03-01 00      0
  1  2022-03-01 04      1
  2  2022-03-01 08      5
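
For anyone checking the bin arithmetic in the snippet above, here is a minimal NumPy-only sketch of the same computation, with no vaex involved. The last timestamp (09:00) and the sample timestamps are made-up values chosen to be consistent with the output shown. The final bin-index line is only my illustration of how a timestamp maps onto an anchored bin, not necessarily how vaex does it internally.

```python
import numpy as np

resolution_type = "M8[h]"   # hourly resolution, as in resolution='h'
every = 4                   # 4-hour bins

# Chosen anchor (start_anchor) and a hypothetical last timestamp in the data.
tmin = np.datetime64("2022-03-01T00:00").astype(resolution_type)
tmax = np.datetime64("2022-03-01T09:00").astype(resolution_type)

# Number of resolution units covered, inclusive of both ends.
n_units = (tmax - tmin).astype(int).item() + 1
# Divide by `every` and round up to get the number of bins.
n_bins = (n_units + every - 1) // every
# Left edges of the bins, starting at the anchor.
bin_values = np.arange(tmin, tmax + 1, every)

# Illustration only: index of the anchored bin each timestamp falls into.
ts = np.array(["2022-03-01T03", "2022-03-01T05", "2022-03-01T09"], dtype=resolution_type)
bin_index = (ts - tmin).astype(int) // every

print(n_bins, bin_values, bin_index)
```

With these values the edges come out at 00, 04 and 08, matching the three rows of the groupby output.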
Description
Similar to the pandas origin parameter in the resample method, could vaex's BinnerTime offer an equivalent origin parameter? See the pandas documentation for resample.
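
To make the pandas behaviour concrete, here is a small sketch of resample with two different origin values. The series is made-up data (six hourly rows starting at 02:00), and the 4-hour bin width is just an example.

```python
import pandas as pd

# Hypothetical data: six hourly rows, 02:00 through 07:00 on 2022-03-01.
index = pd.date_range("2022-03-01 02:00", periods=6, freq=pd.Timedelta(hours=1))
s = pd.Series([1, 2, 3, 4, 5, 6], index=index)

width = pd.Timedelta(hours=4)

# origin="start": bins are anchored to the first timestamp of the series.
by_first_row = s.resample(width, origin="start").sum()

# origin=<timestamp>: bins are anchored to an explicit reference point (midnight here).
by_midnight = s.resample(width, origin=pd.Timestamp("2022-03-01 00:00")).sum()

print(by_first_row)
print(by_midnight)
```

The same rows end up in different bins depending on the anchor: bins start at 02:00 and 06:00 in the first case, and at 00:00 and 04:00 in the second, so the sums differ.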
Is your feature request related to a problem? Please describe.
When conducting a groupby using BinnerTime, the anchor currently used for the bins is the timestamp of the first row. But what if I want my bins anchored to midnight, so that the first bin starts at midnight? In that case, the result in the sum column would not be the same.

Additional context
A workaround I see is to reuse some pandas functions to modify the timestamp of the first row, setting it to the timestamp at which the user wants the first bin to start.
... well, I thought it would work, but vaex is moving the start one hour earlier. Hmmm... what is this mystery about?
First timestamp is starting at 11pm on 28th of February?