Closed kv-gits closed 3 years ago
The groupby doc can be accessed from the README, but here is the direct link: https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/groupby.html
Currently you cannot group by different columns with different aggregators in one shot. You have to do it individually. I was planning to make groupby more flexible, like your example above, but I have to find time to do it.
There is also a bucketize() method with slightly different semantics: https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/bucketize.html
bucketize() seems good. Thanks!
@kv-gits, I redesigned bucketize() to be more generalized.
Please see the new interface in master (https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/bucketize.html)
For now I see a potential issue here. If the first index (time_t) value is not the start of an OHLC bar, the first combined bar will be corrupted. I use the following code to find the start time of the bar that contains a given time_t value:
time_t getCandleIntraDay(time_t ts, int _period, int shift) {
    TimeStruct t = TimeStructFromTs(ts, false);
    t.hour = 0;
    t.min = 0;
    t.sec = 0;
    time_t sday = t.getEpoch();
    if (sday == ts) return ts;
    bool search = true;
    int periodsec = _period * 60;
    int n = 0;
    time_t res;
    while (search) {
        uint64_t t1 = sday + n * periodsec;
        uint64_t t2 = sday + (n + 1) * periodsec;
        if (t1 <= ts && t2 > ts) {
            res = t1 + shift * periodsec;
            search = false;
            break;
        }
        n++;
    }
    return res;
}
Once I learn the inner architecture of DataFrame, I could provide it as a visitor for time_t data.
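The loop above can also be written in closed form with integer division. The following is only a minimal sketch under stated assumptions: `candle_start` is a hypothetical name, the day boundary is taken in UTC (the original relies on `TimeStructFromTs`, whose time-zone handling is not shown here), and `ts` is non-negative with a positive period.

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical helper, not part of DataFrame.
// Returns the start time of the intraday bar that contains ts, moved by
// `shift` whole periods. Assumes UTC day boundaries, ts >= 0, period_min > 0.
std::time_t candle_start(std::time_t ts, int period_min, int shift = 0) {
    const std::int64_t day_sec = 86400;
    const std::int64_t period_sec = static_cast<std::int64_t>(period_min) * 60;
    const std::int64_t t = static_cast<std::int64_t>(ts);

    const std::int64_t sday = t - (t % day_sec);    // midnight (UTC) of that day
    const std::int64_t n = (t - sday) / period_sec; // index of the bar within the day

    return static_cast<std::time_t>(sday + (n + shift) * period_sec);
}
```

One difference from the loop version: this sketch applies `shift` uniformly, including when `ts` falls exactly on midnight, whereas the loop version returns `ts` unshifted in that case.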
If you are referring to the bucketize logic, the index of each bucket is the last index in that bucket. This is explained in the docs.
Yes, I understand. The issue still applies in this case too. For now I handle it by filling the data in the DataFrame with my own function that manages the datetimes. I don't know how to integrate it into the DataFrame architecture yet. I will open a PR when I solve it.
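To illustrate the alignment concern outside of DataFrame, here is a rough standalone sketch that combines small bars into larger ones keyed by an aligned bar start, so a partial first bar still lands in the right bucket. It does not use the DataFrame API. `Bar`, `bar_start`, and `resample_ohlc` are hypothetical names, timestamps are assumed to be non-negative and sorted ascending, and bars are aligned to multiples of the period since the epoch (which matches day-aligned bars in UTC when the period divides 86400).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical types and helpers for illustration only (requires C++17).
struct Bar { double open, high, low, close; };

// Floor ts to a multiple of period_sec. Assumes ts >= 0 and period_sec > 0.
inline std::int64_t bar_start(std::int64_t ts, std::int64_t period_sec) {
    return ts - (ts % period_sec);
}

// Combine input bars into larger bars keyed by their aligned start time.
// Assumes ts is sorted ascending, so the first bar seen for a key supplies
// the open and the last one supplies the close.
std::map<std::int64_t, Bar>
resample_ohlc(const std::vector<std::int64_t>& ts,
              const std::vector<Bar>& bars,
              std::int64_t period_sec) {
    std::map<std::int64_t, Bar> out;
    const std::size_t n = std::min(ts.size(), bars.size());
    for (std::size_t i = 0; i < n; ++i) {
        const std::int64_t key = bar_start(ts[i], period_sec);
        auto [it, inserted] = out.try_emplace(key, bars[i]);
        if (!inserted) {
            Bar& b = it->second;
            b.high = std::max(b.high, bars[i].high);
            b.low = std::min(b.low, bars[i].low);
            b.close = bars[i].close;
        }
    }
    return out;
}
```

Because the key is derived from each timestamp rather than from the position of the first row, data that starts in the middle of a bar produces a shorter first bar instead of shifting every later bar.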
A new, more generalized groupby was also added to master.
Maybe it would be a good idea to also accept a custom visitor or a simple functor for the index column when grouping? That would make the grouping operation very flexible.
In groupby(), one of the grouping columns, or the only grouping column, can be the index column, if I understand your comment correctly. The docs are here: https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/groupby.html
Now:
auto fut1 = df.groupby1_async<unsigned long>(DF_INDEX_COL_NAME,
                std::make_tuple("str_col", "sum_str", SumVisitor<std::string>()),
                std::make_tuple("xint_col", "max_int", MaxVisitor<int>()),
                std::make_tuple("xint_col", "min_int", MinVisitor<int>()),
                std::make_tuple("dbl_col", "sum_dbl", SumVisitor<double>()));
The idea is something like this:
auto fut1 = df.groupby1_async<unsigned long>(std::make_tuple(INDEX_COL_NAME, indexModifyVisitor<IndexType>()),
                std::make_tuple("str_col", "sum_str", SumVisitor<std::string>()),
                std::make_tuple("xint_col", "max_int", MaxVisitor<int>()),
                std::make_tuple("xint_col", "min_int", MinVisitor<int>()),
                std::make_tuple("dbl_col", "sum_dbl", SumVisitor<double>()));
indexModifyVisitor would set the index value in the resulting DataFrame.
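The proposal amounts to grouping by a key computed from the index rather than by the raw index value. Here is a standalone sketch of that idea in plain C++. It is not the DataFrame interface: `group_by_key` is a hypothetical helper, and the key functor plays the role that indexModifyVisitor would play in the call above.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <type_traits>
#include <vector>

// Hypothetical helper for illustration only (requires C++17).
// Groups values by key_fn(index[i]) and folds each group with agg,
// starting from init. The computed key becomes the index of the result.
template <typename I, typename V, typename KeyFn, typename Agg>
auto group_by_key(const std::vector<I>& index,
                  const std::vector<V>& values,
                  KeyFn key_fn, Agg agg, V init) {
    using K = std::decay_t<decltype(key_fn(std::declval<const I&>()))>;
    std::map<K, V> out;
    const std::size_t n = std::min(index.size(), values.size());
    for (std::size_t i = 0; i < n; ++i) {
        // try_emplace leaves an existing group untouched and seeds a new one.
        auto [it, inserted] = out.try_emplace(key_fn(index[i]), init);
        (void)inserted;
        it->second = agg(it->second, values[i]);
    }
    return out;
}
```

For example, passing `[](long t) { return t - t % 60; }` as `key_fn` and an addition lambda as `agg` sums the values per 60-second bucket, with each bucket labeled by its start time.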
I see. It is not a bad idea. I have to find some time to implement this.
@kv-gits,
I just implemented a new bucketize() in master that lets you specify how the index is bucketized. When I get more time, I will do something similar for groupby().
Now, groupby() also allows you to specify a visitor for the resulting index column.
Looks good!
Pandas can scale timeframes this way
or
I found GroupbyAggregators.h here, but there are no examples in the docs. Is similar functionality already implemented? If not, I could try, but I need some direction on how to do it.