hosseinmoein / DataFrame

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
https://hosseinmoein.github.io/DataFrame/
BSD 3-Clause "New" or "Revised" License

Question about groupby OHLC data #104

Closed kv-gits closed 3 years ago

kv-gits commented 3 years ago

Pandas can resample timeframes this way:

df.groupby(dr5minute.asof).agg({'Low': lambda s: s.min(),
                                'High': lambda s: s.max(),
                                'Open': lambda s: s.iloc[0],
                                'Close': lambda s: s.iloc[-1],
                                'Volume': lambda s: s.sum()})

or

ohlc = {
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum'
}

df = data.resample('5min', base=15).apply(ohlc)
df.head()

I found GroupbyAggregators.h here, but there are no examples in the docs. Is similar functionality already implemented? If not, I could try to implement it, but I would need some direction on how to do it.
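For reference, the aggregation that the pandas snippets above perform can be sketched in plain C++ without any DataFrame API. This is only an illustration of the requested behavior (first open, max high, min low, last close, summed volume per time bucket). The `Bar` struct and `resample_ohlc` function are hypothetical names, and the sketch assumes non-negative UNIX timestamps in seconds, input sorted by time, and buckets aligned to multiples of the period since the epoch.

```cpp
#include <cstdint>
#include <ctime>
#include <map>
#include <vector>

// Hypothetical input record: one bar (or tick) with a UNIX timestamp in seconds.
struct Bar {
    std::time_t ts;
    double open, high, low, close;
    std::uint64_t volume;
};

// Aggregates time-sorted bars into coarser bars of period_sec seconds.
// Each output bar is keyed by the start of its bucket (ts floored to the period).
// Assumes ts >= 0 and period_sec > 0.
std::vector<Bar> resample_ohlc(const std::vector<Bar>& bars, std::time_t period_sec) {
    std::map<std::time_t, Bar> buckets;
    for (const Bar& b : bars) {
        const std::time_t key = b.ts - (b.ts % period_sec);
        auto [it, inserted] = buckets.try_emplace(key, b);
        Bar& out = it->second;
        if (inserted) {
            out.ts = key;  // the first bar in a bucket supplies the open
            continue;
        }
        if (b.high > out.high) out.high = b.high;  // max of highs
        if (b.low < out.low) out.low = b.low;      // min of lows
        out.close = b.close;                       // last close wins
        out.volume += b.volume;                    // sum of volumes
    }
    std::vector<Bar> result;
    result.reserve(buckets.size());
    for (const auto& kv : buckets) result.push_back(kv.second);
    return result;
}
```

Calling `resample_ohlc(bars, 300)` on one-minute bars would produce five-minute bars under these assumptions.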

hosseinmoein commented 3 years ago

The groupby documentation is linked from the README. Here is the direct link: https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/groupby.html

Currently you cannot group different columns with different aggregators in one shot; you have to do each one individually. I was planning to make groupby more flexible, like your example above, but I have to find time to do it.

There is also a bucketize() method with slightly different semantics: https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/bucketize.html

kv-gits commented 3 years ago

bucketize() seems good. Thanks!

hosseinmoein commented 3 years ago

@kv-gits, I redesigned bucketize() to be more general. Please see the new interface in master (https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/bucketize.html).

kv-gits commented 3 years ago

I see a potential issue here. If the first index (time_t) value is not the start of a bar in the OHLC data, the first combined bar will be corrupted. I use the following code to find the start time of the bar that contains a given time_t value:

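// Returns the start time of the intraday bar that contains ts.
// _period is the bar length in minutes; shift moves the result by whole bars.
// TimeStruct and TimeStructFromTs are the poster's own helpers (not shown).
time_t getCandleIntraDay(time_t ts, int _period, int shift) {
    // Midnight of the day containing ts.
    TimeStruct t = TimeStructFromTs(ts, false);
    t.hour = 0;
    t.min = 0;
    t.sec = 0;
    const time_t sday = t.getEpoch();

    // Exactly at midnight: returned unchanged (shift is not applied here).
    if (sday == ts) return ts;

    const time_t periodsec = static_cast<time_t>(_period) * 60;

    // Index of the bar [sday + n * periodsec, sday + (n + 1) * periodsec)
    // that holds ts. Integer division finds n directly.
    const time_t n = (ts - sday) / periodsec;

    return sday + (n + shift) * periodsec;
}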

Once I learn the inner architecture of DataFrame, I could provide it as a visitor for time_t data.

hosseinmoein commented 3 years ago

If you are referring to the bucketize logic, the index of each bucket is the last index in that bucket. It is explained in the docs.
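To make the labelling convention in this exchange concrete, here is a rough sketch in plain C++. It is not the library's implementation: `bucket_last_index` is a hypothetical helper, and it assumes that a bucket opens at the first value seen and closes once the span from that value reaches the interval. The actual rules are the ones described in the bucketize docs.

```cpp
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical helper, not part of DataFrame.
// Splits a sorted index into buckets that each span less than `interval`
// seconds, starting from the first value seen, and returns the LAST index
// value of every bucket as that bucket's label.
std::vector<std::time_t> bucket_last_index(const std::vector<std::time_t>& idx,
                                           std::time_t interval) {
    std::vector<std::time_t> labels;
    if (idx.empty()) return labels;

    std::time_t bucket_start = idx.front();
    for (std::size_t i = 1; i < idx.size(); ++i) {
        if (idx[i] - bucket_start >= interval) {
            labels.push_back(idx[i - 1]);  // close the bucket with its last value
            bucket_start = idx[i];         // the next bucket opens here
        }
    }
    labels.push_back(idx.back());  // the final, possibly partial, bucket
    return labels;
}
```

Under these assumptions, one-minute data starting at 130 with a 300-second interval yields a first bucket covering 130 through 370, not a clock-aligned five-minute window. That is the kind of misalignment raised earlier in the thread.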

kv-gits commented 3 years ago

Yes, I understand. The issue still applies in this case too. For now I am handling it by filling the data in the df with my own function that manages datetimes. I don't know how to integrate it into the DataFrame architecture yet. I will open a PR when I solve it.

hosseinmoein commented 3 years ago

A new, more general groupby was also added to master.

kv-gits commented 3 years ago

Maybe it would be a good idea to also accept a custom visitor or a simple functor for the index column when grouping? That would make the grouping operation very flexible.

hosseinmoein commented 3 years ago

In groupby(), one of the columns, or the only column, could be the index column, if I understand your comment correctly. The docs are here: https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/groupby.html

kv-gits commented 3 years ago

Currently:

auto    fut1 = df.groupby1_async<unsigned long>(DF_INDEX_COL_NAME,
                   std::make_tuple("str_col", "sum_str", SumVisitor<std::string>()),
                   std::make_tuple("xint_col", "max_int", MaxVisitor<int>()),
                   std::make_tuple("xint_col", "min_int", MinVisitor<int>()),
                   std::make_tuple("dbl_col", "sum_dbl", SumVisitor<double>()));

The idea is something like this:

auto    fut1 = df.groupby1_async<unsigned long>(std::make_tuple(DF_INDEX_COL_NAME, indexModifyVisitor<IndexType>()),
                   std::make_tuple("str_col", "sum_str", SumVisitor<std::string>()),
                   std::make_tuple("xint_col", "max_int", MaxVisitor<int>()),
                   std::make_tuple("xint_col", "min_int", MinVisitor<int>()),
                   std::make_tuple("dbl_col", "sum_dbl", SumVisitor<double>()));

indexModifyVisitor would set the index values in the resulting DataFrame.
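The proposal can be illustrated outside the library with a small generic function. This is a hypothetical sketch, not the DataFrame API: `group_by_index` and its parameters are invented for illustration. A functor maps each index value to a group key, the keys become the index of the result, and a binary aggregator folds the values in each group.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical sketch, not the DataFrame API.
// Groups `values` by key_fn(index[i]) and folds each group with `agg`.
// The keys of the returned map play the role of the new index column.
template <typename I, typename V, typename KeyFn, typename Agg>
std::map<I, V> group_by_index(const std::vector<I>& index,
                              const std::vector<V>& values,
                              KeyFn key_fn, Agg agg) {
    std::map<I, V> out;
    for (std::size_t i = 0; i < index.size() && i < values.size(); ++i) {
        const I key = key_fn(index[i]);
        auto it = out.find(key);
        if (it == out.end())
            out.emplace(key, values[i]);           // first value seeds the group
        else
            it->second = agg(it->second, values[i]);  // fold in later values
    }
    return out;
}
```

For example, passing a key functor such as `[](long t) { return t - t % 300; }` would group rows into five-minute windows keyed by the window start, which is the role the proposed index visitor would play.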

hosseinmoein commented 3 years ago

I see. It is not a bad idea. I have to find some time to implement this.

hosseinmoein commented 3 years ago

@kv-gits, I just implemented a new bucketize() in master that lets you specify how the index is bucketized. When I get more time, I will do something similar for groupby().

hosseinmoein commented 3 years ago

Now groupby() also allows you to specify a visitor for the resulting index column.

kv-gits commented 3 years ago

Looks good!