hosseinmoein / DataFrame

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
https://hosseinmoein.github.io/DataFrame/
BSD 3-Clause "New" or "Revised" License
2.41k stars 306 forks source link

Aggregate visitors can't be used in groupby #251

Closed JamieStM closed 1 year ago

JamieStM commented 1 year ago

Trialing this library, I found an issue with the groupby* functions: you can't pass an aggregate visitor, e.g. MedianVisitor, QuantileVisitor, ModeVisitor, MADVisitor, etc, as it will fail to compile. This is because the internal _load_groupby_data_* functions call the visitors with only two arguments (index and cell value), whereas the aggregate visitors require four arguments (first index, last index, first cell value, last cell value). These are pretty standard visitors that I would expect to be valid use cases for groupby.

v2.1.0 (from vcpkg), C++17, MSVC.

Reproducible example:

#include <DataFrame/DataFrame.h>
#include <DataFrame/DataFrameStatsVisitors.h>
#include <cstdlib>

int main()
{
    // Generate some data; doesn't really matter
    std::vector<int> indices(10 * 10);
    std::iota(indices.begin(), indices.end(), 0);
    std::vector<int> outer{};
    std::vector<int> inner{};
    for (int i = 0; i < 10; ++i) {
        for (int j = 0; j < 10; ++j) {
            outer.push_back(i);
            inner.push_back(std::rand() / ((i + 1) * 2));
        }
    }

    // Create df with int index and two int columns
    auto df = hmdf::StdDataFrame<int>{};
    df.load_index(std::move(indices));
    df.load_column("outer", std::move(outer));
    df.load_column("inner", std::move(inner));

    // Try to group by with median; compilation fails.
    auto groupedByMedian = df.groupby1<int>(
        "outer",
        hmdf::FirstVisitor<int, int>{},
        std::make_tuple("inner", "median", hmdf::MedianVisitor<int, int>{}));
}

From this, there's an error raised that directs me to here, specifically line 121:

https://github.com/hosseinmoein/DataFrame/blob/d62c8dd820d815b699666a70e22426beedac455c/include/DataFrame/Internals/DataFrame_standalone.tcc#L117-L126

Which is because visitor(int, int); isn't compatible with MedianVisitor::operator():

https://github.com/hosseinmoein/DataFrame/blob/d62c8dd820d815b699666a70e22426beedac455c/include/DataFrame/DataFrameStatsVisitors.h#L2521-L2541

hosseinmoein commented 1 year ago

Yes I know. That's on my todo list to enhance. The reason for this is to create that interface mandates copying data which is inefficient. But I will do it in the next couple of weeks.

hosseinmoein commented 1 year ago

Also, when I fix that you will need C++20 because master now needs C++20

hosseinmoein commented 1 year ago

This has been implemented in master