hosseinmoein / DataFrame

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
https://hosseinmoein.github.io/DataFrame/
BSD 3-Clause "New" or "Revised" License

how to get_data_by_sel by more than three columns #152

Closed wujinghe closed 2 years ago

wujinghe commented 2 years ago

How can I call get_data_by_sel with more than three columns? Currently, get_data_by_sel only accepts up to three columns as arguments.

I wanted to call get_view_by_sel several times to implement my requirement and then convert the view to a DataFrame. But I found that a view can't be converted to a DataFrame directly.

hosseinmoein commented 2 years ago

I can add get_data_by_sel and get_view_by_sel overloads for more columns in the next few days. Assigning a view to a DataFrame is not simple; it is a bit tricky. But there are ways to get around it. For example, if you ultimately need a DataFrame, why do you get a view in the first place?

wujinghe commented 2 years ago

I use a pipeline to process data. For example, I might filter in stage 1, calculate in stage 2, and send the result back to the client in stage 3. Every stage runs in a different thread. New data may be inserted into the DataFrame at the same time, and the memory of the vectors that back the DataFrame may be reallocated, so I can't use a view. On the other hand, I don't want to use locks.
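To illustrate the reallocation concern, here is a minimal single-threaded sketch using a plain std::vector<double> as a stand-in for one column. The helper names (next_push_reallocates, snapshot, handoff_then_grow) are hypothetical and not part of DataFrame. It relies only on the standard guarantee that a push_back on a vector whose size equals its capacity reallocates and invalidates every pointer, reference, and iterator into it, which is why a borrowed view is unsafe there while an owning copy is not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True when the next push_back has to reallocate. Per the C++ standard,
// that invalidates every pointer, reference, and iterator into the vector,
// including anything a non-owning view would be holding.
inline bool next_push_reallocates(const std::vector<double>& column) {
    return column.size() == column.capacity();
}

// Owning snapshot of rows [first, last). It has its own storage, so later
// inserts into `column` cannot affect it.
inline std::vector<double> snapshot(const std::vector<double>& column,
                                    std::size_t first, std::size_t last) {
    assert(first <= last && last <= column.size());
    const auto begin = column.begin() + static_cast<std::ptrdiff_t>(first);
    const auto end = column.begin() + static_cast<std::ptrdiff_t>(last);
    return std::vector<double>(begin, end);
}

// Simulates a stage hand-off: copy the current rows, then keep appending
// until the original column has reallocated at least once.
// Returns the copy, which still holds the values from before the inserts.
inline std::vector<double> handoff_then_grow(std::vector<double>& column) {
    std::vector<double> copy = snapshot(column, 0, column.size());
    bool reallocated = false;
    while (!reallocated) {
        reallocated = next_push_reallocates(column);
        column.push_back(0.0);
    }
    return copy;
}
```

This only shows why an owning copy survives growth of the source. In a real multi-threaded pipeline, taking the copy itself still has to be ordered against concurrent inserts by some means, which this sketch does not address.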

hosseinmoein commented 2 years ago

@wujinghe, I just added get_data_by_sel() and get_view_by_sel() for up to 5 columns in master.

wujinghe commented 2 years ago

Thank you so much. In the meantime, I have used std::tuple to support more columns. The following is the test code, so you can check the interfaces.

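#include <cstdint>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>

#include <DataFrame/DataFrame.h>

// Declares a column tag type that carries the column's name and value type.
#define DECL_COL(COLUMN_NAME, C_TYPE) \
    struct COLUMN_NAME \
    { \
        constexpr static const char* name = {#COLUMN_NAME}; \
        using type = C_TYPE; \
    }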

struct MyDf
{
    DECL_COL(Col0, std::string);
    DECL_COL(Col1, int32_t);
    DECL_COL(Col2, int64_t);
    DECL_COL(Col3, double);
    DECL_COL(Col4, std::string);
};

void TestSchema()
{
    // Prepare initial data
    std::vector<unsigned long> col_idx = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::vector<MyDf::Col0::type> col0 =
        {"01", "02", "03", "04", "05", "06", "07", "08", "09", "10"};
    std::vector<MyDf::Col1::type> col1 = {1, 1, 2, 2, 3, 4, 5, 5, 5, 6};
    std::vector<MyDf::Col2::type> col2 =
        {11, 11, 21, 21, 31, 41, 51, 51, 51, 61};
    std::vector<MyDf::Col3::type> col3 =
        {0.11, 0.12, 0.21, 0.22, 0.30, 0.40, 0.51, 0.52, 0.53, 0.60};
    std::vector<MyDf::Col4::type> col4 =
        {"11", "11", "21", "21", "31", "41", "51", "51", "511", "61"};

    // Construct df
    hmdf::StdDataFrame<unsigned long> df;
    df.load_index(std::move(col_idx));
    df.load_column<MyDf::Col0::type>(MyDf::Col0::name, std::move(col0));
    df.load_column<MyDf::Col1::type>(MyDf::Col1::name, std::move(col1));
    df.load_column<MyDf::Col2::type>(MyDf::Col2::name, std::move(col2));
    df.load_column<MyDf::Col3::type>(MyDf::Col3::name, std::move(col3));
    df.load_column<MyDf::Col4::type>(MyDf::Col4::name, std::move(col4));

    // Print the full DataFrame as CSV
    df.write<std::ostream,
             MyDf::Col0::type,
             MyDf::Col1::type,
             MyDf::Col2::type,
             MyDf::Col3::type,
             MyDf::Col4::type>(std::cout, hmdf::io_format::csv2);
    std::cout << "=============================" << std::endl;

    // Select rows with a filter over four columns; the functor receives
    // the column values of each row packed into a std::tuple
    auto filter_functor = [](const unsigned long& index,
                             const std::tuple<
                             MyDf::Col1::type,
                             MyDf::Col2::type,
                             MyDf::Col3::type,
                             MyDf::Col4::type>& t)-> bool {
        return std::get<0>(t) > 2 and
            std::get<1>(t) > 31 and
            std::get<2>(t) > 0.5 and
            std::get<3>(t) != std::string("511");
    };
    std::tuple cols_for_filter {
        MyDf::Col1(), MyDf::Col2(), MyDf::Col3(), MyDf::Col4()};
    auto df_result = df.get_data_by_sel<
        decltype(cols_for_filter), decltype(filter_functor),
        MyDf::Col0::type,
        MyDf::Col1::type,
        MyDf::Col2::type,
        MyDf::Col3::type,
        MyDf::Col4::type>(cols_for_filter, filter_functor);

    // Print the filtered rows as CSV
    df_result.write<std::ostream,
                    MyDf::Col0::type,
                    MyDf::Col1::type,
                    MyDf::Col2::type,
                    MyDf::Col3::type,
                    MyDf::Col4::type>(std::cout, hmdf::io_format::csv2);
}

Does this look OK to you, or do you have any suggestions?

hosseinmoein commented 2 years ago

Looks good to me
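
For readers curious how a tuple-based selection like the one in the test code could work for any number of columns, here is a rough standalone sketch. It does not use DataFrame and is not the implementation from this thread: select_rows and select_thread_example are hypothetical names, columns are plain std::vector objects, and the result is a list of row positions rather than a new DataFrame. Each row's values are packed with std::tie and handed to the predicate together with the index value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <string>
#include <tuple>
#include <vector>

// Returns the positions of all rows for which pred(index_value, row_tuple)
// is true. Works for any number of columns; all columns must be as long as
// the index.
template <typename Pred, typename... Ts>
std::vector<std::size_t> select_rows(const std::vector<unsigned long>& index,
                                     Pred pred,
                                     const std::vector<Ts>&... columns) {
    for (std::size_t n : std::initializer_list<std::size_t>{columns.size()...}) {
        assert(n == index.size());
        (void)n;  // silence unused-variable warnings when NDEBUG is defined
    }

    std::vector<std::size_t> selected;
    for (std::size_t row = 0; row < index.size(); ++row) {
        // std::tie builds a tuple of const references to this row's values.
        if (pred(index[row], std::tie(columns[row]...))) {
            selected.push_back(row);
        }
    }
    return selected;
}

// Applies the filter from the thread to Col1..Col4 of the thread's test data.
std::vector<std::size_t> select_thread_example() {
    const std::vector<unsigned long> idx = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    const std::vector<std::int32_t> col1 = {1, 1, 2, 2, 3, 4, 5, 5, 5, 6};
    const std::vector<std::int64_t> col2 =
        {11, 11, 21, 21, 31, 41, 51, 51, 51, 61};
    const std::vector<double> col3 =
        {0.11, 0.12, 0.21, 0.22, 0.30, 0.40, 0.51, 0.52, 0.53, 0.60};
    const std::vector<std::string> col4 =
        {"11", "11", "21", "21", "31", "41", "51", "51", "511", "61"};

    // The tuple of references converts implicitly to this tuple of values,
    // which copies each row's values (including the string) per call.
    using Row = std::tuple<std::int32_t, std::int64_t, double, std::string>;
    auto filter = [](const unsigned long&, const Row& t) -> bool {
        return std::get<0>(t) > 2 && std::get<1>(t) > 31 &&
               std::get<2>(t) > 0.5 && std::get<3>(t) != "511";
    };
    return select_rows(idx, filter, col1, col2, col3, col4);
}
```

With the thread's data, select_thread_example() returns the zero-based positions 6, 7, and 9, which correspond to index values 7, 8, and 10. A predicate that takes a tuple of const references instead of values would avoid the per-row copies.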