hosseinmoein / DataFrame

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
https://hosseinmoein.github.io/DataFrame/
BSD 3-Clause "New" or "Revised" License

replace function - replace different column than input column #72

Closed HA6Bots closed 3 years ago

HA6Bots commented 4 years ago

Hi,

I want to replace values in one column according to values in another. So for example I have the dataframe:

index:     {0, 1, 2, 3, 4, 5, 6, 7}
int_col:   {1, 2, 3, 50, 6, 7, 8, 30}
int_col_2: {5, 5, 5, 5, 5, 5, 5, 5}

And I want to replace each int_col_2 value with 0 if the corresponding int_col value is odd, or 1 if it is even.

So my desired output would be:

index:     {0, 1, 2, 3, 4, 5, 6, 7}
int_col:   {1, 2, 3, 50, 6, 7, 8, 30}
int_col_2: {0, 1, 0, 1, 1, 0, 1, 1}

Is this possible with the replace function and a functor? I ask because it seems you can only pass one input column, which is also the column the functor modifies, e.g.:

struct ReplaceFunctor {

    bool operator() (const unsigned int& idx, int& value) {

        value *= 5;

        return (true);
    }

    size_t  count{ 0 };
};

ReplaceFunctor  functor;
dataframe.replace<int, ReplaceFunctor>("int_col", functor);

However, to perform my task, my ReplaceFunctor would have to look something like this:

struct ReplaceFunctor {

    bool operator() (const unsigned int& idx, int& value1, int& value2) {

        if (value1 % 2 == 0)
            value2 = 1;
        else
            value2 = 0;

        return (true);
    }

    size_t  count{ 0 };
};

Please let me know if there's a way to achieve this using the replace function (or any other function for that matter), or if there's an example of something like this that I missed in the docs. Thanks

hosseinmoein commented 4 years ago

There are a few ways of doing that:

First, every method that expects a functor also accepts a lambda. In any case, if you want to use the replace() or replace_async() methods, your functor/lambda must capture a const reference to the int_col vector. So, pass a reference to get_column<int>("int_col") to the constructor of your functor, or put it in the capture list of your lambda. Then, with a counter inside your functor, you have access to both columns.
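The capture-by-const-reference idea can be sketched without the library. Here `replace_like` is a hypothetical stand-in, not the DataFrame API; it only mimics a call that hands each row's index value and a mutable element to a callable. The sketch also assumes the index values coincide with row positions (0 through 7, as in the example above), since the lambda uses `idx` to look up the helper column.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a replace()-style call (NOT the DataFrame API):
// invokes f(index_value, element) on every row and stops if f returns false.
template <typename F>
void replace_like(const std::vector<unsigned int> &index,
                  std::vector<int> &column,
                  F &&f) {
    for (std::size_t i = 0; i < column.size() && i < index.size(); ++i) {
        if (!f(index[i], column[i]))
            break;
    }
}

// Returns int_col_2 rewritten to 1 where int_col is even and 0 where it is odd.
std::vector<int> mark_even(const std::vector<unsigned int> &index,
                           const std::vector<int> &int_col,
                           std::vector<int> int_col_2) {
    // Capture the helper column by const reference: no copy is made, and the
    // lambda cannot modify it.
    auto parity = [&int_col](const unsigned int &idx, int &value) -> bool {
        value = (int_col.at(idx) % 2 == 0) ? 1 : 0;
        return true;
    };

    replace_like(index, int_col_2, parity);
    return int_col_2;
}
```

With the real library, the same lambda shape would presumably be passed to replace() in place of the stand-in, with `int_col` obtained from get_column<int>("int_col").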

Alternatively, you can use the visit() or visit_async() methods, which can have up to 5 columns passed to them. Again, you need to write your own visitor functor/lambda, but in this case you do not need to capture the reference to int_col.
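A rough sketch of the visitor shape, again without the library: `visit_two_like` is a hypothetical stand-in, not the DataFrame API, and it is an assumption here that the second column reaches the visitor as a mutable reference. The exact visit() signature should be checked against the documentation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a two-column visit (NOT the DataFrame API):
// calls visitor(index_value, a, b) once per row.
template <typename V>
void visit_two_like(const std::vector<unsigned int> &index,
                    const std::vector<int> &col_a,
                    std::vector<int> &col_b,
                    V &visitor) {
    for (std::size_t i = 0;
         i < index.size() && i < col_a.size() && i < col_b.size();
         ++i) {
        visitor(index[i], col_a[i], col_b[i]);
    }
}

// Visitor that receives both columns directly, so nothing has to be captured.
struct ParityVisitor {
    void operator()(const unsigned int &, const int &value1, int &value2) {
        value2 = (value1 % 2 == 0) ? 1 : 0;  // 1 for even, 0 for odd
        ++count;                             // number of rows visited
    }

    std::size_t count{0};
};

// Returns int_col_2 rewritten to 1 where int_col is even and 0 where it is odd.
std::vector<int> mark_even_by_visit(const std::vector<unsigned int> &index,
                                    const std::vector<int> &int_col,
                                    std::vector<int> int_col_2) {
    ParityVisitor visitor;
    visit_two_like(index, int_col, int_col_2, visitor);
    return int_col_2;
}
```

Because each row's values arrive as arguments, this version does not depend on the index values matching row positions.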

Please see the examples in the table of features: https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrame.html

HA6Bots commented 4 years ago

Thanks for the pointer. Works great.

struct ReplaceFunctor {

    std::vector<int> helperColumn;

    ReplaceFunctor(std::vector<int>& chelper) {
        helperColumn = chelper;
    }

    bool operator() (const unsigned int& idx, int& value) {

        if (helperColumn.at(idx) % 2 == 0) {
            value = 1;
        } else {
            value = 0;
        }
        return (true);
    }

    size_t  count{ 0 };
};

...
ReplaceFunctor  functor(df.get_column<int>("int_col"));
df.replace<int, ReplaceFunctor>("int_col_2", functor);

hosseinmoein commented 4 years ago

One note: you are copying the data vector for int_col, which could potentially be very expensive. Also, constness is very important and is a big help to the compiler. I would write it like this:

struct ReplaceFunctor {

    const std::vector<int> &helperColumn;  // This is a very important change: a reference, not a copy

    ReplaceFunctor(const std::vector<int> &chelper) : helperColumn(chelper) {  }

    bool operator() (const unsigned int& idx, int& value) {

        value = ! (helperColumn.at(idx) % 2); // This change is optional. Your original code also works
        return (true);
    }

    size_t  count{ 0 };
};
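The difference between the copying and the referencing version can be checked directly. The sketch below uses two illustrative structs (`CopyingFunctor` and `ReferencingFunctor` are hypothetical names, reduced to the member and constructor) and compares the address of the underlying storage.

```cpp
#include <vector>

// Holds its own copy of the helper column: the constructor duplicates every element.
struct CopyingFunctor {
    std::vector<int> helperColumn;

    explicit CopyingFunctor(const std::vector<int> &chelper)
        : helperColumn(chelper) {}
};

// Holds only a const reference: nothing is copied, and the column is read-only.
struct ReferencingFunctor {
    const std::vector<int> &helperColumn;

    explicit ReferencingFunctor(const std::vector<int> &chelper)
        : helperColumn(chelper) {}
};

// True when both vectors are backed by the same buffer, i.e. no copy was made.
bool shares_storage(const std::vector<int> &source, const std::vector<int> &held) {
    return source.data() == held.data();
}
```

For a non-empty `source`, `shares_storage(source, CopyingFunctor(source).helperColumn)` is false, while the referencing version shares the buffer. The trade-off is lifetime: the referenced column must outlive the functor, otherwise the reference dangles.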

HA6Bots commented 4 years ago

Thanks for the advice. Still fairly new to C++, so I appreciate this.