jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0
3.55k stars 642 forks source link

Add method to parse string columns into list of string columns #986

Open eric-liuyd opened 3 years ago

eric-liuyd commented 3 years ago

One of the most needed functions in data analysis is the operation among multiple columns and generation of new columns (and rows). The need can be abstracted to a unified method like List<Column> columnOperate(List<Column>) to accomplish inter-column operation tasks. But now, I encountered an inter-column operation problem which cannot be solved efficiently and elegantly using only a few methods. In fact I found I couldn't solve this basic need using methods given in tablesaw. Details is provided below.

Let's say I have a Table named "df" with two columns "multi_ratio" and "amount".

          df          
 amount  |  multi_ratio  |
--------------------------
    100  |      0.8,0.2  |
    200  |      0.5,0.5  |

Now I need to 1) split every value in col "multi_ratio" into multiple values (e.g., convert "0.8,0.2" to List of 0.8, 0.2) , 2) amount multi_ratio (split) (e.g., 100 List of 0.8, 0.2 -> List of 80, 20) , 3) result of 2. expanded to multiple rows.

So the final result I need would be.

                      df2                       
 amount  |  multi_ratio_single  |  multiply_result  |
-----------------------------------------------------
    100  |                 0.8  |               80  |
    100  |                 0.2  |               20  |
    200  |                 0.5  |              100  |
    200  |                 0.5  |              100  |

To achieve this goal, I firstly make an empty copy of df and add empty columns.

Table df2 = df.emptyCopy();

df2.addColumns(
        StringColumn.create("multi_ratio_single"),
        DoubleColumn.create("result")
);

Then, I tried to operate on each row of df, to generate new rows and add them to df2, but it seems just not work.

Does anybody have suggests to make this happen efficiently?

lwhite1 commented 3 years ago

Hi, I'm sorry you're not seeing the functionality you expect. I agree that a parse function that splits string columns should be there and wouldn't be hard to write.

If I understand the problem, I think you can get the results you're looking for writing a simple loop:

    void parse(Table df, StringColumn sc) {
        DoubleColumn a = DoubleColumn.create("a", df.rowCount());
        DoubleColumn b = DoubleColumn.create("b", df.rowCount());
        df.addColumns(a, b);
        for (Row row: df) {
            String[] rats = row.getString("multi-ratios").split(",");
            row.setDouble("a", Double.parseDouble(rats[0]));
            row.setDouble("b", Double.parseDouble(rats[1]));
        }
        DoubleColumn amount = df.doubleColumn("amount");
        df.addColumns(amount.multiply(a), amount.multiply(b));
    }
eric-liuyd commented 3 years ago

Hi, I'm sorry you're not seeing the functionality you expect. I agree that a parse function that splits string columns should be there and wouldn't be hard to write.

If I understand the problem, I think you can get the results you're looking for writing a simple loop:

   void parse(Table df, StringColumn sc) {
       DoubleColumn a = DoubleColumn.create("a", df.rowCount());
       DoubleColumn b = DoubleColumn.create("b", df.rowCount());
       df.addColumns(a, b);
       for (Row row: df) {
           String[] rats = row.getString("multi-ratios").split(",");
           row.setDouble("a", Double.parseDouble(rats[0]));
           row.setDouble("b", Double.parseDouble(rats[1]));
       }
       DoubleColumn amount = df.doubleColumn("amount");
       df.addColumns(amount.multiply(a), amount.multiply(b));
   }

Thanks lwhite1. But there seems to be some misunderstanding I think. The parse method is not the core request. The most important thing I consider is how to expand one row to multiple rows. For example, expand String "0.6,0.3,0.1" in one row to multiple rows, each with value "0.6", "0.3", "0.1".

ccleva commented 3 years ago

Hi @eric-liuyd,

For this you can write a slightly different (double) loop:

    Table df = Table.create("orig", IntColumn.create("amount"), StringColumn.create("multi_ratio"));
    df.intColumn("amount").append(100).append(200);
    df.stringColumn("multi_ratio").append("0.8,0.2").append("0.5,0.5");

    Table df2 = Table.create("result", IntColumn.create("amount"), DoubleColumn.create("multi_ratio_single"));
    for(Row row : df) {
        for(String s : row.getString("multi_ratio").split(",")) {
            df2.intColumn("amount").append(row.getInt("amount"));
            df2.doubleColumn("multi_ratio_single").append(Double.parseDouble(s));
        }
    }

    df2.addColumns(df2.doubleColumn("multi_ratio_single").multiply(df2.intColumn("amount")).setName("multiply_result"));

Note that this code is not production grade ...

You probably can also implement it by extracting the columns as in @lwhite1 example and do some pivoting. Not sure which is better.

lwhite1 commented 3 years ago

Hi @eric-liuyd https://github.com/eric-liuyd,

I would look at the Table:melt() method, which implements the "tidy" melt operation, but I don't have time to confirm and write it up.

OTOH, if I were doing it myself, I might do something like what @ccleva suggests: basically copying into a new table in a loop.

Hope one of these works.

larry

On Mon, Sep 13, 2021 at 6:13 AM ccleva @.***> wrote:

Hi @eric-liuyd https://github.com/eric-liuyd,

For this you can write a slightly different (double) loop:

Table df = Table.create("orig", IntColumn.create("amount"), StringColumn.create("multi_ratio"));
df.intColumn("amount").append(100).append(200);
df.stringColumn("multi_ratio").append("0.8,0.2").append("0.5,0.5");

Table df2 = Table.create("result", IntColumn.create("amount"), DoubleColumn.create("multi_ratio_single"));
for(Row row : df) {
    for(String s : row.getString("multi_ratio").split(",")) {
        df2.intColumn("amount").append(row.getInt("amount"));
        df2.doubleColumn("multi_ratio_single").append(Double.parseDouble(s));
    }
}

df2.addColumns(df2.doubleColumn("multi_ratio_single").multiply(df2.intColumn("amount")).setName("multiply_result"));

Note that this code is not production grade ...

You probably can also implement it by extracting the columns as in @lwhite1 https://github.com/lwhite1 example and do some pivoting. Not sure which is better.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jtablesaw/tablesaw/issues/986#issuecomment-918042582, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2FPAUFQH33W73MAQHBVHTUBXFFTANCNFSM5DBABISA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

eric-liuyd commented 3 years ago

Hi @eric-liuyd https://github.com/eric-liuyd, I would look at the Table:melt() method, which implements the "tidy" melt operation, but I don't have time to confirm and write it up. OTOH, if I were doing it myself, I might do something like what @ccleva suggests: basically copying into a new table in a loop. Hope one of these works. larry On Mon, Sep 13, 2021 at 6:13 AM ccleva @.***> wrote: Hi @eric-liuyd https://github.com/eric-liuyd, For this you can write a slightly different (double) loop: Table df = Table.create("orig", IntColumn.create("amount"), StringColumn.create("multi_ratio")); df.intColumn("amount").append(100).append(200); df.stringColumn("multi_ratio").append("0.8,0.2").append("0.5,0.5"); Table df2 = Table.create("result", IntColumn.create("amount"), DoubleColumn.create("multi_ratio_single")); for(Row row : df) { for(String s : row.getString("multi_ratio").split(",")) { df2.intColumn("amount").append(row.getInt("amount")); df2.doubleColumn("multi_ratio_single").append(Double.parseDouble(s)); } } df2.addColumns(df2.doubleColumn("multi_ratio_single").multiply(df2.intColumn("amount")).setName("multiply_result")); Note that this code is not production grade ... You probably can also implement it by extracting the columns as in @lwhite1 https://github.com/lwhite1 example and do some pivoting. Not sure which is better. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#986 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2FPAUFQH33W73MAQHBVHTUBXFFTANCNFSM5DBABISA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

Ok. I strongly suggest you do it when you are available. Btw, could the method be implemented like Pandas DataFrame:apply() ? The apply() method can both operate on row axis and column axis.

eric-liuyd commented 3 years ago

Hi @eric-liuyd,

For this you can write a slightly different (double) loop:

    Table df = Table.create("orig", IntColumn.create("amount"), StringColumn.create("multi_ratio"));
    df.intColumn("amount").append(100).append(200);
    df.stringColumn("multi_ratio").append("0.8,0.2").append("0.5,0.5");

    Table df2 = Table.create("result", IntColumn.create("amount"), DoubleColumn.create("multi_ratio_single"));
    for(Row row : df) {
        for(String s : row.getString("multi_ratio").split(",")) {
            df2.intColumn("amount").append(row.getInt("amount"));
            df2.doubleColumn("multi_ratio_single").append(Double.parseDouble(s));
        }
    }

    df2.addColumns(df2.doubleColumn("multi_ratio_single").multiply(df2.intColumn("amount")).setName("multiply_result"));

Note that this code is not production grade ...

You probably can also implement it by extracting the columns as in @lwhite1 example and do some pivoting. Not sure which is better.

Hi ccleva. Thanks for your code. It really works. I didn't get the use of append() before.