go-gota / gota

Gota: DataFrames and data wrangling in Go (Golang)

Add Aggregation_FIRST and Aggregation_LAST options, and use an interface to support strings #218

Open Amnesiac9 opened 10 months ago

Amnesiac9 commented 10 months ago

_FIRST keeps the first value found in a series for that column. This lets you keep data from columns whose values may be mismatched within a group: after grouping by another column, the first value found is the one included.

Trying to provide the equivalent of this from pandas:

grouped_df = orders_df.groupby(['customer_id', 'ShortSKU']).agg({
    'Quantity': 'sum',
    'Name': 'first',
    'Address': 'first',
    'Address2': 'first',
    'City': 'first',
    'State': 'first',
    'Zip': 'first',
    'Country': 'first',
    'Phone': 'first',
    'Email': 'first',
    'Club Enrollment': 'first',
    'Account Type': 'first',
    'Order Count': 'count',
}).reset_index(drop=True)

In Go:

group := *df.GroupBy("CustomerId", "ShortSku")
if group.Err != nil {
	return nil, group.Err
}

agg_df := group.Aggregation(
	[]dataframe.AggregationType{
		dataframe.Aggregation_SUM,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_FIRST,
		dataframe.Aggregation_SUM,
		dataframe.Aggregation_COUNT,
	},
	[]string{
		"Quantity",
		"CustomerName",
		"Address1",
		"Address2",
		"City",
		"State",
		"Zip",
		"Country",
		"Phone",
		"Email",
		"ClubEnrollment",
		"AccountType",
		"Spend",
		"OrderCount",
	},
)
if agg_df.Err != nil {
	return nil, agg_df.Err
}
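To make the intended _FIRST semantics concrete, here is a minimal sketch in plain Go that does not use gota at all. The `row` type and the `firstByGroup` helper are hypothetical names made up for illustration; the point is only that, within each group, the value from the first row encountered is kept and later rows are ignored.

```go
package main

import "fmt"

// row is a hypothetical record: one grouping key and one value column.
type row struct {
	CustomerID string
	Name       string
}

// firstByGroup returns, for each CustomerID, the Name from the first row
// in which that key appears. Later rows with the same key are ignored.
func firstByGroup(rows []row) map[string]string {
	out := make(map[string]string)
	for _, r := range rows {
		if _, seen := out[r.CustomerID]; !seen {
			out[r.CustomerID] = r.Name
		}
	}
	return out
}

func main() {
	rows := []row{
		{"c1", "Ann Lee"},
		{"c2", "Bob Ray"},
		{"c1", "A. Lee"}, // mismatched spelling for the same customer
	}
	first := firstByGroup(rows)
	fmt.Println(first["c1"], first["c2"]) // prints "Ann Lee Bob Ray"
}
```

The mismatched "A. Lee" entry is dropped because "c1" already has a value, which is the behaviour the pandas `'first'` aggregation gives for non-null values.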

Tests need to be updated, and someone more familiar with the code base might see issues with this change. Let me know.

Also, I could not find where the name of the aggregation type is stored. The new type doesn't append a short suffix such as "_COUNT" to the original column name like the rest do; instead, it appends the full type name, including the numeric value of the aggregation type.
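One possible explanation for that suffix, offered as an assumption rather than something confirmed in this thread: if the column suffix comes from a `String()` method with a fixed name table (for example one generated by the `stringer` tool), any constant added without updating that table falls back to a "TypeName(number)" string. The sketch below reproduces that pattern with hypothetical names (`aggType`, `names`, `colName`); it is not gota's actual code.

```go
package main

import (
	"fmt"
	"strconv"
)

// aggType is a hypothetical stand-in for an integer-backed aggregation enum.
type aggType int

const (
	aggSum aggType = iota + 1
	aggCount
	aggFirst // added later, without updating the name table below
)

// names plays the role of a name table that was not regenerated
// after aggFirst was added.
var names = map[aggType]string{aggSum: "SUM", aggCount: "COUNT"}

// String returns the registered name, or a "type(number)" fallback
// in the same shape that stringer-generated code uses for unknown values.
func (a aggType) String() string {
	if n, ok := names[a]; ok {
		return n
	}
	return "aggType(" + strconv.Itoa(int(a)) + ")"
}

// colName builds an output column name from the source column and the type.
func colName(col string, a aggType) string {
	return col + "_" + a.String()
}

func main() {
	fmt.Println(colName("OrderCount", aggCount)) // prints "OrderCount_COUNT"
	fmt.Println(colName("City", aggFirst))       // prints "City_aggType(3)"
}
```

If the project does rely on generated string names, regenerating them after adding the new constants would likely restore the short suffixes; that would need to be checked against the repository.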

vyassamir11 commented 10 months ago

@Amnesiac9 can you please also add a _LAST aggregation? Thanks.

vyassamir11 commented 10 months ago

We could argue that _LAST is equivalent to reversing the order of the dataframe and applying _FIRST, but it would make sense to add it as a separate option for completeness.
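The equivalence mentioned here can be checked with a small sketch on a plain slice. The helpers `first`, `last`, and `reversed` are hypothetical names for illustration, not part of gota, and both `first` and `last` assume a non-empty slice.

```go
package main

import "fmt"

// first returns the first element; xs must be non-empty.
func first(xs []string) string { return xs[0] }

// last returns the final element; xs must be non-empty.
func last(xs []string) string { return xs[len(xs)-1] }

// reversed returns a copy of xs in reverse order, leaving xs untouched.
func reversed(xs []string) []string {
	out := make([]string, len(xs))
	for i, x := range xs {
		out[len(xs)-1-i] = x
	}
	return out
}

func main() {
	city := []string{"Austin", "Boston", "Chicago"}
	fmt.Println(last(city), first(reversed(city))) // prints "Chicago Chicago"
}
```

A dedicated _LAST option avoids the extra copy and reordering pass that the reverse-then-first approach needs.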

Amnesiac9 commented 10 months ago

Done. I still need to refactor the aggregation tests so that they cover both of these.