rocketlaunchr / dataframe-go

DataFrames for Go: For statistics, machine-learning, and data manipulation/exploration

Inconsistent behavior for Apply when using with ApplyDataFrameFn #43

Closed migscabral closed 3 years ago

migscabral commented 3 years ago

I'm trying to concatenate two columns in a dataframe and put it into a new column. The behavior is very inconsistent. Sometimes the strings are concatenated into the new column. Sometimes the value is just set to NaN.

In this run, the value for concat_contact_number in the resulting dataframe was correctly set to 97312345678. The map value for concat_contact_number also reflects the concatenated value.

Expected output:

$ go run main.go 
INFO[0000] In applyConcatDf: vals[contact_number_country_code]: 973 
INFO[0000] In applyConcatDf: vals[concat_contact_number]: 973 
INFO[0000] In applyConcatDf: vals[contact_number]: 12345678 
INFO[0000] In applyConcatDf: vals[concat_contact_number]: 97312345678 
INFO[0000] In applyConcatDf: vals: map[0:973 1:12345678 2:<nil> concat_contact_number:97312345678 contact_number:12345678 contact_number_country_code:973] 
INFO[0000] In prepareDataframe:                         
INFO[0000] +-----+-----------------------------+----------------+-----------------------+
|     | CONTACT NUMBER COUNTRY CODE | CONTACT NUMBER | CONCAT CONTACT NUMBER |
+-----+-----------------------------+----------------+-----------------------+
| 0:  |             973             |    12345678    |      97312345678      |
+-----+-----------------------------+----------------+-----------------------+
| 1X3 |           STRING            |     STRING     |        STRING         |
+-----+-----------------------------+----------------+-----------------------+ 
INFO[0000] In main:                                     
INFO[0000] +-----+-----------------------------+----------------+-----------------------+
|     | CONTACT NUMBER COUNTRY CODE | CONTACT NUMBER | CONCAT CONTACT NUMBER |
+-----+-----------------------------+----------------+-----------------------+
| 0:  |             973             |    12345678    |      97312345678      |
+-----+-----------------------------+----------------+-----------------------+
| 1X3 |           STRING            |     STRING     |        STRING         |
+-----+-----------------------------+----------------+-----------------------+ 

In this run, the value for concat_contact_number in the resulting dataframe was incorrectly set to NaN. As in the correct run, the map value for concat_contact_number still holds the expected concatenated value.

Erroneous output:

$ go run main.go 
INFO[0000] In applyConcatDf: vals[contact_number_country_code]: 973 
INFO[0000] In applyConcatDf: vals[concat_contact_number]: 973 
INFO[0000] In applyConcatDf: vals[contact_number]: 12345678 
INFO[0000] In applyConcatDf: vals[concat_contact_number]: 97312345678 
INFO[0000] In applyConcatDf: vals: map[0:973 1:12345678 2:<nil> concat_contact_number:97312345678 contact_number:12345678 contact_number_country_code:973] 
INFO[0000] In prepareDataframe:                         
INFO[0000] +-----+-----------------------------+----------------+-----------------------+
|     | CONTACT NUMBER COUNTRY CODE | CONTACT NUMBER | CONCAT CONTACT NUMBER |
+-----+-----------------------------+----------------+-----------------------+
| 0:  |             973             |    12345678    |          NaN          |
+-----+-----------------------------+----------------+-----------------------+
| 1X3 |           STRING            |     STRING     |        STRING         |
+-----+-----------------------------+----------------+-----------------------+ 
INFO[0000] In main:                                     
INFO[0000] +-----+-----------------------------+----------------+-----------------------+
|     | CONTACT NUMBER COUNTRY CODE | CONTACT NUMBER | CONCAT CONTACT NUMBER |
+-----+-----------------------------+----------------+-----------------------+
| 0:  |             973             |    12345678    |          NaN          |
+-----+-----------------------------+----------------+-----------------------+
| 1X3 |           STRING            |     STRING     |        STRING         |
+-----+-----------------------------+----------------+-----------------------+ 

It can be observed that in both cases the map value for 2 is always <nil>. Is this expected?

Run this code several times to see the output vary. The issue may not show up immediately: sometimes it takes 10 runs, sometimes only 2. Again, the behavior is inconsistent.

Working code:

package main

import (
    "context"
    "fmt"
    "strings"

    dataframe "github.com/rocketlaunchr/dataframe-go"
    "github.com/rocketlaunchr/dataframe-go/imports"
    log "github.com/sirupsen/logrus"
)

// applyConcatDf returns an ApplyDataFrameFn that concatenates the given column names into another column
func applyConcatDf(dest_column string, columns []string) dataframe.ApplyDataFrameFn {
    return func(vals map[interface{}]interface{}, row, nRows int) map[interface{}]interface{} {
        vals[dest_column] = ""
        for _, key := range columns {
            log.Infof("vals[%s]: %s", key, vals[key].(string))
            vals[dest_column] = vals[dest_column].(string) + vals[key].(string)
            log.Infof("vals[%s]: %s", dest_column, vals[dest_column].(string))
        }

        log.Infof("vals: %v", vals)
        return vals
    }
}

// setupDataframe initializes the dataframe from a CSV string
func setupDataframe() *dataframe.DataFrame {
    ctx := context.Background()

    csvStr := `contact_number_country_code,contact_number
"973","12345678"`

    df, _ := imports.LoadFromCSV(ctx, strings.NewReader(csvStr), imports.CSVLoadOptions{
        DictateDataType: map[string]interface{}{
            "contact_number_country_code": "",
            "contact_number":              "",
        },
    })

    return df
}

// prepareDataframe applies the concatenation on the loaded dataframe
func prepareDataframe(df *dataframe.DataFrame) {
    ctx := context.Background()

    sConcatContactNumber := dataframe.NewSeriesString("concat_contact_number", &dataframe.SeriesInit{Size: df.NRows()})
    df.AddSeries(sConcatContactNumber, nil)

    _, err := dataframe.Apply(ctx, df, applyConcatDf("concat_contact_number", []string{"contact_number_country_code", "contact_number"}), dataframe.FilterOptions{InPlace: true})

    if err != nil {
        log.WithError(err).Error("concatenation cannot be applied")
    }

    fmt.Println(df)
}

func main() {
    df := setupDataframe()
    prepareDataframe(df)
    fmt.Println(df)
}
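pjebs commented 3 years ago

package main

import (
    "context"
    "fmt"
    "strings"

    dataframe "github.com/rocketlaunchr/dataframe-go"
    "github.com/rocketlaunchr/dataframe-go/imports"
    log "github.com/sirupsen/logrus"
)

var ctx = context.Background()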

func main() {

    csvStr := `contact_number_country_code,contact_number
"973","12345678"`

    df, _ := imports.LoadFromCSV(ctx, strings.NewReader(csvStr), imports.CSVLoadOptions{
        DictateDataType: map[string]interface{}{
            "contact_number_country_code": "",
            "contact_number":              "",
        },
    })

    sConcatContactNumber := dataframe.NewSeriesString("concat_contact_number", &dataframe.SeriesInit{Size: df.NRows()})
    df.AddSeries(sConcatContactNumber, nil)

    applyFn := dataframe.ApplyDataFrameFn(func(vals map[interface{}]interface{}, row, nRows int) map[interface{}]interface{} {
        return map[interface{}]interface{}{
            "concat_contact_number": vals["contact_number_country_code"].(string) + vals["contact_number"].(string),
        }
    })

    _, err := dataframe.Apply(ctx, df, applyFn, dataframe.FilterOptions{InPlace: true})
    if err != nil {
        log.WithError(err).Error("concatenation cannot be applied")
    }

    fmt.Println(df)
}
pjebs commented 3 years ago

@migscabral

Can you try the sample I wrote above?

The issue is that you return vals itself. The returned map indicates what you want to change.

The map keys can be ints (the index of a Series) or strings (the name of a Series).

In your erroneous case:

 vals: map[0:973 1:12345678 2:<nil> concat_contact_number:97312345678 contact_number:12345678 contact_number_country_code:973]

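It is saying change the Series at index 2 (the third column) to nil, but also saying change the concat_contact_number Series to "97312345678". They point to the same Series in your case. Since Go does not guarantee map iteration order, which of the two conflicting updates ends up applied last can differ between runs, which would explain why the result is only sometimes NaN.

Use either ints or strings when referring to a Series, but not both.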
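To make the conflict described above easier to spot, here is a minimal sketch of a check that flags a returned map addressing the same Series by both index and name. It uses only the standard library; duplicateTargets and the names slice are hypothetical helpers for illustration and are not part of dataframe-go.

```go
package main

import (
	"fmt"
	"sort"
)

// duplicateTargets reports the names of columns that an update map addresses
// through both an int key (index) and a string key (name).
// names is the ordered list of column names; it is an assumed stand-in for the
// order of Series in the DataFrame, not a library API.
func duplicateTargets(update map[interface{}]interface{}, names []string) []string {
	byIndex := map[string]bool{}
	byName := map[string]bool{}
	for key := range update {
		switch k := key.(type) {
		case int:
			// Ignore indexes that do not correspond to a known column.
			if k >= 0 && k < len(names) {
				byIndex[names[k]] = true
			}
		case string:
			byName[k] = true
		}
	}

	var dups []string
	for name := range byIndex {
		if byName[name] {
			dups = append(dups, name)
		}
	}
	// Sort so the result does not depend on map iteration order.
	sort.Strings(dups)
	return dups
}

func main() {
	names := []string{"contact_number_country_code", "contact_number", "concat_contact_number"}

	// The map returned in the erroneous case: the whole row plus the new value.
	mixed := map[interface{}]interface{}{
		0: "973", 1: "12345678", 2: nil,
		"contact_number_country_code": "973",
		"contact_number":              "12345678",
		"concat_contact_number":       "97312345678",
	}
	fmt.Println(duplicateTargets(mixed, names))

	// A map that names only the Series to change has no conflicts.
	namedOnly := map[interface{}]interface{}{"concat_contact_number": "97312345678"}
	fmt.Println(duplicateTargets(namedOnly, names))
}
```

A check like this could be run on the map inside an apply function while debugging; an empty result means no Series is addressed twice.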

migscabral commented 3 years ago

@pjebs I updated my code to return a new map from inside ApplyDataFrameFn and it now works. Thank you.

I have a couple of questions so that I can better understand how this library works:

  1. I'm not sure where in my code the int or string keys were set. From what I understand, I did not explicitly choose to use int or string keys. The vals map that was received by ApplyDataFrameFn already contained both. Can you point me to where they were set?

  2. The key difference that I saw between your implementation and mine is that you created a new map inside the ApplyDataFrameFn instead of directly modifying the vals map. Is this the recommended way?

pjebs commented 3 years ago
  1. The vals param contains the existing values for the row, keyed both by int (index) and by string (name) for convenience.

  2. The applyFn must return a map that contains only what you want to update. You were basically returning the current row values again (and not just the changes to update). I have updated the documentation to make it clearer.
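The two points above can be sketched with the standard library alone: read from vals, but return a fresh map that holds only the column to update. concatUpdate is a hypothetical helper name, and the vals literal below merely imitates the row map shown in the logs earlier in this thread.

```go
package main

import "fmt"

// concatUpdate reads the source columns from vals and returns a new map that
// holds only the destination column. vals itself is never modified.
func concatUpdate(vals map[interface{}]interface{}, dest string, cols []string) map[interface{}]interface{} {
	joined := ""
	for _, col := range cols {
		// The comma-ok assertion skips nil or non-string values instead of panicking.
		if s, ok := vals[col].(string); ok {
			joined += s
		}
	}
	return map[interface{}]interface{}{dest: joined}
}

func main() {
	// Imitation of the row map: every value appears under an int and a string key.
	vals := map[interface{}]interface{}{
		0: "973", 1: "12345678", 2: nil,
		"contact_number_country_code": "973",
		"contact_number":              "12345678",
		"concat_contact_number":       nil,
	}

	update := concatUpdate(vals, "concat_contact_number",
		[]string{"contact_number_country_code", "contact_number"})

	// Only the one changed column is present in the returned map.
	fmt.Println(len(update), update["concat_contact_number"])
}
```

Inside an apply function with the signature shown in this thread, the body would then reduce to returning concatUpdate(vals, dest, cols), so the row values are never handed back to the library.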

migscabral commented 3 years ago

Thank you @pjebs, that is much clearer now. Do you have a link to that documentation?

pjebs commented 3 years ago

Let me refresh godoc.org.

pjebs commented 3 years ago

https://godoc.org/github.com/rocketlaunchr/dataframe-go#Apply
https://godoc.org/github.com/rocketlaunchr/dataframe-go#ApplyDataFrameFn