go-gota / gota

Gota: DataFrames and data wrangling in Go (Golang)

Incorrect output when sorting on multiple columns using DataFrame.Arrange(). #61

Open bfitzsimmons opened 6 years ago

bfitzsimmons commented 6 years ago

I'm currently having an issue when attempting to sort by multiple columns.

Given the following code (I'll explain the commented-out lines in a moment):

package main

import (
    "fmt"

    "github.com/kniren/gota/dataframe"
)

func main() {
    df := dataframe.LoadRecords(
        [][]string{
            {"A", "B"},
            {"0.346", "662"},
            {"0.331", "725"},
            // {"0.33", "561"},
            // {"0.322", "593"},
            // {"0.322", "543"},
            // {"0.32", "707"},
            // {"0.32", "568"},
            // {"0.318", "671"},
            // {"0.318", "645"},
            // {"0.314", "540"},
            // {"0.312", "679"},
            {"0.31", "682"},
            {"0.309", "680"},
            {"0.308", "695"},
            {"0.307", "514"},
            {"0.306", "530"},
            // {"0.306", "507"},
            // {"0.305", "597"},
            {"0.304", "675"},
            {"0.304", "718"},
            // {"0.303", "576"},
            // {"0.303", "515"},
            // {"0.301", "605"},
            // {"0.3", "645"},
            // {"0.3", "566"},
            {"0.299", "564"},
            {"0.297", "665"},
            {"0.297", "689"},
            {"0.297", "507"},
            {"0.295", "665"},
            // {"0.295", "613"},
            {"0.294", "577"},
            {"0.293", "577"},
            {"0.293", "586"},
            {"0.293", "675"},
            {"0.29", "589"},
            {"0.288", "568"},
            {"0.288", "630"},
            {"0.288", "645"},
            {"0.288", "573"},
        },
    )

    fmt.Println(df.Arrange(dataframe.Sort("A"), dataframe.Sort("B")))
}

I get the correct output:

[23x2] DataFrame

    A        B
 0: 0.288000 568
 1: 0.288000 573
 2: 0.288000 630
 3: 0.288000 645
 4: 0.290000 589
 5: 0.293000 577
 6: 0.293000 675
 7: 0.293000 586
 8: 0.294000 577
 9: 0.295000 665
    ...      ...
    <float>  <int>

Now for the reason behind the commented-out lines.

If I uncomment any one of the commented-out lines, I get the following output:

[24x2] DataFrame

    A        B
 0: 0.288000 645
 1: 0.288000 568
 2: 0.288000 573
 3: 0.288000 630
 4: 0.290000 589
 5: 0.293000 577
 6: 0.293000 675
 7: 0.293000 586
 8: 0.294000 577
 9: 0.295000 665
    ...      ...
    <float>  <int>

The order is no longer correct. Note the "B" column: for A = 0.288, 645 now comes before 568.
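For reference, here is a minimal sketch of the ordering I would expect, written against the standard library only rather than gota. It assumes that Arrange(Sort("A"), Sort("B")) is meant to order by A ascending and break ties on B ascending; the row type and the sortByAB helper are hypothetical names made up for this illustration. Under that reading the A = 0.293 rows would also come out as 577, 586, 675.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical stand-in for one record of the DataFrame above.
type row struct {
	A float64
	B int
}

// sortByAB returns a copy of rows ordered by A ascending, with ties on A
// broken by B ascending. This is the assumed meaning of a two-column sort,
// not gota's actual implementation.
func sortByAB(rows []row) []row {
	out := append([]row(nil), rows...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].A != out[j].A {
			return out[i].A < out[j].A
		}
		return out[i].B < out[j].B
	})
	return out
}

func main() {
	// A subset of the rows from the example data.
	rows := []row{
		{0.293, 577}, {0.293, 586}, {0.293, 675},
		{0.290, 589},
		{0.288, 568}, {0.288, 630}, {0.288, 645}, {0.288, 573},
	}
	for _, r := range sortByAB(rows) {
		fmt.Printf("%.3f %d\n", r.A, r.B)
	}
}
```

A single comparator that looks at both keys avoids depending on whether successive single-column sorts are stable.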

Since I don't yet know which combination of values causes the incorrect sorting, I've left them all commented out in the data, in the hope that someone will spot something in the values that triggers this behavior.
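To narrow down which rows trigger the problem, a small checker like the one below could be run on the sorted result after uncommenting each line in turn. This is only a sketch using the standard library: the record type and the firstViolation helper are hypothetical, and the values would have to be copied out of the DataFrame by hand or by whatever accessor is convenient.

```go
package main

import "fmt"

// record is a hypothetical stand-in for one row of the sorted output.
type record struct {
	A float64
	B int
}

// firstViolation returns the index of the first row that is out of order
// relative to its predecessor under (A ascending, then B ascending), or -1
// if the whole slice is ordered.
func firstViolation(rows []record) int {
	for i := 1; i < len(rows); i++ {
		prev, cur := rows[i-1], rows[i]
		if cur.A < prev.A || (cur.A == prev.A && cur.B < prev.B) {
			return i
		}
	}
	return -1
}

func main() {
	// First rows of the second output shown above.
	observed := []record{
		{0.288, 645}, {0.288, 568}, {0.288, 573}, {0.288, 630}, {0.290, 589},
	}
	fmt.Println(firstViolation(observed)) // prints 1
}
```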

Any thoughts on what might be happening?

bfitzsimmons commented 6 years ago

Is this related to https://github.com/kniren/gota/issues/50?