go-gota / gota

Gota: DataFrames and data wrangling in Go (Golang)
Other
2.98k stars 277 forks source link

Is using sort.Sort safe when specifying more than one order column? #50

Open tobgu opened 6 years ago

tobgu commented 6 years ago

The Go stdlib documentation states that sort.Sort does not guarantee stability of the sorted results (sort.Stable does). Isn't stability a requirement when sorting the dataframe according to the content of multiple columns to guarantee correctness the way dataframe.Arrange is currently implemented?

kniren commented 6 years ago

You might be right. I'm curious to know if you could point an example to me that can cause a problem when using sort.Sort and how big of an impact can this be sorting performance.

My guess is that the memory copy time greatly outweights the extra overhead that O(n*log(n)*log(n)) calls to the swap function can cause, but I would like to get some numbers on this before pulling the trigger and changing sort.Sort to sort.Stable for Series.Order.

tobgu commented 6 years ago

Here's an example of a test that demonstrates the issue:

func generateInts(seed int64, size, intRange int) []int {
    result := make([]int, size)
    rand.Seed(seed)
    // Random slice
    for ix := range result {
        result[ix] = rand.Intn(intRange)
    }

    return result
}

func TestDataFrame_Sort(t *testing.T) {
    size := 10
    d1 := dataframe.New(
        series.New(generateInts(1, size, 3), series.Int, "S1"),
        series.New(generateInts(1, size, 2), series.Int, "S2"))

    d2 := d1.Arrange(dataframe.Sort("S1"), dataframe.Sort("S2"))
    fmt.Println(d2.String())
    previousS1, previousS2 := d2.Elem(0, 0), d2.Elem(0, 1)
    for i := 1; i < size; i++ {
        currentS1, currentS2 := d2.Elem(i, 0), d2.Elem(i, 1)
        if !(previousS1.Less(currentS1) || previousS2.LessEq(currentS2)) {
                t.Errorf("%d: previousS1=%s, previousS2=%s, currentS1=%s, currentS2=%s",
                    i-1, previousS1.String(), previousS2.String(), currentS1.String(), currentS2.String())
        }
        previousS1, previousS2 = currentS1, currentS2
    }
}

Performance of sort.Stable is notably worse than that of sort.Sort. Here's a benchmark using a dataframe with 4x100000 random integers (range 0-100000) sorted on two columns, one ascending and one descending:

BenchmarkDataFrame_Sort-2              5     218578662 ns/op    50547024 B/op        148 allocs/op
BenchmarkDataFrame_SortStable-2        3     435763757 ns/op    50547024 B/op        148 allocs/op
BenchmarkQFrame_Sort-2                30      48992845 ns/op      401536 B/op          4 allocs/op

The QFrame benchmark is from a dataframe-like construct that I'm writing to support a Go implementation of https://github.com/tobgu/qcache. Implementing that was how I stumbled over this issue. I'm not actually affected by it as a user. For the QFrame I opted for a slightly different approach where I instead of sorting series per series i compare row by row. This is slower for single columns sorts but the time is not growing linearly with the number of columns you sort on. Something that was important to me since I currently base grouping, etc. on a sorted frame. The performance difference between dataframe and QFrame is likely because of less allocations and less abstractions/interfaces. I'm trying to get it on par with Pandas performance-wise. It's just a prototype at the moment.

A slight optimisation that will produce correct results could be to use sort.Sort for the first column to sort and sort.Stable for the rest. I guess it's fairly common to only sort by one column.

kniren commented 4 years ago

If someone wants to submit a PR for stable sort as described by @tobgu, I'll merge it.

kniren commented 4 years ago

Woops I accidentally closed this issue sorry.