Open tobgu opened 6 years ago
You might be right. I'm curious to know if you could point an example to me that can cause a problem when using sort.Sort
and how big of an impact can this be sorting performance.
My guess is that the memory copy time greatly outweights the extra overhead that O(n*log(n)*log(n))
calls to the swap function can cause, but I would like to get some numbers on this before pulling the trigger and changing sort.Sort
to sort.Stable
for Series.Order
.
Here's an example of a test that demonstrates the issue:
func generateInts(seed int64, size, intRange int) []int {
result := make([]int, size)
rand.Seed(seed)
// Random slice
for ix := range result {
result[ix] = rand.Intn(intRange)
}
return result
}
func TestDataFrame_Sort(t *testing.T) {
size := 10
d1 := dataframe.New(
series.New(generateInts(1, size, 3), series.Int, "S1"),
series.New(generateInts(1, size, 2), series.Int, "S2"))
d2 := d1.Arrange(dataframe.Sort("S1"), dataframe.Sort("S2"))
fmt.Println(d2.String())
previousS1, previousS2 := d2.Elem(0, 0), d2.Elem(0, 1)
for i := 1; i < size; i++ {
currentS1, currentS2 := d2.Elem(i, 0), d2.Elem(i, 1)
if !(previousS1.Less(currentS1) || previousS2.LessEq(currentS2)) {
t.Errorf("%d: previousS1=%s, previousS2=%s, currentS1=%s, currentS2=%s",
i-1, previousS1.String(), previousS2.String(), currentS1.String(), currentS2.String())
}
previousS1, previousS2 = currentS1, currentS2
}
}
Performance of sort.Stable
is notably worse than that of sort.Sort
. Here's a benchmark using a dataframe with 4x100000 random integers (range 0-100000) sorted on two columns, one ascending and one descending:
BenchmarkDataFrame_Sort-2 5 218578662 ns/op 50547024 B/op 148 allocs/op
BenchmarkDataFrame_SortStable-2 3 435763757 ns/op 50547024 B/op 148 allocs/op
BenchmarkQFrame_Sort-2 30 48992845 ns/op 401536 B/op 4 allocs/op
The QFrame benchmark is from a dataframe-like construct that I'm writing to support a Go implementation of https://github.com/tobgu/qcache. Implementing that was how I stumbled over this issue. I'm not actually affected by it as a user. For the QFrame I opted for a slightly different approach where I instead of sorting series per series i compare row by row. This is slower for single columns sorts but the time is not growing linearly with the number of columns you sort on. Something that was important to me since I currently base grouping, etc. on a sorted frame. The performance difference between dataframe and QFrame is likely because of less allocations and less abstractions/interfaces. I'm trying to get it on par with Pandas performance-wise. It's just a prototype at the moment.
A slight optimisation that will produce correct results could be to use sort.Sort
for the first column to sort and sort.Stable
for the rest. I guess it's fairly common to only sort by one column.
If someone wants to submit a PR for stable sort as described by @tobgu, I'll merge it.
Woops I accidentally closed this issue sorry.
The Go stdlib documentation states that
sort.Sort
does not guarantee stability of the sorted results (sort.Stable
does). Isn't stability a requirement when sorting the dataframe according to the content of multiple columns to guarantee correctness the waydataframe.Arrange
is currently implemented?