kniren opened this issue 8 years ago
I ran some benchmarks using some of the base methods that can be used for operating on DataFrames. The dataset used was nycflights13, which consists of 336,776 rows and 20 columns. The benchmark code was the following:
```go
package dataframe_test

import (
	"os"
	"testing"

	"github.com/kniren/gota/dataframe"
)

func BenchmarkAll(b *testing.B) {
	// Load the nycflights13 dataset once; every sub-benchmark reuses it.
	f, err := os.Open("data/nycflights13.csv")
	if err != nil {
		b.Fatal(err)
	}
	defer f.Close()
	flights := dataframe.ReadCSV(f)

	b.Run("Filter_1", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Filter(dataframe.F{"month", "==", 1})
		}
	})
	b.Run("Filter_2", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Filter(dataframe.F{"month", "==", 1}).
				Filter(dataframe.F{"day", "==", 1})
		}
	})
	b.Run("Filter_3", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Filter(dataframe.F{"month", "==", 1}).
				Filter(dataframe.F{"day", "==", 1}).
				Filter(dataframe.F{"hour", "<=", 12})
		}
	})
	b.Run("Arrange_1", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Arrange(
				dataframe.Order{"arr_delay", true},
			)
		}
	})
	b.Run("Arrange_2", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Arrange(
				dataframe.Order{"year", false},
				dataframe.Order{"month", false},
			)
		}
	})
	b.Run("Arrange_3", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Arrange(
				dataframe.Order{"year", false},
				dataframe.Order{"month", false},
				dataframe.Order{"day", false},
			)
		}
	})
	b.Run("Select_1", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Select([]string{"year"})
		}
	})
	b.Run("Select_2", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Select([]string{"year", "month"})
		}
	})
	b.Run("Select_3", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Select([]string{"year", "month", "day"})
		}
	})

	// Index slices covering the first 10, 100, and 1000 rows for Subset/Set.
	var index10, index100, index1000 []int
	for i := 0; i < 10; i++ {
		index10 = append(index10, i)
	}
	for i := 0; i < 100; i++ {
		index100 = append(index100, i)
	}
	for i := 0; i < 1000; i++ {
		index1000 = append(index1000, i)
	}
	sub10 := flights.Subset(index10)
	sub100 := flights.Subset(index100)
	sub1000 := flights.Subset(index1000)

	b.Run("Subset_10", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Subset(index10)
		}
	})
	b.Run("Subset_100", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Subset(index100)
		}
	})
	b.Run("Subset_1000", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Subset(index1000)
		}
	})
	b.Run("Set_10", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Set(index10, sub10)
		}
	})
	b.Run("Set_100", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Set(index100, sub100)
		}
	})
	b.Run("Set_1000", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Set(index1000, sub1000)
		}
	})
}
```
The two most obvious improvements suggested by the CPU profile were, first, that a lot of small memory allocations were taking place, which impacted the GC very heavily, and second, that a lot of reflection was being performed whenever new `Series` were copied or created. To address the first issue, slices are now pre-allocated up front instead of grown incrementally; for the second, type assertions now check the given values before resorting to reflection, which is used only as a fallback when the given types are not recognised.
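To illustrate the second change, here is a minimal sketch of the pattern, using a hypothetical conversion helper (this is not gota's actual code): cheap type assertions handle the common concrete types with a single pre-allocation, and only unrecognised types fall through to the `reflect` package.

```go
package main

import (
	"fmt"
	"reflect"
)

// toIntSlice is a hypothetical sketch of the optimisation: known concrete
// slice types take a fast path with no reflection, everything else falls
// back to generic handling through the reflect package.
func toIntSlice(values interface{}) []int {
	switch v := values.(type) {
	case []int:
		// Fast path: one allocation, plain copy, no reflection.
		out := make([]int, len(v))
		copy(out, v)
		return out
	case []float64:
		out := make([]int, len(v))
		for i, f := range v {
			out[i] = int(f)
		}
		return out
	default:
		// Slow path: reflection (assumes an integer-kind slice for brevity).
		rv := reflect.ValueOf(values)
		out := make([]int, rv.Len())
		for i := 0; i < rv.Len(); i++ {
			out[i] = int(rv.Index(i).Int())
		}
		return out
	}
}

func main() {
	fmt.Println(toIntSlice([]int{1, 2, 3}))   // fast path
	fmt.Println(toIntSlice([]int64{4, 5, 6})) // reflection fallback
}
```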
With these two changes, already present as of ffcfe13, we can observe a significant speedup on the same dataset:
```
benchmark                     old ns/op     new ns/op     delta
BenchmarkAll/Filter_1-4       278527790     119058640     -57.25%
BenchmarkAll/Filter_2-4       272421441     125048692     -54.10%
BenchmarkAll/Filter_3-4       280313297     126590535     -54.84%
BenchmarkAll/Arrange_1-4      3044940770    1342395569    -55.91%
BenchmarkAll/Arrange_2-4      4140007971    1527648847    -63.10%
BenchmarkAll/Arrange_3-4      6168188016    2275641266    -63.11%
BenchmarkAll/Select_1-4       48459199      14510337      -70.06%
BenchmarkAll/Select_2-4       92780819      28908778      -68.84%
BenchmarkAll/Select_3-4       132650059     44706709      -66.30%
BenchmarkAll/Subset_10-4      75320         30194         -59.91%
BenchmarkAll/Subset_100-4     419068        220374        -47.41%
BenchmarkAll/Subset_1000-4    3323734       2029517       -38.94%
BenchmarkAll/Set_10-4         1955485629    656503486     -66.43%
BenchmarkAll/Set_100-4        1889303947    638403492     -66.21%
BenchmarkAll/Set_1000-4       1979352270    678069517     -65.74%
```
Further optimization is still on the cards, by improving the algorithms, reducing memory allocations, or using concurrency for some processes. For the latter, I've done some experiments on the `dataframe.New` function, using goroutines to perform the copy of the given `Series`. This results in a performance increase across the board, but I haven't included it in the current `dev` branch just yet, since I want to make sure that using this kind of pattern won't affect users trying to make their own code concurrent.
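The experiment looks roughly like the following sketch (the helper name and signature here are assumptions, not the code on `dev`; `series.Series` and its `Copy` method are the library's own):

```go
package dataframe

import (
	"sync"

	"github.com/kniren/gota/series"
)

// copySeriesConcurrently is a sketch of the experiment: copy each incoming
// Series in its own goroutine instead of sequentially.
func copySeriesConcurrently(ss []series.Series) []series.Series {
	out := make([]series.Series, len(ss))
	var wg sync.WaitGroup
	for i := range ss {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes only to its own index, so no locking
			// is needed and there are no data races on out.
			out[i] = ss[i].Copy()
		}(i)
	}
	wg.Wait()
	return out
}
```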
As of c9b8f464475e206370fdf8e7ce2b0f77ec0542f1 I've optimized the `DataFrame.Arrange` algorithm for a good increase in speed when sorting by more than one column:
```
benchmark                     old ns/op     new ns/op     delta
BenchmarkAll/Arrange_1-4      1342395569    1422965316    +6.00%
BenchmarkAll/Arrange_2-4      1527648847    888996432     -41.81%
BenchmarkAll/Arrange_3-4      2275641266    1025177764    -54.95%
```
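For intuition, one common way to get this kind of multi-column speedup is to sort a row-index permutation once with a lexicographic comparator over the requested columns, rather than running one stable sort per column. The sketch below illustrates that idea; it is not necessarily the exact change made in the commit, and it uses plain `int` columns for brevity:

```go
package main

import (
	"fmt"
	"sort"
)

// byColumns sorts a permutation of row indices lexicographically over
// several columns, from highest to lowest sort priority.
type byColumns struct {
	idx  []int   // row permutation being sorted
	cols [][]int // column data, in sort-priority order
}

func (b byColumns) Len() int      { return len(b.idx) }
func (b byColumns) Swap(i, j int) { b.idx[i], b.idx[j] = b.idx[j], b.idx[i] }
func (b byColumns) Less(i, j int) bool {
	for _, col := range b.cols {
		a, c := col[b.idx[i]], col[b.idx[j]]
		if a != c {
			return a < c
		}
		// Equal in this column: fall through to the next one.
	}
	return false
}

func main() {
	year := []int{2013, 2013, 2012, 2013}
	month := []int{5, 1, 12, 1}
	idx := []int{0, 1, 2, 3}
	sort.Stable(byColumns{idx, [][]int{year, month}})
	fmt.Println(idx) // [2 1 3 0]: sorted by year, then month
}
```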
Additionally, `DataFrame.Set` and `Series.Set` now perform the setting in place to avoid copying the entire structure. If one needs to make sure the original `DataFrame` is not modified, one can resort to `DataFrame.Copy().Set(...)`.
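For example, reusing the `flights`, `index10`, and `sub10` values from the benchmark code above:

```go
// Set now mutates the receiver in place; flights itself is modified.
flights.Set(index10, sub10)

// To leave the original DataFrame untouched, copy first and Set on the copy.
updated := flights.Copy().Set(index10, sub10)
_ = updated // flights remains unchanged here
```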
I've been exploring the performance of Gota on the same dataset, comparing it against Pandas and R/dplyr. The results are quite illuminating: for the majority of cases, Gota is slower than both other systems. The bottleneck seems to be that, due to the data structures used, the large number of small memory allocations hits the garbage collector pretty heavily, especially when copying the data (which, given the nature of the library, happens quite often). When modifying the `series` package to use different structures where the data is stored contiguously, we can observe a significant speedup on the majority of the benchmarks. Unfortunately, in the current experimental optimisation the readability and maintainability of the code is poor, so I will refrain from merging the modifications until I study whether there is a better way of handling this problem.
AVG (10 runs) | Gota 255f427 (Go v1.7.3) [ns] | Gota experimental (1a6c1aa641d0) (Go v1.7.3) [ns] | dplyr (R v3.3.0) [ns] | Pandas (Python v3.4.2) [ns] | Ratio (255f427 / dplyr) | Ratio (255f427 / Pandas) | Ratio (experimental / dplyr) | Ratio (experimental / Pandas)
---|---|---|---|---|---|---|---|---
Load Data | 3796092757 | 2792890627 | 5177613974 | 867149115 | 0.73 | 4.38 | 0.54 | 3.22
Filtering (1) | 87780355 | 120904356 | 5290219 | 4029682 | 16.59 | 21.78 | 22.85 | 30.00
Filtering (2) | 94769640 | 124575783 | 8933675 | 3119330 | 10.61 | 30.38 | 13.94 | 39.94
Filtering (3) | 100542430 | 134784916 | 13318014 | 6498120 | 7.55 | 15.47 | 10.12 | 20.74
Arrange (1 col) | 1128079537 | 591918392 | 119221401 | 123551798 | 9.46 | 9.13 | 4.96 | 4.79
Arrange (2 col) | 587693288 | 305231446 | 75816894 | 67885780 | 7.75 | 8.66 | 4.03 | 4.50
Arrange (3 col) | 794083891 | 350509373 | 91589761 | 79083681 | 8.67 | 10.04 | 3.83 | 4.43
Select (1) | 15037054 | 2306528 | 218021 | 1894200 | 68.97 | 7.94 | 10.58 | 1.22
Select (2) | 30581870 | 4202700 | 233776 | 2974939 | 130.82 | 10.28 | 17.98 | 1.41
Select (3) | 45824883 | 4734341 | 249350 | 3899350 | 183.78 | 11.75 | 18.99 | 1.21
Subset (First 10 rows) | 16326 | 12913 | 1104372 | 143039 | 0.015 | 0.11 | 0.012 | 0.09
Subset (First 100 rows) | 121928 | 76094 | 1128201 | 161719 | 0.11 | 0.75 | 0.067 | 0.47
Subset (First 1000 rows) | 1118375 | 677757 | 1259873 | 144601 | 0.89 | 7.73 | 0.54 | 4.69
Set (First 10 rows) | 333080426 | 64648001 | 28951330 | 81281099 | 11.50 | 4.10 | 2.23 | 0.80
Set (First 100 rows) | 362018982 | 65256465 | 31732535 | 81956005 | 11.41 | 4.42 | 2.06 | 0.80
Set (First 1000 rows) | 332827597 | 68338569 | 29457307 | 81333649 | 11.30 | 4.09 | 2.32 | 0.84
After studying the problem I've settled on a solution that is almost as performant as the experimental one but doesn't change the overall structure of the code significantly, so it should be as easy to maintain as before. Essentially, the main sticking point for performance was that storing the elements of a `Series` as `[]Element` requires a significant amount of interface building and many small allocations, and the memory does not end up laid out consecutively, all of which keeps the garbage collector busy. Changing the main data structure of `Series` to `Elements` allows the concrete implementations for the different types to live in contiguous memory. Furthermore, to avoid memory fragmentation and the creation of many small pointers, the concrete implementation of the `Element` interface for the four types has been changed so that it no longer stores a pointer to the value:
```go
type intElement struct {
	e *int
}
```
Instead, it now stores the data inside the structure, along with an additional flag describing whether the element is `NaN`:
```go
type intElement struct {
	e   int
	nan bool
}
```
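To make the memory layout concrete, here is an illustrative sketch of that direction; the `Elements` interface below is a simplified stand-in, not the library's exact definition:

```go
package main

import "fmt"

// intElement stores its value inline, plus a NaN flag (as described above).
type intElement struct {
	e   int
	nan bool
}

// intElements is the concrete backing store: all elements of a column sit in
// one contiguous slice, so a single allocation backs the whole Series and the
// garbage collector sees no per-element pointers.
type intElements []intElement

// Elements is an illustrative stand-in for the abstraction a Series holds.
type Elements interface {
	Len() int
}

func (e intElements) Len() int { return len(e) }

func main() {
	var col Elements = make(intElements, 3) // one allocation for all elements
	fmt.Println(col.Len())                  // 3
}
```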
With these changes we observe a significant speedup with respect to the unoptimised version, and we start to get closer to the performance achieved by Pandas/R:
```
benchmark                                         old ns/op    new ns/op    delta
BenchmarkSeries_New/[]bool(100000)_Int-4          8313278      4980892      -40.09%
BenchmarkSeries_New/[]bool(100000)_String-4       9298998      5540847      -40.41%
BenchmarkSeries_New/[]bool(100000)_Bool-4         6366129      3843971      -39.62%
BenchmarkSeries_New/[]bool(100000)_Float-4        8377768      5009936      -40.20%
BenchmarkSeries_New/[]string(100000)_Int-4        18303166     13586413     -25.77%
BenchmarkSeries_New/[]string(100000)_String-4     11978478     8307875      -30.64%
BenchmarkSeries_New/[]string(100000)_Bool-4       28817480     26654172     -7.51%
BenchmarkSeries_New/[]string(100000)_Float-4      30509838     27914009     -8.51%
BenchmarkSeries_New/[]float64(100000)_Int-4       7813046      5002564      -35.97%
BenchmarkSeries_New/[]float64(100000)_String-4    52620115     47544694     -9.65%
BenchmarkSeries_New/[]float64(100000)_Bool-4      8122474      4490235      -44.72%
BenchmarkSeries_New/[]float64(100000)_Float-4     7538372      5033752      -33.22%
BenchmarkSeries_New/[]int(100000)_Int-4           7531595      5063625      -32.77%
BenchmarkSeries_New/[]int(100000)_String-4        19670643     16772118     -14.74%
BenchmarkSeries_New/[]int(100000)_Bool-4          7826898      4548778      -41.88%
BenchmarkSeries_New/[]int(100000)_Float-4         7589070      4962150      -34.61%
BenchmarkSeries_Copy/[]int(100000)_Int-4          3830732      780120       -79.64%
BenchmarkSeries_Copy/[]int(100000)_String-4       5847982      3473473      -40.60%
BenchmarkSeries_Copy/[]int(100000)_Bool-4         1780169      101940       -94.27%
BenchmarkSeries_Copy/[]int(100000)_Float-4        3756647      775425       -79.36%
BenchmarkSeries_Subset/[]int(100000)_Int-4        402341       205062       -49.03%
BenchmarkSeries_Subset/[]int(100000)_String-4     576282       281180       -51.21%
BenchmarkSeries_Subset/[]int(100000)_Bool-4       215667       152118       -29.47%
BenchmarkSeries_Subset/[]int(100000)_Float-4      405079       210860       -47.95%
BenchmarkSeries_Set/[]int(100000)_Int-4           623275       535365       -14.10%
BenchmarkSeries_Set/[]int(100000)_String-4        814689       519182       -36.27%
BenchmarkSeries_Set/[]int(100000)_Bool-4          546948       537051       -1.81%
BenchmarkSeries_Set/[]int(100000)_Float-4         630201       511646       -18.81%
```
AVG (10 runs) | Gota experimental (6cb756f) (Go v1.7.3) [ns] | dplyr (R v3.3.0) [ns] | Pandas (Python v3.4.2) [ns] | Ratio (6cb756f / dplyr) | Ratio (6cb756f / Pandas)
---|---|---|---|---|---
Load Data | 3550186270 | 5177613974 | 867149115 | 0.69 | 4.09
Filtering (1) | 60191373 | 5290219 | 4029682 | 11.38 | 14.94
Filtering (2) | 76414594 | 8933675 | 3119330 | 8.55 | 24.50
Filtering (3) | 75358728 | 13318014 | 6498120 | 5.66 | 11.60
Arrange (1 col) | 686606598 | 119221401 | 123551798 | 5.76 | 5.56
Arrange (2 col) | 358894188 | 75816894 | 67885780 | 4.73 | 5.29
Arrange (3 col) | 431656827 | 91589761 | 79083681 | 4.71 | 5.46
Select (1) | 3657266 | 218021 | 1894200 | 16.77 | 1.93
Select (2) | 7200508 | 233776 | 2974939 | 30.80 | 2.42
Select (3) | 9918440 | 249350 | 3899350 | 39.78 | 2.54
Subset (First 10 rows) | 11094 | 1104372 | 143039 | 0.010 | 0.078
Subset (First 100 rows) | 56498 | 1128201 | 161719 | 0.050 | 0.35
Subset (First 1000 rows) | 498362 | 1259873 | 144601 | 0.40 | 3.45
Set (First 10 rows) | 73870225 | 28951330 | 81281099 | 2.55 | 0.91
Set (First 100 rows) | 69103630 | 31732535 | 81956005 | 2.18 | 0.84
Set (First 1000 rows) | 82565295 | 29457307 | 81333649 | 2.80 | 1.02
This has been merged into `dev` as of 6cb756f.
So far there has not been a study of the performance of this library in terms of speed and memory consumption. I'm currently prioritising the features that impact users directly, since the API design is still in flux, but this should be addressed in the near future.