go-gota / gota

Gota: DataFrames and data wrangling in Go (Golang)

Profiling, memory usage and performance review #16

Open kniren opened 8 years ago

kniren commented 8 years ago

So far there has not been a study of this library's performance in terms of speed and memory consumption. I'm currently prioritising the features that impact users directly, since the API design is still in flux, but this should be addressed in the near future.

kniren commented 8 years ago

I ran some benchmarks using some of the base methods that can be used for operating on DataFrames. The dataset used was nycflights13, which consists of 336,776 rows and 20 columns. The benchmark code was the following:

    f, _ := os.Open("data/nycflights13.csv")
    flights := dataframe.ReadCSV(f)
    b.Run("Filter_1", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Filter(dataframe.F{"month", "==", 1})
        }
    })
    b.Run("Filter_2", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Filter(dataframe.F{"month", "==", 1}).
                Filter(dataframe.F{"day", "==", 1})
        }
    })
    b.Run("Filter_3", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Filter(dataframe.F{"month", "==", 1}).
                Filter(dataframe.F{"day", "==", 1}).
                Filter(dataframe.F{"hour", "<=", 12})
        }
    })
    b.Run("Arrange_1", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Arrange(
                dataframe.Order{"arr_delay", true},
            )
        }
    })
    b.Run("Arrange_2", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Arrange(
                dataframe.Order{"year", false},
                dataframe.Order{"month", false},
            )
        }
    })
    b.Run("Arrange_3", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Arrange(
                dataframe.Order{"year", false},
                dataframe.Order{"month", false},
                dataframe.Order{"day", false},
            )
        }
    })
    b.Run("Select_1", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Select([]string{"year"})
        }
    })
    b.Run("Select_2", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Select([]string{"year", "month"})
        }
    })
    b.Run("Select_3", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Select([]string{"year", "month", "day"})
        }
    })
    var index10 []int
    for i := 0; i < 10; i++ {
        index10 = append(index10, i)
    }
    var index100 []int
    for i := 0; i < 100; i++ {
        index100 = append(index100, i)
    }
    var index1000 []int
    for i := 0; i < 1000; i++ {
        index1000 = append(index1000, i)
    }
    sub10 := flights.Subset(index10)
    sub100 := flights.Subset(index100)
    sub1000 := flights.Subset(index1000)
    b.Run("Subset_10", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Subset(index10)
        }
    })
    b.Run("Subset_100", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Subset(index100)
        }
    })
    b.Run("Subset_1000", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Subset(index1000)
        }
    })
    b.Run("Set_10", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Set(index10, sub10)
        }
    })
    b.Run("Set_100", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Set(index100, sub100)
        }
    })
    b.Run("Set_1000", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            flights.Set(index1000, sub1000)
        }
    })

The two most obvious improvements suggested by the CPU profile were: first, a lot of small memory allocations were taking place, which impacted the GC very heavily; and second, a lot of reflection was being used when new Series were copied or created. To address the first issue, slices are now pre-allocated instead of grown incrementally; for the second, a type assertion checks the given values before reflection, and only if the given types are not recognised does it fall back to using reflection.
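Both ideas can be sketched in isolation. The snippet below is a hypothetical stand-in, not gota's actual code: it pre-allocates the destination slice up front and tries a type switch over common concrete types before falling back to (slower) reflection.

```go
package main

import (
    "fmt"
    "reflect"
)

// toInts converts a slice of values to []int. It illustrates the two
// optimisations described above: the destination slice is allocated once
// at its final length (no incremental growth), and a type switch handles
// the common concrete types before resorting to reflection.
func toInts(values interface{}) []int {
    switch v := values.(type) {
    case []int:
        out := make([]int, len(v)) // pre-allocated, no append-driven growth
        copy(out, v)
        return out
    case []float64:
        out := make([]int, len(v))
        for i, f := range v {
            out[i] = int(f)
        }
        return out
    default:
        // Unrecognised type: fall back to reflection.
        rv := reflect.ValueOf(values)
        out := make([]int, rv.Len())
        for i := 0; i < rv.Len(); i++ {
            out[i] = int(rv.Index(i).Int())
        }
        return out
    }
}

func main() {
    fmt.Println(toInts([]float64{1.5, 2.5})) // fast path: [1 2]
    fmt.Println(toInts([]int8{3, 4}))        // reflection fallback: [3 4]
}
```

The fast paths avoid the per-element interface handling that reflection implies, which is where most of the time went in the profile.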

With these two changes, already present in ffcfe13, we can observe a significant speedup on the same dataset:

benchmark                      old ns/op      new ns/op      delta
BenchmarkAll/Filter_1-4        278527790      119058640      -57.25%
BenchmarkAll/Filter_2-4        272421441      125048692      -54.10%
BenchmarkAll/Filter_3-4        280313297      126590535      -54.84%
BenchmarkAll/Arrange_1-4       3044940770     1342395569     -55.91%
BenchmarkAll/Arrange_2-4       4140007971     1527648847     -63.10%
BenchmarkAll/Arrange_3-4       6168188016     2275641266     -63.11%
BenchmarkAll/Select_1-4        48459199       14510337       -70.06%
BenchmarkAll/Select_2-4        92780819       28908778       -68.84%
BenchmarkAll/Select_3-4        132650059      44706709       -66.30%
BenchmarkAll/Subset_10-4       75320          30194          -59.91%
BenchmarkAll/Subset_100-4      419068         220374         -47.41%
BenchmarkAll/Subset_1000-4     3323734        2029517        -38.94%
BenchmarkAll/Set_10-4          1955485629     656503486      -66.43%
BenchmarkAll/Set_100-4         1889303947     638403492      -66.21%
BenchmarkAll/Set_1000-4        1979352270     678069517      -65.74%

Further optimisation is still on the cards, by improving the algorithms, reducing memory allocations, or using concurrency for some processes. For the latter, I've done some experiments on the dataframe.New function, using goroutines to copy the given Series. This yields a performance increase across the board, but I haven't included it in the current dev branch just yet, since I want to make sure that using this kind of pattern won't affect users trying to make their own code concurrent.
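The experiment described above can be sketched with a minimal stand-in (the `series` type and `newDataFrame` function below are illustrative, not gota's API): each incoming Series is copied in its own goroutine, with a WaitGroup to join them.

```go
package main

import (
    "fmt"
    "sync"
)

// series is a hypothetical stand-in for gota's Series type.
type series struct{ data []int }

// copySeries returns a deep copy of the series.
func (s series) copySeries() series {
    out := make([]int, len(s.data))
    copy(out, s.data)
    return series{data: out}
}

// newDataFrame sketches the concurrency experiment: instead of copying
// the given Series sequentially, each copy runs in its own goroutine.
func newDataFrame(ss ...series) []series {
    cols := make([]series, len(ss))
    var wg sync.WaitGroup
    for i, s := range ss {
        wg.Add(1)
        go func(i int, s series) {
            defer wg.Done()
            cols[i] = s.copySeries() // each goroutine writes a distinct index
        }(i, s)
    }
    wg.Wait()
    return cols
}

func main() {
    df := newDataFrame(series{[]int{1, 2}}, series{[]int{3, 4}})
    fmt.Println(len(df), df[0].data, df[1].data) // 2 [1 2] [3 4]
}
```

Because each goroutine writes to a distinct slice index, no further synchronisation is needed beyond the WaitGroup; the open question mentioned above is how this interacts with callers who parallelise on their side.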

kniren commented 7 years ago

As of c9b8f464475e206370fdf8e7ce2b0f77ec0542f1 I've optimized the DataFrame.Arrange algorithm for a good increase in speed when sorting by more than one column:

benchmark                    old ns/op      new ns/op      delta
BenchmarkAll/Arrange_1-4     1342395569     1422965316     +6.00%
BenchmarkAll/Arrange_2-4     1527648847     888996432      -41.81%
BenchmarkAll/Arrange_3-4     2275641266     1025177764     -54.95%
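The commit itself isn't reproduced here, but one common way to make multi-column sorting cheaper is to use a single comparison chain rather than one full stable-sort pass per column. The sketch below is a hypothetical illustration of that idea with a stand-in `row` type, not gota's implementation:

```go
package main

import (
    "fmt"
    "sort"
)

// row is a hypothetical stand-in for one DataFrame row.
type row struct{ year, month, day int }

// arrange sorts by year, then month, then day in a single stable sort,
// comparing the next key only when the previous keys tie.
func arrange(rows []row) {
    sort.SliceStable(rows, func(i, j int) bool {
        if rows[i].year != rows[j].year {
            return rows[i].year < rows[j].year
        }
        if rows[i].month != rows[j].month {
            return rows[i].month < rows[j].month
        }
        return rows[i].day < rows[j].day
    })
}

func main() {
    rows := []row{{2013, 2, 1}, {2013, 1, 2}, {2013, 1, 1}}
    arrange(rows)
    fmt.Println(rows) // [{2013 1 1} {2013 1 2} {2013 2 1}]
}
```

A single O(n log n) pass with a chained comparator avoids re-sorting the whole frame once per column, which matches the shape of the numbers above: one-column sorting is roughly unchanged while two- and three-column sorting get much faster.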

Additionally, DataFrame.Set and Series.Set now perform the assignment in place to avoid copying the entire structure. If you need to make sure the original DataFrame is not modified, you can resort to DataFrame.Copy().Set(...).
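The semantics of this change can be demonstrated with a tiny stand-in type (the `frame` type below is illustrative; only the Copy().Set(...) shape mirrors gota's API): Set mutates the receiver's backing storage, so an explicit Copy is needed when the original must stay intact.

```go
package main

import "fmt"

// frame is a hypothetical stand-in illustrating in-place Set semantics.
type frame struct{ vals []int }

// Copy returns a deep copy with its own backing array.
func (f frame) Copy() frame {
    out := make([]int, len(f.vals))
    copy(out, f.vals)
    return frame{vals: out}
}

// Set writes in place: the receiver's backing array is modified.
func (f frame) Set(idx []int, v int) frame {
    for _, i := range idx {
        f.vals[i] = v
    }
    return f
}

func main() {
    orig := frame{vals: []int{1, 2, 3}}
    orig.Set([]int{0}, 9)             // mutates orig in place
    fmt.Println(orig.vals)            // [9 2 3]
    safe := orig.Copy().Set([]int{1}, 7)
    fmt.Println(orig.vals, safe.vals) // [9 2 3] [9 7 3]
}
```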

kniren commented 7 years ago

I've been exploring the performance of Gota on the same dataset, comparing it with Pandas and R/dplyr. The results are quite revealing: for the majority of the cases Gota is slower than both of the other systems.

The bottleneck seems to be that, due to the data structures used, the large number of small memory allocations hits the garbage collector pretty heavily, especially when copying the data (which, due to the nature of the library, happens quite often). When modifying the series package to use different structures where the data is stored contiguously, we observe a significant speedup on the majority of the benchmarks. Unfortunately, in the current experimental optimisation the readability and maintainability of the code are poor, so I will refrain from merging the modifications until I study whether there is a better way of handling this problem.

| AVG (10 runs) | Gota 255f427 (Go 1.7.3) [ns] | Gota experimental 1a6c1aa641d0 (Go 1.7.3) [ns] | dplyr (R 3.3.0) [ns] | Pandas (Python 3.4.2) [ns] | Gota 255f427 / R | Gota 255f427 / Pandas | Gota exp. / R | Gota exp. / Pandas |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Load Data | 3796092757 | 2792890627 | 5177613974 | 867149115 | 0.73 | 4.38 | 0.54 | 3.22 |
| Filtering (1) | 87780355 | 120904356 | 5290219 | 4029682 | 16.59 | 21.78 | 22.85 | 30.00 |
| Filtering (2) | 94769640 | 124575783 | 8933675 | 3119330 | 10.61 | 30.38 | 13.94 | 39.94 |
| Filtering (3) | 100542430 | 134784916 | 13318014 | 6498120 | 7.55 | 15.47 | 10.12 | 20.74 |
| Arrange (1 col) | 1128079537 | 591918392 | 119221401 | 123551798 | 9.46 | 9.13 | 4.96 | 4.79 |
| Arrange (2 col) | 587693288 | 305231446 | 75816894 | 67885780 | 7.75 | 8.66 | 4.03 | 4.50 |
| Arrange (3 col) | 794083891 | 350509373 | 91589761 | 79083681 | 8.67 | 10.04 | 3.83 | 4.43 |
| Select (1) | 15037054 | 2306528 | 218021 | 1894200 | 68.97 | 7.94 | 10.58 | 1.22 |
| Select (2) | 30581870 | 4202700 | 233776 | 2974939 | 130.82 | 10.28 | 17.98 | 1.41 |
| Select (3) | 45824883 | 4734341 | 249350 | 3899350 | 183.78 | 11.75 | 18.99 | 1.21 |
| Subset (first 10 rows) | 16326 | 12913 | 1104372 | 143039 | 0.01 | 0.11 | 0.01 | 0.09 |
| Subset (first 100 rows) | 121928 | 76094 | 1128201 | 161719 | 0.11 | 0.75 | 0.07 | 0.47 |
| Subset (first 1000 rows) | 1118375 | 677757 | 1259873 | 144601 | 0.89 | 7.73 | 0.54 | 4.69 |
| Set (first 10 rows) | 333080426 | 64648001 | 28951330 | 81281099 | 11.50 | 4.10 | 2.23 | 0.80 |
| Set (first 100 rows) | 362018982 | 65256465 | 31732535 | 81956005 | 11.41 | 4.42 | 2.06 | 0.80 |
| Set (first 1000 rows) | 332827597 | 68338569 | 29457307 | 81333649 | 11.30 | 4.09 | 2.32 | 0.84 |

kniren commented 7 years ago

After studying the problem I've settled on a solution that is almost as performant as the experimental one but doesn't change the overall structure of the code significantly, so it should be as easy to maintain as before. Essentially, the main sticking point for performance was that storing the elements of a Series as []Element requires a significant amount of interface building and many small allocations, and the memory was not contiguous, which kept the garbage collector busy.

Changing the main data structure of Series to Elements allows contiguous memory to back the concrete implementations of the different types. Furthermore, to avoid memory fragmentation and the creation of many small pointers, the concrete implementation of the Element interface for the four types has been changed so that it no longer stores a pointer to the value:

type intElement struct {
    e *int
}

and now instead stores the value inside the structure, along with an additional flag describing whether the element is NaN:

type intElement struct {
    e   int
    nan bool
}
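The storage side of this change can be sketched as follows (names are illustrative and do not reproduce gota's exact `Elements` definition): each type gets a concrete slice of value-holding structs, so all cells of a Series live in one contiguous allocation instead of many individually-allocated interface values.

```go
package main

import "fmt"

// intElement stores the value and a NaN flag directly, with no pointer,
// matching the struct change described above.
type intElement struct {
    e   int
    nan bool
}

// intElements is a contiguous backing store: one allocation holds every
// cell, which is friendly to both the cache and the garbage collector.
type intElements []intElement

func (es intElements) Len() int { return len(es) }

// Elements is a hypothetical minimal interface over the concrete slices.
type Elements interface{ Len() int }

func main() {
    // A single make() call allocates all 100000 cells at once.
    var es Elements = make(intElements, 100000)
    fmt.Println(es.Len()) // 100000
}
```

With []Element, each cell was a separate interface value (and formerly a separate pointed-to int), so the GC had to track hundreds of thousands of tiny objects per Series; with a concrete slice it tracks one.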

With these changes we observe a significant speedup with respect to the unoptimised version, and we start to get closer to the performance achieved by Pandas/R:

benchmark                                          old ns/op     new ns/op     delta
BenchmarkSeries_New/[]bool(100000)_Int-4           8313278       4980892       -40.09%
BenchmarkSeries_New/[]bool(100000)_String-4        9298998       5540847       -40.41%
BenchmarkSeries_New/[]bool(100000)_Bool-4          6366129       3843971       -39.62%
BenchmarkSeries_New/[]bool(100000)_Float-4         8377768       5009936       -40.20%
BenchmarkSeries_New/[]string(100000)_Int-4         18303166      13586413      -25.77%
BenchmarkSeries_New/[]string(100000)_String-4      11978478      8307875       -30.64%
BenchmarkSeries_New/[]string(100000)_Bool-4        28817480      26654172      -7.51%
BenchmarkSeries_New/[]string(100000)_Float-4       30509838      27914009      -8.51%
BenchmarkSeries_New/[]float64(100000)_Int-4        7813046       5002564       -35.97%
BenchmarkSeries_New/[]float64(100000)_String-4     52620115      47544694      -9.65%
BenchmarkSeries_New/[]float64(100000)_Bool-4       8122474       4490235       -44.72%
BenchmarkSeries_New/[]float64(100000)_Float-4      7538372       5033752       -33.22%
BenchmarkSeries_New/[]int(100000)_Int-4            7531595       5063625       -32.77%
BenchmarkSeries_New/[]int(100000)_String-4         19670643      16772118      -14.74%
BenchmarkSeries_New/[]int(100000)_Bool-4           7826898       4548778       -41.88%
BenchmarkSeries_New/[]int(100000)_Float-4          7589070       4962150       -34.61%
BenchmarkSeries_Copy/[]int(100000)_Int-4           3830732       780120        -79.64%
BenchmarkSeries_Copy/[]int(100000)_String-4        5847982       3473473       -40.60%
BenchmarkSeries_Copy/[]int(100000)_Bool-4          1780169       101940        -94.27%
BenchmarkSeries_Copy/[]int(100000)_Float-4         3756647       775425        -79.36%
BenchmarkSeries_Subset/[]int(100000)_Int-4         402341        205062        -49.03%
BenchmarkSeries_Subset/[]int(100000)_String-4      576282        281180        -51.21%
BenchmarkSeries_Subset/[]int(100000)_Bool-4        215667        152118        -29.47%
BenchmarkSeries_Subset/[]int(100000)_Float-4       405079        210860        -47.95%
BenchmarkSeries_Set/[]int(100000)_Int-4            623275        535365        -14.10%
BenchmarkSeries_Set/[]int(100000)_String-4         814689        519182        -36.27%
BenchmarkSeries_Set/[]int(100000)_Bool-4           546948        537051        -1.81%
BenchmarkSeries_Set/[]int(100000)_Float-4          630201        511646        -18.81%

| AVG (10 runs) | Gota experimental 6cb756f (Go 1.7.3) [ns] | dplyr (R 3.3.0) [ns] | Pandas (Python 3.4.2) [ns] | Gota 6cb756f / R | Gota 6cb756f / Pandas |
|---|---:|---:|---:|---:|---:|
| Load Data | 3550186270 | 5177613974 | 867149115 | 0.69 | 4.09 |
| Filtering (1) | 60191373 | 5290219 | 4029682 | 11.38 | 14.94 |
| Filtering (2) | 76414594 | 8933675 | 3119330 | 8.55 | 24.50 |
| Filtering (3) | 75358728 | 13318014 | 6498120 | 5.66 | 11.60 |
| Arrange (1 col) | 686606598 | 119221401 | 123551798 | 5.76 | 5.56 |
| Arrange (2 col) | 358894188 | 75816894 | 67885780 | 4.73 | 5.29 |
| Arrange (3 col) | 431656827 | 91589761 | 79083681 | 4.71 | 5.46 |
| Select (1) | 3657266 | 218021 | 1894200 | 16.77 | 1.93 |
| Select (2) | 7200508 | 233776 | 2974939 | 30.80 | 2.42 |
| Select (3) | 9918440 | 249350 | 3899350 | 39.78 | 2.54 |
| Subset (first 10 rows) | 11094 | 1104372 | 143039 | 0.01 | 0.08 |
| Subset (first 100 rows) | 56498 | 1128201 | 161719 | 0.05 | 0.35 |
| Subset (first 1000 rows) | 498362 | 1259873 | 144601 | 0.40 | 3.45 |
| Set (first 10 rows) | 73870225 | 28951330 | 81281099 | 2.55 | 0.91 |
| Set (first 100 rows) | 69103630 | 31732535 | 81956005 | 2.18 | 0.84 |
| Set (first 1000 rows) | 82565295 | 29457307 | 81333649 | 2.80 | 1.02 |

This has been merged into dev as of 6cb756f.