kniren opened this issue 8 years ago
I ran some benchmarks using some of the base methods that can be used for operating on DataFrames. The dataset used was nycflights13, which consists of 336,776 rows and 20 columns. The benchmark code was the following:
```go
package dataframe_test

import (
	"os"
	"testing"

	"github.com/kniren/gota/dataframe"
)

func BenchmarkAll(b *testing.B) {
	// Load the nycflights13 dataset once; every sub-benchmark reuses it.
	f, err := os.Open("data/nycflights13.csv")
	if err != nil {
		b.Fatal(err)
	}
	defer f.Close()
	flights := dataframe.ReadCSV(f)

	b.Run("Filter_1", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Filter(dataframe.F{"month", "==", 1})
		}
	})
	b.Run("Filter_2", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Filter(dataframe.F{"month", "==", 1}).
				Filter(dataframe.F{"day", "==", 1})
		}
	})
	b.Run("Filter_3", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Filter(dataframe.F{"month", "==", 1}).
				Filter(dataframe.F{"day", "==", 1}).
				Filter(dataframe.F{"hour", "<=", 12})
		}
	})
	b.Run("Arrange_1", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Arrange(
				dataframe.Order{"arr_delay", true},
			)
		}
	})
	b.Run("Arrange_2", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Arrange(
				dataframe.Order{"year", false},
				dataframe.Order{"month", false},
			)
		}
	})
	b.Run("Arrange_3", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Arrange(
				dataframe.Order{"year", false},
				dataframe.Order{"month", false},
				dataframe.Order{"day", false},
			)
		}
	})
	b.Run("Select_1", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Select([]string{"year"})
		}
	})
	b.Run("Select_2", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Select([]string{"year", "month"})
		}
	})
	b.Run("Select_3", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Select([]string{"year", "month", "day"})
		}
	})

	// Index slices covering the first 10, 100, and 1000 rows for Subset/Set.
	var index10, index100, index1000 []int
	for i := 0; i < 10; i++ {
		index10 = append(index10, i)
	}
	for i := 0; i < 100; i++ {
		index100 = append(index100, i)
	}
	for i := 0; i < 1000; i++ {
		index1000 = append(index1000, i)
	}
	sub10 := flights.Subset(index10)
	sub100 := flights.Subset(index100)
	sub1000 := flights.Subset(index1000)

	b.Run("Subset_10", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Subset(index10)
		}
	})
	b.Run("Subset_100", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Subset(index100)
		}
	})
	b.Run("Subset_1000", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Subset(index1000)
		}
	})
	b.Run("Set_10", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Set(index10, sub10)
		}
	})
	b.Run("Set_100", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Set(index100, sub100)
		}
	})
	b.Run("Set_1000", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			flights.Set(index1000, sub1000)
		}
	})
}
```
The two most obvious improvements suggested by the CPU profile were, first, that a lot of small memory allocations were taking place, which impacted the GC very heavily, and second, that a lot of reflection was being performed whenever new `Series` were copied or created. To address the first issue, slices are now pre-allocated up front instead of grown incrementally; for the second, type assertions now check the given values before resorting to reflection, which is used only as a fallback when the given types are not recognised.
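To illustrate the second change, here is a minimal sketch of the pattern, using a hypothetical conversion helper (this is not gota's actual code): cheap type assertions handle the common concrete types with a single pre-allocation, and only unrecognised types fall through to the `reflect` package.

```go
package main

import (
	"fmt"
	"reflect"
)

// toIntSlice is a hypothetical sketch of the optimisation: known concrete
// slice types take a fast path with no reflection, everything else falls
// back to generic handling through the reflect package.
func toIntSlice(values interface{}) []int {
	switch v := values.(type) {
	case []int:
		// Fast path: one allocation, plain copy, no reflection.
		out := make([]int, len(v))
		copy(out, v)
		return out
	case []float64:
		out := make([]int, len(v))
		for i, f := range v {
			out[i] = int(f)
		}
		return out
	default:
		// Slow path: reflection (assumes an integer-kind slice for brevity).
		rv := reflect.ValueOf(values)
		out := make([]int, rv.Len())
		for i := 0; i < rv.Len(); i++ {
			out[i] = int(rv.Index(i).Int())
		}
		return out
	}
}

func main() {
	fmt.Println(toIntSlice([]int{1, 2, 3}))   // fast path
	fmt.Println(toIntSlice([]int64{4, 5, 6})) // reflection fallback
}
```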
With these two changes, already present as of ffcfe13, we can observe a significant speedup on the same dataset:
```
benchmark                     old ns/op     new ns/op     delta
BenchmarkAll/Filter_1-4       278527790     119058640     -57.25%
BenchmarkAll/Filter_2-4       272421441     125048692     -54.10%
BenchmarkAll/Filter_3-4       280313297     126590535     -54.84%
BenchmarkAll/Arrange_1-4      3044940770    1342395569    -55.91%
BenchmarkAll/Arrange_2-4      4140007971    1527648847    -63.10%
BenchmarkAll/Arrange_3-4      6168188016    2275641266    -63.11%
BenchmarkAll/Select_1-4       48459199      14510337      -70.06%
BenchmarkAll/Select_2-4       92780819      28908778      -68.84%
BenchmarkAll/Select_3-4       132650059     44706709      -66.30%
BenchmarkAll/Subset_10-4      75320         30194         -59.91%
BenchmarkAll/Subset_100-4     419068        220374        -47.41%
BenchmarkAll/Subset_1000-4    3323734       2029517       -38.94%
BenchmarkAll/Set_10-4         1955485629    656503486     -66.43%
BenchmarkAll/Set_100-4        1889303947    638403492     -66.21%
BenchmarkAll/Set_1000-4       1979352270    678069517     -65.74%
```
Further optimization is still on the cards, by improving the algorithms, reducing memory allocations, or using concurrency for some processes. For the latter, I've done some experiments on the `dataframe.New` function, using goroutines to perform the copy of the given `Series`. This results in a performance increase across the board, but I haven't included it in the current `dev` branch just yet, since I want to make sure that using this kind of pattern won't affect users trying to make their own code concurrent.
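The experiment looks roughly like the following sketch (the helper name and signature here are assumptions, not the code on `dev`; `series.Series` and its `Copy` method are the library's own):

```go
package dataframe

import (
	"sync"

	"github.com/kniren/gota/series"
)

// copySeriesConcurrently is a sketch of the experiment: copy each incoming
// Series in its own goroutine instead of sequentially.
func copySeriesConcurrently(ss []series.Series) []series.Series {
	out := make([]series.Series, len(ss))
	var wg sync.WaitGroup
	for i := range ss {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes only to its own index, so no locking
			// is needed and there are no data races on out.
			out[i] = ss[i].Copy()
		}(i)
	}
	wg.Wait()
	return out
}
```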
As of c9b8f464475e206370fdf8e7ce2b0f77ec0542f1 I've optimized the `DataFrame.Arrange` algorithm for a good increase in speed when sorting by more than one column:
```
benchmark                     old ns/op     new ns/op     delta
BenchmarkAll/Arrange_1-4      1342395569    1422965316    +6.00%
BenchmarkAll/Arrange_2-4      1527648847    888996432     -41.81%
BenchmarkAll/Arrange_3-4      2275641266    1025177764    -54.95%
```
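For intuition, one common way to get this kind of multi-column speedup is to sort a row-index permutation once with a lexicographic comparator over the requested columns, rather than running one stable sort per column. The sketch below illustrates that idea; it is not necessarily the exact change made in the commit, and it uses plain `int` columns for brevity:

```go
package main

import (
	"fmt"
	"sort"
)

// byColumns sorts a permutation of row indices lexicographically over
// several columns, from highest to lowest sort priority.
type byColumns struct {
	idx  []int   // row permutation being sorted
	cols [][]int // column data, in sort-priority order
}

func (b byColumns) Len() int      { return len(b.idx) }
func (b byColumns) Swap(i, j int) { b.idx[i], b.idx[j] = b.idx[j], b.idx[i] }
func (b byColumns) Less(i, j int) bool {
	for _, col := range b.cols {
		a, c := col[b.idx[i]], col[b.idx[j]]
		if a != c {
			return a < c
		}
		// Equal in this column: fall through to the next one.
	}
	return false
}

func main() {
	year := []int{2013, 2013, 2012, 2013}
	month := []int{5, 1, 12, 1}
	idx := []int{0, 1, 2, 3}
	sort.Stable(byColumns{idx, [][]int{year, month}})
	fmt.Println(idx) // [2 1 3 0]: sorted by year, then month
}
```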
Additionally, `DataFrame.Set` and `Series.Set` now perform the setting in place to avoid copying the entire structure. If one needs to make sure the original `DataFrame` is not modified, one can resort to `DataFrame.Copy().Set(...)`.
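For example, reusing the `flights`, `index10`, and `sub10` values from the benchmark code above:

```go
// Set now mutates the receiver in place; flights itself is modified.
flights.Set(index10, sub10)

// To leave the original DataFrame untouched, copy first and Set on the copy.
updated := flights.Copy().Set(index10, sub10)
_ = updated // flights remains unchanged here
```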
I've been exploring the performance of Gota on the same dataset, comparing it against Pandas and R/dplyr. The results are quite illuminating: for the majority of cases, Gota is slower than both other systems. The bottleneck seems to be that, due to the data structures used, the large number of small memory allocations hits the garbage collector pretty heavily, especially when copying the data (which, given the nature of the library, happens quite often). When modifying the `series` package to use different structures where the data is stored contiguously, we can observe a significant speedup on the majority of the benchmarks. Unfortunately, in the current experimental optimisation the readability and maintainability of the code is poor, so I will refrain from merging the modifications until I study whether there is a better way of handling this problem.
AVG (10 runs) | Gota 255f427 (Go v1.7.3) [ns] | Gota experimental (1a6c1aa641d0) (Go v1.7.3) [ns] | dplyr (R v3.3.0) [ns] | Pandas (Python v3.4.2) [ns] | Ratio (255f427 / dplyr) | Ratio (255f427 / Pandas) | Ratio (experimental / dplyr) | Ratio (experimental / Pandas)
---|---|---|---|---|---|---|---|---
Load Data | 3796092757 | 2792890627 | 5177613974 | 867149115 | 0.73 | 4.38 | 0.54 | 3.22
Filtering (1) | 87780355 | 120904356 | 5290219 | 4029682 | 16.59 | 21.78 | 22.85 | 30.00
Filtering (2) | 94769640 | 124575783 | 8933675 | 3119330 | 10.61 | 30.38 | 13.94 | 39.94
Filtering (3) | 100542430 | 134784916 | 13318014 | 6498120 | 7.55 | 15.47 | 10.12 | 20.74
Arrange (1 col) | 1128079537 | 591918392 | 119221401 | 123551798 | 9.46 | 9.13 | 4.96 | 4.79
Arrange (2 col) | 587693288 | 305231446 | 75816894 | 67885780 | 7.75 | 8.66 | 4.03 | 4.50
Arrange (3 col) | 794083891 | 350509373 | 91589761 | 79083681 | 8.67 | 10.04 | 3.83 | 4.43
Select (1) | 15037054 | 2306528 | 218021 | 1894200 | 68.97 | 7.94 | 10.58 | 1.22
Select (2) | 30581870 | 4202700 | 233776 | 2974939 | 130.82 | 10.28 | 17.98 | 1.41
Select (3) | 45824883 | 4734341 | 249350 | 3899350 | 183.78 | 11.75 | 18.99 | 1.21
Subset (First 10 rows) | 16326 | 12913 | 1104372 | 143039 | 0.015 | 0.11 | 0.012 | 0.09
Subset (First 100 rows) | 121928 | 76094 | 1128201 | 161719 | 0.11 | 0.75 | 0.067 | 0.47
Subset (First 1000 rows) | 1118375 | 677757 | 1259873 | 144601 | 0.89 | 7.73 | 0.54 | 4.69
Set (First 10 rows) | 333080426 | 64648001 | 28951330 | 81281099 | 11.50 | 4.10 | 2.23 | 0.80
Set (First 100 rows) | 362018982 | 65256465 | 31732535 | 81956005 | 11.41 | 4.42 | 2.06 | 0.80
Set (First 1000 rows) | 332827597 | 68338569 | 29457307 | 81333649 | 11.30 | 4.09 | 2.32 | 0.84
After studying the problem I've settled on a solution that is almost as performant as the experimental one but doesn't change the overall structure of the code significantly, so it should be as easy to maintain as before. Essentially, the main sticking point for performance was that storing the elements of a `Series` as `[]Element` requires a significant amount of interface building and many small allocations, and the memory does not end up laid out consecutively, all of which keeps the garbage collector busy. Changing the main data structure of `Series` to `Elements` allows the concrete implementations for the different types to live in contiguous memory. Furthermore, to avoid memory fragmentation and the creation of many small pointers, the concrete implementation of the `Element` interface for the four types has been changed so that it no longer stores a pointer to the value:
```go
type intElement struct {
	e *int
}
```
Instead, it now stores the data inside the structure, along with an additional flag describing whether the element is `NaN`:
```go
type intElement struct {
	e   int
	nan bool
}
```
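To make the memory layout concrete, here is an illustrative sketch of that direction; the `Elements` interface below is a simplified stand-in, not the library's exact definition:

```go
package main

import "fmt"

// intElement stores its value inline, plus a NaN flag (as described above).
type intElement struct {
	e   int
	nan bool
}

// intElements is the concrete backing store: all elements of a column sit in
// one contiguous slice, so a single allocation backs the whole Series and the
// garbage collector sees no per-element pointers.
type intElements []intElement

// Elements is an illustrative stand-in for the abstraction a Series holds.
type Elements interface {
	Len() int
}

func (e intElements) Len() int { return len(e) }

func main() {
	var col Elements = make(intElements, 3) // one allocation for all elements
	fmt.Println(col.Len())                  // 3
}
```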
With these changes we observe a significant speedup with respect to the unoptimised version, and we start to get closer to the performance achieved by Pandas/R:
```
benchmark                                         old ns/op    new ns/op    delta
BenchmarkSeries_New/[]bool(100000)_Int-4          8313278      4980892      -40.09%
BenchmarkSeries_New/[]bool(100000)_String-4       9298998      5540847      -40.41%
BenchmarkSeries_New/[]bool(100000)_Bool-4         6366129      3843971      -39.62%
BenchmarkSeries_New/[]bool(100000)_Float-4        8377768      5009936      -40.20%
BenchmarkSeries_New/[]string(100000)_Int-4        18303166     13586413     -25.77%
BenchmarkSeries_New/[]string(100000)_String-4     11978478     8307875      -30.64%
BenchmarkSeries_New/[]string(100000)_Bool-4       28817480     26654172     -7.51%
BenchmarkSeries_New/[]string(100000)_Float-4      30509838     27914009     -8.51%
BenchmarkSeries_New/[]float64(100000)_Int-4       7813046      5002564      -35.97%
BenchmarkSeries_New/[]float64(100000)_String-4    52620115     47544694     -9.65%
BenchmarkSeries_New/[]float64(100000)_Bool-4      8122474      4490235      -44.72%
BenchmarkSeries_New/[]float64(100000)_Float-4     7538372      5033752      -33.22%
BenchmarkSeries_New/[]int(100000)_Int-4           7531595      5063625      -32.77%
BenchmarkSeries_New/[]int(100000)_String-4        19670643     16772118     -14.74%
BenchmarkSeries_New/[]int(100000)_Bool-4          7826898      4548778      -41.88%
BenchmarkSeries_New/[]int(100000)_Float-4         7589070      4962150      -34.61%
BenchmarkSeries_Copy/[]int(100000)_Int-4          3830732      780120       -79.64%
BenchmarkSeries_Copy/[]int(100000)_String-4       5847982      3473473      -40.60%
BenchmarkSeries_Copy/[]int(100000)_Bool-4         1780169      101940       -94.27%
BenchmarkSeries_Copy/[]int(100000)_Float-4        3756647      775425       -79.36%
BenchmarkSeries_Subset/[]int(100000)_Int-4        402341       205062       -49.03%
BenchmarkSeries_Subset/[]int(100000)_String-4     576282       281180       -51.21%
BenchmarkSeries_Subset/[]int(100000)_Bool-4       215667       152118       -29.47%
BenchmarkSeries_Subset/[]int(100000)_Float-4      405079       210860       -47.95%
BenchmarkSeries_Set/[]int(100000)_Int-4           623275       535365       -14.10%
BenchmarkSeries_Set/[]int(100000)_String-4        814689       519182       -36.27%
BenchmarkSeries_Set/[]int(100000)_Bool-4          546948       537051       -1.81%
BenchmarkSeries_Set/[]int(100000)_Float-4         630201       511646       -18.81%
```
AVG (10 runs) | Gota experimental (6cb756f) (Go v1.7.3) [ns] | dplyr (R v3.3.0) [ns] | Pandas (Python v3.4.2) [ns] | Ratio (6cb756f / dplyr) | Ratio (6cb756f / Pandas)
---|---|---|---|---|---
Load Data | 3550186270 | 5177613974 | 867149115 | 0.69 | 4.09
Filtering (1) | 60191373 | 5290219 | 4029682 | 11.38 | 14.94
Filtering (2) | 76414594 | 8933675 | 3119330 | 8.55 | 24.50
Filtering (3) | 75358728 | 13318014 | 6498120 | 5.66 | 11.60
Arrange (1 col) | 686606598 | 119221401 | 123551798 | 5.76 | 5.56
Arrange (2 col) | 358894188 | 75816894 | 67885780 | 4.73 | 5.29
Arrange (3 col) | 431656827 | 91589761 | 79083681 | 4.71 | 5.46
Select (1) | 3657266 | 218021 | 1894200 | 16.77 | 1.93
Select (2) | 7200508 | 233776 | 2974939 | 30.80 | 2.42
Select (3) | 9918440 | 249350 | 3899350 | 39.78 | 2.54
Subset (First 10 rows) | 11094 | 1104372 | 143039 | 0.010 | 0.078
Subset (First 100 rows) | 56498 | 1128201 | 161719 | 0.050 | 0.35
Subset (First 1000 rows) | 498362 | 1259873 | 144601 | 0.40 | 3.45
Set (First 10 rows) | 73870225 | 28951330 | 81281099 | 2.55 | 0.91
Set (First 100 rows) | 69103630 | 31732535 | 81956005 | 2.18 | 0.84
Set (First 1000 rows) | 82565295 | 29457307 | 81333649 | 2.80 | 1.02
This has been merged into `dev` as of 6cb756f.
So far there has not been a study of the performance of this library in terms of speed and memory consumption. I'm currently prioritising the features that impact users directly, since the API design is still in flux, but this should be addressed in the near future.