nalimilan / FreqTables.jl

Frequency tables in Julia
Other
90 stars 19 forks source link

Timings compared to new grouping code meant for DataFrames #3

Open tshort opened 8 years ago

tshort commented 8 years ago

I was intrigued by the use of ht_keyindex, so I compared timings of freqtable to some new DataFrames code I'm experimenting with (see https://github.com/JuliaStats/DataFrames.jl/issues/894). Feel free to close this issue; I just thought you might like to see the timings. freqtable is allocating quite a bit.

using DataFrames,DataFramesMeta
using FreqTables
n=10_000_000
y=[string("id",i) for i in rand(1:10,n)];
x=rand(1:10,n);
@time pda=PooledDataArray(y,UInt8);
d=DataFrame(x=P(x),y=P(y),pda=pda);

Here are timings from FreqTables:

freqtable(x) : 15.6 secs freqtable(y) : 16.7 secs freqtable(pda) : 1.5 secs freqtable(x,pda) : 26 secs

The equivalent timings from DataFramesMeta:

freqtable(x) : 0.36 secs freqtable(y) : 8.3 secs freqtable(pda) : 0.38 secs freqtable(x,pda) : 2.4 secs

Here is an edited transcript:

julia> using DataFrames,DataFramesMeta

julia> using FreqTables

julia> n=10_000_000
10000000

julia> y=[string("id",i) for i in rand(1:10,n)];

julia> x=rand(1:10,n);

julia> @time pda=PooledDataArray(y,UInt8);
  5.779360 seconds (20.19 M allocations: 401.625 MB, 18.34% gc time)

julia> @time f=freqtable(x)
 15.603074 seconds (120.68 M allocations: 2.264 GB, 21.57% gc time)
10-element NamedArrays.NamedArray{Int64,1,Array{Int64,1},Tuple{Dict{Int64,Int64}}}
1  999111
2  998631
3  1000254
4  1000311
5  1000433
6  1000310
7  1001538
8  999610
9  1000517
10 999285

julia> @time f=freqtable(y)
 16.711632 seconds (130.15 M allocations: 2.391 GB, 17.37% gc time)
10-element NamedArrays.NamedArray{Int64,1,Array{Int64,1},Tuple{Dict{ByteString,Int64}}}
id1  1001500
id10 1001339
id2  998962
id3  1000494
id4  998943
id5  997610
id6  1001378
id7  999609
id8  999185
id9  1000980

julia> @time f=freqtable(pda)
  1.533855 seconds (20.27 M allocations: 317.998 MB, 31.14% gc time)
10-element NamedArrays.NamedArray{Int64,1,Array{Int64,1},Tuple{Dict{ByteString,Int64}}}
id1  1001500
id10 1001339
id2  998962
id3  1000494
id4  998943
id5  997610
id6  1001378
id7  999609
id8  999185
id9  1000980

julia> @time f=freqtable(x,pda)
 26.057288 seconds (170.19 M allocations: 3.886 GB, 11.94% gc time)
10x10 NamedArrays.NamedArray{Int64,2,Array{Int64,2},Tuple{Dict{Int64,Int64},Dict{ByteString,Int64}}}
Dim1 \ Dim2 id1    id10   id2    id3    …  id6    id7    id8    id9
1           100063 100237 99591  99190  …  100529 100125 99834  100022
2           100004 99640  99586  100140 …  100252 99735  100073 99410
3           99753  100040 100119 100105 …  100021 100308 99679  100057
4           100907 99607  100086 99991  …  99869  100125 100177 100274
5           100523 100327 100114 100446 …  100679 99593  99815  99843
6           100028 100773 100367 100116 …  99554  100015 99684  99877
7           100194 100236 99505  100534 …  100135 100268 100320 100515
8           100069 100080 99976  99531  …  99368  99635  100247 100916
9           99941  100095 100321 100263 …  100257 99765  99773  100261
10          100018 100304 99297  100178 …  100714 100040 99583  99805

julia> d=DataFrame(x=P(x),y=P(y),pda=pda);

julia> @time DataFramesMeta.@aggregate(DataFramesMeta.groupby(d,:x),count=length(:pda))
  0.361674 seconds (577 allocations: 85.866 MB, 56.09% gc time)
10x2 DataFrames.DataFrame
| Row | x  | count   |
|-----|----|---------|
| 1   | 1  | 999111  |
| 2   | 2  | 998631  |
| 3   | 3  | 1000254 |
| 4   | 4  | 1000311 |
| 5   | 5  | 1000433 |
| 6   | 6  | 1000310 |
| 7   | 7  | 1001538 |
| 8   | 8  | 999610  |
| 9   | 9  | 1000517 |
| 10  | 10 | 999285  |

julia> @time DataFramesMeta.@aggregate(DataFramesMeta.groupby(d,:y),count=length(:pda))
  8.267301 seconds (20.22 M allocations: 517.192 MB, 42.02% gc time)
10x2 DataFrames.DataFrame
| Row | y      | count   |
|-----|--------|---------|
| 1   | "id1"  | 1001500 |
| 2   | "id10" | 1001339 |
| 3   | "id2"  | 998962  |
| 4   | "id3"  | 1000494 |
| 5   | "id4"  | 998943  |
| 6   | "id5"  | 997610  |
| 7   | "id6"  | 1001378 |
| 8   | "id7"  | 999609  |
| 9   | "id8"  | 999185  |
| 10  | "id9"  | 1000980 |

julia> @time DataFramesMeta.@aggregate(DataFramesMeta.groupby(d,:pda),count=length(:pda))
  0.377314 seconds (580 allocations: 85.866 MB, 59.59% gc time)
10x2 DataFrames.DataFrame
| Row | pda    | count   |
|-----|--------|---------|
| 1   | "id1"  | 1001500 |
| 2   | "id10" | 1001339 |
| 3   | "id2"  | 998962  |
| 4   | "id3"  | 1000494 |
| 5   | "id4"  | 998943  |
| 6   | "id5"  | 997610  |
| 7   | "id6"  | 1001378 |
| 8   | "id7"  | 999609  |
| 9   | "id8"  | 999185  |
| 10  | "id9"  | 1000980 |

julia> @time r=DataFramesMeta.@aggregate(DataFramesMeta.groupby(d,[:x,:pda]),count=length(:y));
  2.363423 seconds (9.95 M allocations: 380.801 MB, 12.60% gc time)

julia> unstack(r,:pda,:count)
10x11 DataFrames.DataFrame
| Row | x  | id1    | id10   | id2    | id3    | id4    | id5    | id6    |
|-----|----|--------|--------|--------|--------|--------|--------|--------|
| 1   | 1  | 100063 | 100237 | 99591  | 99190  | 99871  | 99649  | 100529 |
| 2   | 2  | 100004 | 99640  | 99586  | 100140 | 100149 | 99642  | 100252 |
| 3   | 3  | 99753  | 100040 | 100119 | 100105 | 100163 | 100009 | 100021 |
| 4   | 4  | 100907 | 99607  | 100086 | 99991  | 99778  | 99497  | 99869  |
| 5   | 5  | 100523 | 100327 | 100114 | 100446 | 99635  | 99458  | 100679 |
| 6   | 6  | 100028 | 100773 | 100367 | 100116 | 100118 | 99778  | 99554  |
| 7   | 7  | 100194 | 100236 | 99505  | 100534 | 99528  | 100303 | 100135 |
| 8   | 8  | 100069 | 100080 | 99976  | 99531  | 99618  | 100170 | 99368  |
| 9   | 9  | 99941  | 100095 | 100321 | 100263 | 100006 | 99835  | 100257 |
| 10  | 10 | 100018 | 100304 | 99297  | 100178 | 100077 | 99269  | 100714 |

| Row | id7    | id8    | id9    |
|-----|--------|--------|--------|
| 1   | 100125 | 99834  | 100022 |
| 2   | 99735  | 100073 | 99410  |
| 3   | 100308 | 99679  | 100057 |
| 4   | 100125 | 100177 | 100274 |
| 5   | 99593  | 99815  | 99843  |
| 6   | 100015 | 99684  | 99877  |
| 7   | 100268 | 100320 | 100515 |
| 8   | 99635  | 100247 | 100916 |
| 9   | 99765  | 99773  | 100261 |
| 10  | 100040 | 99583  | 99805  |
tshort commented 8 years ago

Inspired by your use of ht_keyindex, I wrote a constructor for PooledDataArrays that's more than twice as fast as the normal constructor. It does it by doing less dict lookups as your code does. See here for my code:

https://github.com/JuliaStats/DataFramesMeta.jl/blob/ts/grouping/src/df-replacements.jl#L136-L166

In particular, it helped to separate the code that loops through the vector into its own function. That helped Julia figure out types.

nalimilan commented 8 years ago

Interesting. I find the very slow timings quite surprising, as I remember doing some careful optimization when I wrote the code. I wonder whether I recently introduced a type instability when porting to 0.4, for which I hacked a workaround for the one-dimensional table case. I'll have a deeper look next week or so.

nalimilan commented 8 years ago

I've had a look at this, and indeed my code suffered from type instability. Not sure how I missed that. Anyway, now the timings are much better on git master, and even faster than DataFramesMeta for the PDA case (almost no allocations!). Though some cases are still slower, I need to investigate why.

Anyway, I have some code here to use DataFramesMeta when a DataFrame is passed to freqtable. This limits code duplication a lot, and will make it easier to support any kind of data source (including SQL databases). I'll push this when it's ready.

Times are for a second run, on 0.5. To easy copy/paste, the gist is here: https://gist.github.com/nalimilan/905624dd5f44b4c020d57c16fcaab498

julia> using DataFrames,DataFramesMeta, FreqTables

julia> n=1000_000
1000000

julia> y=ASCIIString[string("id",i) for i in rand(1:10,n)];

julia> x=rand(1:10,n);

julia> @time pda=PooledDataArray(y,UInt8);
  0.445467 seconds (999.53 k allocations: 24.075 MB)

julia> @time f=freqtable(x);
  0.033819 seconds (81 allocations: 5.047 KB)

julia> @time f=freqtable(y);
  0.207490 seconds (2.00 M allocations: 45.783 MB)

julia> @time f=freqtable(pda);
  0.003743 seconds (47 allocations: 3.016 KB)

julia> @time f=freqtable(x, pda);
  2.345581 seconds (4.00 M allocations: 91.574 MB, 48.86% gc time)

julia> d=DataFrame(x=P(x),y=P(y),pda=pda);

julia> @time @by(d, :x, N=length(:x));
  0.268315 seconds (1.01 M allocations: 57.985 MB, 15.64% gc time)

julia> @time @by(d, :y, N=length(:x));
  0.520084 seconds (1.01 M allocations: 57.986 MB, 9.09% gc time)

julia> @time @by(d, :pda, N=length(:x));
  0.077855 seconds (12.55 k allocations: 25.328 MB, 7.45% gc time)

julia> @time @by(d, (:x, :pda), N=length(:x));
  1.034521 seconds (4.01 M allocations: 98.190 MB, 7.37% gc time)

UPDATE: With new fixes for the general case, the timings are now always better than DataFramesMeta, except when crossing a PDA with an array. This use case isn't the most interesting IMHO, though I could try doing something about it.