Open tshort opened 8 years ago
Inspired by your use of `ht_keyindex`, I wrote a constructor for PooledDataArrays that's more than twice as fast as the normal constructor. It does this by doing fewer dict lookups, as your code does. See here for my code:
https://github.com/JuliaStats/DataFramesMeta.jl/blob/ts/grouping/src/df-replacements.jl#L136-L166
In particular, it helped to separate the code that loops through the vector into its own function. That helped Julia figure out types.
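To illustrate the two ideas above (one dict lookup per element, and a separate function for the loop), here is a minimal sketch written for current Julia. It is not the code at the link; `fill_refs!` and `pool_codes` are hypothetical names made up for this example.

```julia
# Hypothetical kernel: the hot loop lives in its own function, so Julia
# compiles it for the concrete types of `refs`, `index`, `pool`, and `v`.
function fill_refs!(refs::Vector{R}, index::Dict{T,R}, pool::Vector{T},
                    v::AbstractVector{T}) where {T,R<:Integer}
    for (i, x) in enumerate(v)
        # One hash lookup per element: `get!` returns the existing code, or
        # stores and returns a new one the first time `x` is seen.
        refs[i] = get!(index, x) do
            push!(pool, x)
            R(length(pool))   # throws InexactError if the pool outgrows R
        end
    end
    return refs
end

# Hypothetical entry point: sets up the containers, then calls the kernel.
function pool_codes(v::AbstractVector{T}, ::Type{R}=UInt32) where {T,R<:Integer}
    refs = Vector{R}(undef, length(v))
    pool = T[]
    index = Dict{T,R}()
    fill_refs!(refs, index, pool, v)
    return refs, pool
end

refs, pool = pool_codes(["a", "b", "a", "c"], UInt8)
```

In this small example the caller is already type-stable, so the separate kernel mostly shows the pattern. It pays off when the caller holds values whose types the compiler cannot infer, since the call to the kernel then dispatches once on the concrete types.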
Interesting. I find the very slow timings quite surprising, as I remember doing some careful optimization when I wrote the code. I wonder whether I recently introduced a type instability when porting to 0.4, for which I hacked a workaround for the one-dimensional table case. I'll have a deeper look next week or so.
I've had a look at this, and indeed my code suffered from type instability. Not sure how I missed that. Anyway, now the timings are much better on git master, and even faster than DataFramesMeta for the PDA case (almost no allocations!). Some cases are still slower, though; I need to investigate why.
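For readers who want to check for this kind of problem, here is a small sketch of one way to catch a type instability on current Julia, using `@inferred` from the `Test` standard library. The functions `unstable` and `stable` are hypothetical examples, not code from FreqTables.

```julia
using Test

# Hypothetical example: the return type depends on the value of `x`,
# so inference can only conclude Union{Int, Float64}.
unstable(x) = x > 0 ? 1 : 1.0

# Both branches return a Float64, so the return type is inferred exactly.
stable(x) = x > 0 ? 1.0 : -1.0

# `@inferred` returns the value when the inferred type matches the actual one...
result = @inferred stable(2)

# ...and throws an error when inference is less precise than the actual type.
@test_throws ErrorException @inferred unstable(2)
```

`@code_warntype` at the REPL gives a more detailed view of where the imprecise types come from.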
Anyway, I have some code here to use DataFramesMeta when a `DataFrame` is passed to `freqtable`. This limits code duplication a lot, and will make it easier to support any kind of data source (including SQL databases). I'll push this when it's ready.
Times are for a second run, on Julia 0.5. For easy copy/paste, the gist is here: https://gist.github.com/nalimilan/905624dd5f44b4c020d57c16fcaab498
julia> using DataFrames,DataFramesMeta, FreqTables
julia> n=1000_000
1000000
julia> y=ASCIIString[string("id",i) for i in rand(1:10,n)];
julia> x=rand(1:10,n);
julia> @time pda=PooledDataArray(y,UInt8);
0.445467 seconds (999.53 k allocations: 24.075 MB)
julia> @time f=freqtable(x);
0.033819 seconds (81 allocations: 5.047 KB)
julia> @time f=freqtable(y);
0.207490 seconds (2.00 M allocations: 45.783 MB)
julia> @time f=freqtable(pda);
0.003743 seconds (47 allocations: 3.016 KB)
julia> @time f=freqtable(x, pda);
2.345581 seconds (4.00 M allocations: 91.574 MB, 48.86% gc time)
julia> d=DataFrame(x=P(x),y=P(y),pda=pda);
julia> @time @by(d, :x, N=length(:x));
0.268315 seconds (1.01 M allocations: 57.985 MB, 15.64% gc time)
julia> @time @by(d, :y, N=length(:x));
0.520084 seconds (1.01 M allocations: 57.986 MB, 9.09% gc time)
julia> @time @by(d, :pda, N=length(:x));
0.077855 seconds (12.55 k allocations: 25.328 MB, 7.45% gc time)
julia> @time @by(d, (:x, :pda), N=length(:x));
1.034521 seconds (4.01 M allocations: 98.190 MB, 7.37% gc time)
UPDATE: With new fixes for the general case, the timings are now always better than DataFramesMeta, except when crossing a PDA with an array. This use case isn't the most interesting IMHO, though I could try doing something about it.
I was intrigued by the use of `ht_keyindex`, so I compared timings of `freqtable` to some new DataFrames code I'm experimenting with (see https://github.com/JuliaStats/DataFrames.jl/issues/894). Feel free to close this issue; I just thought you might like to see the timings. `freqtable` is allocating quite a bit.

Here are timings from FreqTables:

- `freqtable(x)`: 15.6 secs
- `freqtable(y)`: 16.7 secs
- `freqtable(pda)`: 1.5 secs
- `freqtable(x,pda)`: 26 secs

The equivalent timings from DataFramesMeta:

- `freqtable(x)`: 0.36 secs
- `freqtable(y)`: 8.3 secs
- `freqtable(pda)`: 0.38 secs
- `freqtable(x,pda)`: 2.4 secs

Here is an edited transcript: