Closed mdroste closed 5 years ago
This is a bit faster; no need for gdistinct
and fasterxtile
, right?
tempvar N
gegen rankTrack = group(x), counts(`N')
gen rankField = `r(N)' - rankTrack - `N' + 2
gen rankDefault = rankTrack + `N' / 2 - 0.5
I'll think about a neat way of adding a function for it.
That's a neat idea (and way more parsimonious), but I don't think it works if x has ties (i.e. more than one observation for any given value). Notice that rankTrack above runs from 1 to the number of distinct values of x, but it should (approximately, due to ties) run from 1 to N. The rest of ranks then get messed up, since they're derived from rankTrack. Another way of saying this is that the highest ranked value(s) should be given a track rank of 1, and all other values should have a track rank = the number of observations smaller than a given value, but the code above won't account for ties. The code I had somewhat opaquely accounts for ties with use of xtile combined with group. In the case of no ties, the code you have above should work, but a lot of real-world data has ties. But I agree that we could use group(x), counts() to omit the need for gdistinct, though!
Right, I misread the code. I thought gdistinct
counted the number of unique values, but it doesn't. Not sure why it's necessary, then. It's just counting non-missing, right? But yes, my code doesn't work as you point out.
Up in develop. Weights are allowed. Try gtools, upgrade branch(develop)
clear all
capture program drop bench
program bench
gettoken timer call: 0, p(:)
gettoken colon call: call, p(:)
cap timer clear `timer'
timer on `timer'
`call'
timer off `timer'
qui timer list
c_local r`timer' `=r(t`timer')'
end
clear
set obs 10000000
gen x = ceil(runiform() * 10000)
gen g = round(_n / 100)
bench 1: egen double rankx_def1 = rank(x)
bench 2: gegen double rankx_def2 = rank(x)
bench 3: egen rankx_track1 = rank(x), track
bench 4: gegen rankx_track2 = rank(x), ties(track)
bench 5: egen rankx_field1 = rank(x), field
bench 6: gegen rankx_field2 = rank(x), ties(field)
bench 7: egen long rankx_uniq1 = rank(x), uniq
bench 8: gegen long rankx_uniq2 = rank(x), ties(uniq)
gegen rankx_uniq3 = rank(x), ties(stable)
bench 11: egen double rankx_group_def1 = rank(x), by(g)
bench 12: gegen double rankx_group_def2 = rank(x), by(g)
bench 13: egen rankx_group_track1 = rank(x), by(g) track
bench 14: gegen rankx_group_track2 = rank(x), by(g) ties(track)
bench 15: egen rankx_group_field1 = rank(x), by(g) field
bench 16: gegen rankx_group_field2 = rank(x), by(g) ties(field)
bench 17: egen long rankx_group_uniq1 = rank(x), by(g) uniq
bench 18: gegen long rankx_group_uniq2 = rank(x), by(g) ties(uniq)
gegen rankx_group_uniq3 = rank(x), by(g) ties(stable)
assert (rankx_def1 == rankx_def2)
assert (rankx_track1 == rankx_track2)
assert (rankx_field1 == rankx_field2)
sort x, stable
assert rankx_uniq3 == _n
gisid rankx_uniq1
gisid rankx_uniq2
assert (rankx_group_def1 == rankx_group_def2)
assert (rankx_group_track1 == rankx_group_track2)
assert (rankx_group_field1 == rankx_group_field2)
cap drop ix
sort g x, stable
by g: gen long ix = _n
assert rankx_group_uniq3 == ix
gisid g rankx_group_uniq1
gisid g rankx_group_uniq2
local bench_table `" Versus | Native | gtools | % faster "'
local bench_table `"`bench_table'"' _n(1) `" ---------- | ------ | ------ | -------- "'
local commands default track field unique
forvalues i = 1(2)7 {
gettoken cmd commands: commands
local pct "`:disp %7.2f 100 * (`r`i'' - `r`=`i'+1'') / `r`i'''"
local dnative "`:disp %6.2f `r`i'''"
local dgtools "`:disp %6.2f `r`=`i'+1'''"
local cmd `"`:disp %10s "`cmd'"'"'
local bench_table `"`bench_table'"' _n(1) `" `cmd' | `dnative' | `dgtools' | `pct'% "'
}
local bench_table `"`bench_table'"' _n(1) `" ---------- | ------ | ------ | -------- "'
local bench_table `"`bench_table'"' _n(1) `" by group "'
local bench_table `"`bench_table'"' _n(1) `" ---------- | ------ | ------ | -------- "'
local commands default track field unique
forvalues i = 11(2)17 {
gettoken cmd commands: commands
local pct "`:disp %7.2f 100 * (`r`i'' - `r`=`i'+1'') / `r`i'''"
local dnative "`:disp %6.2f `r`i'''"
local dgtools "`:disp %6.2f `r`=`i'+1'''"
local cmd `"`:disp %10s "`cmd'"'"'
local bench_table `"`bench_table'"' _n(1) `" `cmd' | `dnative' | `dgtools' | `pct'% "'
}
disp _n(1) `"`bench_table'"'
Result
Versus | Native | gtools | % faster
---------- | ------ | ------ | --------
default | 59.56 | 2.87 | 95.18%
track | 57.33 | 2.55 | 95.55%
field | 59.04 | 2.50 | 95.76%
unique | 56.48 | 2.71 | 95.20%
---------- | ------ | ------ | --------
by group
---------- | ------ | ------ | --------
default | 37.07 | 2.01 | 94.58%
track | 36.67 | 2.11 | 94.24%
field | 36.08 | 2.14 | 94.07%
unique | 40.80 | 2.37 | 94.18%
Amazing! Thanks so much, Mauricio.
Hey Mauricio,
I realized that we can replicate -egen, rank()- with a function that makes calls to -gdistinct-, -gegen, count()-, and -fasterxtile- pretty easily. -egen, rank()- has four different ways to compute ranks depending on the options. The code snippet below replicates three of those, with the exception of -egen, rank() unique-.
Hope this is useful!