mcaceresb / stata-gtools

Faster implementation of Stata's collapse, reshape, xtile, egen, isid, and more using C plugins
https://gtools.readthedocs.io
MIT License
182 stars 38 forks source link

Implementing -gegen, rank()- #67

Closed mdroste closed 5 years ago

mdroste commented 5 years ago

Hey Mauricio,

I realized that we can replicate -egen, rank()- with a function that makes calls to -gdistinct-, -gegen, count()-, and -fasterxtile- pretty easily. -egen, rank()- has four different ways to compute ranks depending on the options. The code snippet below replicates three of those, with the exception of -egen, rank() unique-.

* Create simulated data
clear all
set obs 10000000
gen x = ceil(runiform()*10000)
tempfile data
save `data'

*---------------------------------------------
* egen rank
*---------------------------------------------

* Load simulated data
use `data', clear 

* With egen
timer on 1
egen rank_x = rank(x)
timer off 1

* With gtools
timer on 2
tempvar t1 t2 t3
gen `t1' = x
gdistinct x
fasterxtile `t2' = x, nq(`r(N)') 
gegen `t3' = count(x), by(`t1')
gen rank2_x = `t2' + `t3'/2 - 0.5
timer off 2

* Validate
gen same = rank_x==rank2_x
sum

*---------------------------------------------
* egen rank, track
*---------------------------------------------

* Load simulated data
use `data', clear 

* With egen
timer on 3
egen rank_x = rank(x), track
timer off 3

* With gtools 
timer on 4
tempvar t1 t2 t3
gen `t1' = x
gdistinct x
local Nd = r(ndistinct)
fasterxtile `t2' = x, nq(`r(N)')
gen rank2_x = `t2' 
timer off 4

* Validate
gen same = rank_x==rank2_x
sum 

*---------------------------------------------
* egen rank, field
*---------------------------------------------

* Load simulated data
use `data', clear 

* With egen
timer on 5
egen rank_x = rank(x), field
timer off 5

* With gtools
timer on 6
tempvar t1 t2 t3
gen `t1' = x
gdistinct x
local N = r(N)
fasterxtile `t2' = x, nq(`N') 
gegen `t3' = count(x), by(`t1')
gen rank2_x = `N' - `t2' - `t3' + 2
timer off 6

* Validate they produce same results
gen same = rank_x==rank2_x
sum 

*---------------------------------------------
* Display relative speeds
*---------------------------------------------

* Display benchmark speeds
timer list
timer clear

Hope this is useful!

mcaceresb commented 5 years ago

This is a bit faster; no need for gdistinct and fasterxtile, right?

tempvar N
gegen rankTrack = group(x), counts(`N')
gen rankField = `r(N)' - rankTrack - `N' + 2
gen rankDefault = rankTrack + `N' / 2 - 0.5

I'll think about a neat way of adding a function for it.

mdroste commented 5 years ago

That's a neat idea (and way more parsimonious), but I don't think it works if x has ties (i.e. more than one observation for any given value). Notice that rankTrack above runs from 1 to the number of distinct values of x, but it should (approximately, due to ties) run from 1 to N. The rest of ranks then get messed up, since they're derived from rankTrack. Another way of saying this is that the highest ranked value(s) should be given a track rank of 1, and all other values should have a track rank = the number of observations smaller than a given value, but the code above won't account for ties. The code I had somewhat opaquely accounts for ties with use of xtile combined with group. In the case of no ties, the code you have above should work, but a lot of real-world data has ties. But I agree that we could use group(x), counts() to omit the need for gdistinct, though!

mcaceresb commented 5 years ago

Right, I misread the code. I thought gdistinct counted the number of unique values, but it doesn't. Not sure why it's necessary, then. It's just counting non-missing, right? But yes, my code doesn't work as you point out.

mcaceresb commented 5 years ago

Up in develop. Weights are allowed. Try gtools, upgrade branch(develop)

clear all

capture program drop bench
program bench
    gettoken timer call: 0,    p(:)
    gettoken colon call: call, p(:)
    cap timer clear `timer'
    timer on `timer'
    `call'
    timer off `timer'
    qui timer list
    c_local r`timer' `=r(t`timer')'
end

clear
set obs 10000000
gen x = ceil(runiform() * 10000)
gen g = round(_n / 100)

bench 1:   egen double rankx_def1 = rank(x)
bench 2:  gegen double rankx_def2 = rank(x)

bench 3:   egen rankx_track1 = rank(x), track
bench 4:  gegen rankx_track2 = rank(x), ties(track)

bench 5:   egen rankx_field1 = rank(x), field
bench 6:  gegen rankx_field2 = rank(x), ties(field)

bench 7:   egen long rankx_uniq1 = rank(x), uniq
bench 8:  gegen long rankx_uniq2 = rank(x), ties(uniq)

gegen rankx_uniq3 = rank(x), ties(stable)

bench 11:  egen double rankx_group_def1 = rank(x), by(g)
bench 12: gegen double rankx_group_def2 = rank(x), by(g)

bench 13:  egen rankx_group_track1 = rank(x), by(g) track
bench 14: gegen rankx_group_track2 = rank(x), by(g) ties(track)

bench 15:  egen rankx_group_field1 = rank(x), by(g) field
bench 16: gegen rankx_group_field2 = rank(x), by(g) ties(field)

bench 17:  egen long rankx_group_uniq1 = rank(x), by(g) uniq
bench 18: gegen long rankx_group_uniq2 = rank(x), by(g) ties(uniq)

gegen rankx_group_uniq3 = rank(x), by(g) ties(stable)

assert (rankx_def1   == rankx_def2)
assert (rankx_track1 == rankx_track2)
assert (rankx_field1 == rankx_field2)

sort x, stable
assert rankx_uniq3 == _n

gisid rankx_uniq1
gisid rankx_uniq2

assert (rankx_group_def1   == rankx_group_def2)
assert (rankx_group_track1 == rankx_group_track2)
assert (rankx_group_field1 == rankx_group_field2)

cap drop ix
sort g x, stable
by g: gen long ix = _n
assert rankx_group_uniq3 == ix

gisid g rankx_group_uniq1
gisid g rankx_group_uniq2

local bench_table `"     Versus | Native | gtools | % faster "'
local bench_table `"`bench_table'"' _n(1) `" ---------- | ------ | ------ | -------- "'

local commands default track field unique
forvalues i = 1(2)7 {
    gettoken cmd commands: commands
    local pct      "`:disp %7.2f  100 * (`r`i'' - `r`=`i'+1'') / `r`i'''"
    local dnative  "`:disp %6.2f `r`i'''"
    local dgtools  "`:disp %6.2f `r`=`i'+1'''"
    local cmd      `"`:disp %10s "`cmd'"'"'
    local bench_table `"`bench_table'"' _n(1) `" `cmd' | `dnative' | `dgtools' | `pct'% "'
}

local bench_table `"`bench_table'"' _n(1) `" ---------- | ------ | ------ | -------- "'
local bench_table `"`bench_table'"' _n(1) `" by group                                "'
local bench_table `"`bench_table'"' _n(1) `" ---------- | ------ | ------ | -------- "'

local commands default track field unique
forvalues i = 11(2)17 {
    gettoken cmd commands: commands
    local pct      "`:disp %7.2f  100 * (`r`i'' - `r`=`i'+1'') / `r`i'''"
    local dnative  "`:disp %6.2f `r`i'''"
    local dgtools  "`:disp %6.2f `r`=`i'+1'''"
    local cmd      `"`:disp %10s "`cmd'"'"'
    local bench_table `"`bench_table'"' _n(1) `" `cmd' | `dnative' | `dgtools' | `pct'% "'
}
disp _n(1) `"`bench_table'"'

Result

     Versus | Native | gtools | % faster 
 ---------- | ------ | ------ | -------- 
    default |  59.56 |   2.87 |   95.18% 
      track |  57.33 |   2.55 |   95.55% 
      field |  59.04 |   2.50 |   95.76% 
     unique |  56.48 |   2.71 |   95.20% 
 ---------- | ------ | ------ | -------- 
 by group                                
 ---------- | ------ | ------ | -------- 
    default |  37.07 |   2.01 |   94.58% 
      track |  36.67 |   2.11 |   94.24% 
      field |  36.08 |   2.14 |   94.07% 
     unique |  40.80 |   2.37 |   94.18%
mdroste commented 5 years ago

Amazing! Thanks so much, Mauricio.