mcaceresb / stata-gtools

Faster implementation of Stata's collapse, reshape, xtile, egen, isid, and more using C plugins
https://gtools.readthedocs.io
MIT License

Wrong number of groups #90

Open m-elholm opened 1 year ago

m-elholm commented 1 year ago

Describe the bug: I think there is an error in gunique and also in gegen xx = nunique. In a sample of 35 million observations, it does not count the number of unique values correctly. When I generate a variable x = _n, there should be 35 million unique values, but it counts only about 25 million.

// code snippet
gen x = _n 
gunique x

Version info

mcaceresb commented 1 year ago

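@m-elholm I think the more likely explanation is that you've run into the limits of 4-byte floats (see the generating IDs section here). Unless you have changed the default with set type, gen x = _n stores x as a float, and a float cannot represent every integer above 16,777,216 (2^24), so neighboring values of _n get rounded to the same number. This snippet shows that gunique is working correctly and that the problem is x itself, which has repeated values: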

. clear

. set obs 35000000
Number of observations (_N) was 0, now 35,000,000.

. gen x = _n

. gen long y = _n

. gen double z = _n

. gunique x
N = 35,000,000; 25,527,216 unbalanced groups of sizes 1 to 5

. gunique y
N = 35,000,000; 35,000,000 balanced groups of size 1

. gunique z
N = 35,000,000; 35,000,000 balanced groups of size 1

. format %21.0fc x y z

. l in `=_N-4'/l

          +--------------------------------------+
          |          x            y            z |
          |--------------------------------------|
34999996. | 34,999,996   34,999,996   34,999,996 |
34999997. | 34,999,996   34,999,997   34,999,997 |
34999998. | 35,000,000   34,999,998   34,999,998 |
34999999. | 35,000,000   34,999,999   34,999,999 |
35000000. | 35,000,000   35,000,000   35,000,000 |
          +--------------------------------------+
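
The rounding in the listing can be checked directly. This is a minimal sketch, assuming Stata's built-in float() function (which rounds its argument to float precision) and an arbitrary %12.0fc display format. The last line is my own back-of-the-envelope count, not something gtools computes: every integer up to 2^24 is distinct, only even values survive between 2^24 and 2^25, and only multiples of 4 survive above 2^25.

```stata
* Up to 2^24 = 16,777,216 every integer is stored exactly as a float
display %12.0fc float(16777216)

* Just above 2^24 only even integers are representable,
* so 16,777,217 is rounded back to 16,777,216
display %12.0fc float(16777217)

* Above 2^25 only multiples of 4 are representable,
* so 34,999,998 is rounded to 35,000,000 (as in the listing above)
display %12.0fc float(34999998)

* Distinct float values among 1..35,000,000:
* 2^24 exact values + 2^23 even values + multiples of 4 above 2^25
display %12.0fc 2^24 + 2^23 + (35000000 - 2^25)/4
```

The last expression evaluates to 25,527,216, the number of groups gunique reports for x.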

One solution is to generate such variables with the type `c(obs_t)', which contains the smallest data type that can store _n (and will change as the number of observations in your data changes).
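
As a rough sketch of that suggestion (the variable name id is just a placeholder, and the 35 million observations mirror the example above):

```stata
clear
set obs 35000000

* `c(obs_t)' expands to the smallest storage type that can hold _n
* for the current number of observations
display "`c(obs_t)'"

* Use it as the storage type so every observation number is stored exactly
gen `c(obs_t)' id = _n

* Should now report one group per observation, as for y above
gunique id
```

Declaring the type explicitly (gen long or gen double, as in the log above) works just as well. The macro only saves you from picking the type by hand when the number of observations changes.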