Open m-elholm opened 1 year ago
@m-elholm I think the more likely explanation is that you've run into the limits of 4-byte floats (see the generating IDs section here). gen creates a float by default when no type is given, and a float cannot store every integer this large exactly. This snippet shows that gunique is working correctly and that x, which has repeated values, is indeed the problem:
. clear
. set obs 35000000
Number of observations (_N) was 0, now 35,000,000.
. gen x = _n
. gen long y = _n
. gen double z = _n
. gunique x
N = 35,000,000; 25,527,216 unbalanced groups of sizes 1 to 5
. gunique y
N = 35,000,000; 35,000,000 balanced groups of size 1
. gunique z
N = 35,000,000; 35,000,000 balanced groups of size 1
. format %21.0fc x y z
. l in `=_N-4'/l
          +--------------------------------------+
          |          x            y            z |
          |--------------------------------------|
34999996. | 34,999,996   34,999,996   34,999,996 |
34999997. | 34,999,996   34,999,997   34,999,997 |
34999998. | 35,000,000   34,999,998   34,999,998 |
34999999. | 35,000,000   34,999,999   34,999,999 |
35000000. | 35,000,000   35,000,000   35,000,000 |
          +--------------------------------------+
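The duplicates follow from how a 4-byte float stores integers: it has a 24-bit significand, so every integer up to 2^24 = 16,777,216 is exact, but above that only every 2nd integer is representable, and past 2^25 = 33,554,432 only every 4th. A minimal sketch of this, using Stata's float() function to round a value to float precision (the specific numbers are just illustrative picks):

```stata
* 2^24 is stored exactly in a float
display %12.0fc float(16777216)
* 2^24 + 1 is not representable; it rounds back to 16,777,216
display %12.0fc float(16777217)
* rounds to 34,999,996, the value x holds in row 34999997 of the listing
display %12.0fc float(34999997)
```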
One solution is to give such variables the type `c(obs_t)', which contains the smallest data type that can store _n (and will change as the number of observations in your data changes).
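For example, a minimal sketch of the ID generation above with the type set this way (the variable name id is only illustrative). With 35 million observations `c(obs_t)' should expand to long, so gunique should then report one group per observation, as it does for y:

```stata
clear
set obs 35000000
* `c(obs_t)' expands to the smallest storage type that can hold _N
gen `c(obs_t)' id = _n
gunique id
```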
Describe the bug
I think there is an error in gunique and also in the gegen xx = nunique function. In a sample of 35 million observations, it does not count the number of unique values correctly. When I generate a variable x = _n, there should be 35 million unique values, but it only counts 25 million,
Version info
gtools