Closed ractingmatrix closed 6 years ago
Right. I put that in place in case no type was specified, but I suppose if the user wants to specify a different type then I should allow for that. I'll fix it over the weekend. Thanks!
Actually, I had misunderstood this comment. Due to the way the plugin API works, there's no way to do this efficiently, and even then it's not obvious to me what the default behavior should be. For now, I'll just upgrade the type to something that is for sure safe. egen
does not always do this, by the way. Consider
clear
set obs `=2^24 + 10'
g long x = _n
egen group = group(x)
format %21.0gc x group
l in `=_N - 10' / `=_N'
qui gdistinct x group
matrix list r(distinct)
And you can see that egen
does not adequately type group
so there aren't enough levels.
I asked about this inconsistency in egen here, if you want to read more about it. For now, I'll stick to upgrading variable types.
Hi Mauricio,
Just to be clear, do you plan in upgrading the user-specified type in order to fit the number of levels?
I would argue that we should follow two rules, in order of importance:
group()
returns incorrect results, that would be both hard to detect and would create serious problems down the line (e.g. if I get the mean by group, but I used float as default so now I treat two groups as one).Also, even if you don't like the ideas above, I would argue that the default type for egen group should never be float, because it is always dominated by -long-:
clear all
set obs `=3e7'
gen long i = _n // long supports up to 2bn
gegen long g1 = group(i)
cap noi assert i == g1 // works
gegen g2 = group(i)
cap noi assert i == g2 // fails b/c g2 is float
gegen byte g3 = group(i)
cap noi assert i == g3 // fails b/c g3 is byte
"gegen, group" now forces a type if there might be loss of information. In other words, I follow the first rule but not always the second because, as I mention, the second is not possible to always implement efficiently.
I think you're using an old version of gtools for your example. In gtools 0.12.5 I don't get the issues you mention, and gegen byte g3 = group(i)
you should see the message "(warning: user-requested type 'byte' upgraded to 'long')" (unless I didn't upload to github correctly, in which case do let me know).
Cool! The first rule is the key one, because we can always call compress for the 2nd one!!
Hi, in the following code, egen group and gegen group give different results. egen group seems to automatically resort to a larger data type when the number of groups exceed the limit of the specified data type.