mcaceresb / stata-gtools

Faster implementation of Stata's collapse, reshape, xtile, egen, isid, and more using C plugins
https://gtools.readthedocs.io
MIT License
182 stars 38 forks source link

issue specifying data type in gegen group #34

Closed ractingmatrix closed 6 years ago

ractingmatrix commented 6 years ago

Hi, in the following code, egen group and gegen group give different results. egen group seems to automatically resort to a larger data type when the number of groups exceed the limit of the specified data type.

clear
set obs 1000
g x = _n
egen byte group = group(x)
gegen byte group2 = group(x)
mcaceresb commented 6 years ago

Right. I put that in place in case no type was specified, but I suppose if the user wants to specify a different type then I should allow for that. I'll fix it over the weekend. Thanks!

mcaceresb commented 6 years ago

Actually, I had misunderstood this comment. Due to the way the plugin API works, there's no way to do this efficiently, and even then it's not obvious to me what the default behavior should be. For now, I'll just upgrade the type to something that is for sure safe. egen does not always do this, by the way. Consider

clear
set obs `=2^24 + 10'
g long x = _n
egen group = group(x)
format %21.0gc x group
l in `=_N - 10' / `=_N'
qui gdistinct x group
matrix list r(distinct)

And you can see that egen does not adequately type group so there aren't enough levels.

mcaceresb commented 6 years ago

I asked about this inconsistency in egen here, if you want to read more about it. For now, I'll stick to upgrading variable types.

sergiocorreia commented 6 years ago

Hi Mauricio,

Just to be clear, do you plan in upgrading the user-specified type in order to fit the number of levels?

I would argue that we should follow two rules, in order of importance:

  1. As much as possible, upgrade type to avoid losing information. In particular, if group() returns incorrect results, that would be both hard to detect and would create serious problems down the line (e.g. if I get the mean by group, but I used float as default so now I treat two groups as one).
  2. If possible, downgrade type to save memory. So if the user doesn't specify the type of the new variable, and it fits in -byte-, then use byte and not -float- or anything else.

Also, even if you don't like the ideas above, I would argue that the default type for egen group should never be float, because it is always dominated by -long-:

clear all
set obs `=3e7'
gen long i = _n // long supports up to 2bn

gegen long g1 = group(i)
cap noi assert i == g1 // works

gegen g2 = group(i)
cap noi assert i == g2 // fails b/c g2 is float

gegen byte g3 = group(i)
cap noi assert i == g3 // fails b/c g3 is byte
mcaceresb commented 6 years ago

"gegen, group" now forces a type if there might be loss of information. In other words, I follow the first rule but not always the second because, as I mention, the second is not possible to always implement efficiently.

I think you're using an old version of gtools for your example. In gtools 0.12.5 I don't get the issues you mention, and gegen byte g3 = group(i) you should see the message "(warning: user-requested type 'byte' upgraded to 'long')" (unless I didn't upload to github correctly, in which case do let me know).

sergiocorreia commented 6 years ago

Cool! The first rule is the key one, because we can always call compress for the 2nd one!!