jorpppp closed this issue 6 years ago
This is by design. fcollapse is smarter than gcollapse in assigning small variable types while preventing overflow. I do not think this is feasible for me, however, because I have to create the typed variable before doing any operations. This is the smartest I could be: int as the target type for N up to 327, and long as the target type for N up to 65,592. These are relatively small Ns, so I didn't think to check. Past that, I would risk overflows because I don't know the result until after I compute the sum. An int can hold integers up to 2^15 - 28 = 32,740, so past 65,592 obs the sum of an int variable could overflow a long.
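The thresholds above can be reproduced by dividing the largest value the target type can store by the largest value the source type can store. This is only a worked check, assuming Stata's documented storage limits (100 for byte, 2^15 - 28 for int, 2^31 - 28 for long) and assuming the 327 figure refers to a byte source, which the comment does not state explicitly:

```stata
* Largest N such that N copies of the source type's maximum
* still fit in the target type.
* Assumed limits: byte max = 100, int max = 2^15 - 28, long max = 2^31 - 28.

* byte source, int target
display floor((2^15 - 28) / 100)

* int source, long target
display floor((2^31 - 28) / (2^15 - 28))
```

The two lines display 327 and 65592, which matches the limits quoted in the comment.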
Not automatic, but perhaps faster than compress: I could add an option so that instead of assuming the worst possible theoretical case I check the range (or maybe even the sum, since I'd be looping over the variable anyway) and then decide the type based on that. Good compromise? It would still involve some overhead but maybe not as much. Let me know.
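A rough do-file sketch of the range-check idea, for illustration only: this is not how gtools implements it internally, and it bounds the sum using the overall range and count rather than per-group counts. The local names `bound` and `target` are made up for the example.

```stata
sysuse auto, clear

* Hypothetical pre-check: the largest possible |sum| in any group is at
* most (largest absolute value) x (number of non-missing observations).
quietly summarize rep78, meanonly
local bound = max(abs(r(min)), abs(r(max))) * r(N)

* Pick the smallest integer type whose documented maximum covers the bound.
if (`bound' <= 32740) {
    local target int
}
else if (`bound' <= 2147483620) {
    local target long
}
else {
    local target double
}

display "worst-case |sum| <= `bound'; target type: `target'"
```

Checking the actual range instead of the type's theoretical maximum costs one extra pass over the variable, which is the overhead mentioned above.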
This should definitely be faster than -compress-, but it just occurred to me that if the user has a sense of the format the variable should have after collapsing, then -recast- may be a faster option than -compress-. I don't know if this would be faster or slower than this compromise solution.
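For reference, the -recast- route might look like the sketch below, assuming the user already knows that int is wide enough for the collapsed sums. It uses the full variable name rep78 from the auto dataset.

```stata
sysuse auto, clear
gcollapse (sum) rep78, by(foreign)
describe rep78

* recast only touches the named variable, whereas compress scans every
* variable in the dataset. Without the force option, recast leaves the
* variable unchanged if narrowing would alter any value.
recast int rep78
describe rep78
```

Whether this is actually faster than -compress- on large data would need timing; the thread does not report any.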
(Reply to Mauricio Caceres Bravo's comment of Tue, Oct 23, 2018, 6:15 PM; https://github.com/mcaceresb/stata-gtools/issues/44#issuecomment-432454989)
-- Jorge Pérez Pérez Ph.D. in Economics, Brown University. www.jorgeperezperez.com
This is in the develop branch; it will hit master when I get a chance to run the tests. You should be able to try it out via
gtools, upgrade branch(develop)
Add option sumcheck to gcollapse
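Going only by the option name above, usage would presumably look like the following. This assumes the develop branch is installed and that sumcheck is passed like any other gcollapse option; its exact behavior is not described in this thread.

```stata
sysuse auto, clear

* sumcheck is the option named in the commit above; the expectation
* (an assumption here) is that the sum's type is chosen after checking
* the data rather than from the worst theoretical case.
gcollapse (sum) rep78, by(foreign) sumcheck
describe rep78
```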
Great, thanks.
(Reply to Mauricio Caceres Bravo's comment of Sat, Oct 27, 2018, 10:03 PM; https://github.com/mcaceresb/stata-gtools/issues/44#issuecomment-433672328)
Note to self: This is still a problem with weights, where a sum might overflow but it is not detected.
Native Stata -collapse- does this too
    sysuse auto, clear
    des rep
    collapse (sum) rep, by(foreign)
    des rep
    compress
-fcollapse- keeps integers as is
    sysuse auto, clear
    des rep
    fcollapse (sum) rep, by(foreign)
    des rep
-gcollapse- replicates the behavior of the native Stata -collapse-
    sysuse auto, clear
    des rep
    gcollapse (sum) rep, by(foreign)
    des rep
    compress
On large datasets, the extra time taken by -compress- at the end to convert the variables back to int may reduce the speed gains of -gcollapse- vs -fcollapse-. Is there any way to replicate the behavior of -fcollapse- in this setting?