mcaceresb / stata-gtools

Faster implementation of Stata's collapse, reshape, xtile, egen, isid, and more using C plugins
https://gtools.readthedocs.io
MIT License

-gcollapse- recasts ints to doubles #44

Closed jorpppp closed 6 years ago

jorpppp commented 6 years ago

Native Stata -collapse- does this too

sysuse auto, clear
des rep
collapse (sum) rep, by(foreign)
des rep
compress

-fcollapse- keeps integers as is

sysuse auto, clear
des rep
fcollapse (sum) rep, by(foreign)
des rep

-gcollapse- replicates the behavior of the native Stata -collapse-

sysuse auto, clear
des rep
gcollapse (sum) rep, by(foreign)
des rep
compress

On large datasets, the extra time taken by -compress- at the end to convert the variables back to int may reduce the speed gains of -gcollapse- vs -fcollapse-. Is there any way to replicate the behavior of -fcollapse- in this setting?

mcaceresb commented 6 years ago

This is by design. fcollapse is smarter than gcollapse in assigning small variable types while preventing overflow. I do not think this is feasible for me, however, because I have to create the typed variable before doing any operations. This is the smartest I could be:

These are relatively small Ns, so I didn't think to check. Past that, I would risk overflows because I don't know the result until after I compute the sum. An int can hold integers up to 2^15 - 28 = 32,740, so past 65,592 obs the sum of an int variable could overflow a long, which tops out at 2^31 - 28 = 2,147,483,620.
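
The bounds quoted above can be checked from Stata's stored system limits. This is only a sketch of the arithmetic, not the rule gcollapse applies internally; it assumes the worst case, in which every observation of an int variable equals the largest storable int.

```stata
* Largest values Stata can store in each integer type
display %15.0fc c(maxbyte)
display %15.0fc c(maxint)
display %15.0fc c(maxlong)

* Worst case (an assumption): every observation equals c(maxint).
* Largest number of observations whose sum is still guaranteed to fit
* in a long; one more observation could overflow.
display %15.0fc floor(c(maxlong) / c(maxint))
```

The last line displays 65,592, which matches the threshold mentioned in the comment.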

mcaceresb commented 6 years ago

Not automatic, but perhaps faster than compress: I could add an option so that instead of assuming the worst possible theoretical case I check the range (or maybe even the sum, since I'd be looping over the variable anyway) and then decide the type based on that. Good compromise? It would still involve some overhead but maybe not as much. Let me know.
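
A rough sketch of the "check the sum, then decide the type" idea, written in plain Stata rather than in the plugin. The variable name grpsum and the local names maxabs and type are hypothetical, and the real option may pick types differently; this only illustrates choosing the smallest integer type that holds the largest group sum.

```stata
sysuse auto, clear

* Compute the group sums up front (hypothetical helper variable)
egen double grpsum = total(rep78), by(foreign)
summarize grpsum, meanonly
local maxabs = max(abs(r(min)), abs(r(max)))

* Pick the smallest integer type whose upper limit covers the largest
* sum in absolute value; fall back to double if nothing fits
local type double
if (`maxabs' <= c(maxlong)) local type long
if (`maxabs' <= c(maxint))  local type int
if (`maxabs' <= c(maxbyte)) local type byte

display "smallest safe type: `type'"
```

Comparing against the upper limits only is slightly conservative for negative sums, which keeps the sketch simple at the cost of occasionally choosing a wider type than strictly needed.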

jorpppp commented 6 years ago

This should definitely be faster than -compress-, but it just occurred to me that if the user has a sense of the format the variable should have after collapsing, then -recast- may be a faster option than -compress-. I don't know if this would be faster or slower than this compromise solution.
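
A minimal sketch of the -recast- route, assuming gtools is installed and that the user already knows an int is wide enough for the collapsed sums. No timings are claimed here.

```stata
sysuse auto, clear
gcollapse (sum) rep78, by(foreign)
describe rep78

* Without the force option, recast leaves the variable unchanged if any
* value would be altered, so a wrong guess about the type loses no data
recast int rep78
describe rep78
```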

On Tue, Oct 23, 2018 at 6:15 PM Mauricio Caceres Bravo <notifications@github.com> wrote:

> Not smart but perhaps faster than compress, I could add an option so that instead of assuming the worst possible theoretical case I check the range (or maybe even the sum, since I'd be looping over the variable anyway) and then decide the type based on that. Good compromise? It would still involve some overhead but maybe not as much. Let me know.


-- Jorge Pérez Pérez Ph.D. in Economics, Brown University. www.jorgeperezperez.com

mcaceresb commented 6 years ago

This is in the develop branch; will hit master when I get a chance to run the tests. You should be able to try it out via

gtools, upgrade branch(develop)

Add option sumcheck to gcollapse
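
A possible way to try the new option, assuming a gtools build that includes the commit above. The option name sumcheck is taken from the commit message; the resulting storage type is not verified here, so the example only inspects it with -describe-.

```stata
sysuse auto, clear

* Ask gcollapse to check the sums before choosing the output type
gcollapse (sum) rep78, by(foreign) sumcheck
describe rep78
```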

jorpppp commented 6 years ago

Great, thanks.

On Sat, Oct 27, 2018 at 10:03 PM Mauricio Caceres Bravo <notifications@github.com> wrote:

> This is in the develop branch; will hit master when I get a chance to run the tests. You should be able to test it via
>
> gtools, upgrade branch(develop)



mcaceresb commented 6 years ago

Note to self: This is still a problem with weights, where a sum might overflow but it is not detected.