mcaceresb / stata-gtools

Faster implementation of Stata's collapse, reshape, xtile, egen, isid, and more using C plugins
https://gtools.readthedocs.io
MIT License

Error with more than 2^31-1 observations #43

Open mcaceresb opened 6 years ago

mcaceresb commented 6 years ago

A bug in Stata causes gtools to exit with error if the user has more than 2^31-1 observations in memory. See this bug report.

I contacted StataCorp about it and they replied:

> The SPI can work with datasets containing up to 2^31-1 observations. Our development group is looking into modifying future versions of the SPI to allow more observations.

mcaceresb commented 6 years ago

Re-open because technically it hasn't been fixed. It just throws the correct error now.

miserton commented 2 years ago

Any update on this bug - is it correctable now?

mcaceresb commented 2 years ago

@miserton Not afaik. I would contact StataCorp to ask about updates to this point, since I cannot fix it until they update the SPI. Sorry!

miserton commented 2 years ago

I reached out to StataCorp:

> The stplugin.h has been frozen so that we do not make any changes that could break commands that are still using this code. For 64-bit integer support we recommend working with Java or Python code instead of the older 32-bit C plugin limits.

Unless gtools could call the C code through Java or Python, it doesn't look like the 2.1B limitation will be overcome.

mcaceresb commented 2 years ago

@miserton That's a bit of a strange answer, since changing stplugin.h in particular would only involve adding a long long type (and on my system long is 64-bit anyway). I always thought the issue was the internals, and that changing the internal functions to take 64-bit input was difficult for some reason. But I don't really know.

In any case, you are right this is unlikely to be resolved any time soon, if ever.

wbuchanan commented 4 months ago

@miserton can you share who provided that answer to you? @mcaceresb do you have any interest in porting things over to Java or any idea how much different the performance would be in Java compared to C? Also, do you know if C allows overloading function signatures (thinking that might make it even easier since multiple dispatch would prevent breaking any existing code while allowing the same function names to be used with updated types).

mcaceresb commented 4 months ago

@wbuchanan Porting to Java would take quite a bit of effort (partly since I'm not particularly familiar with Java; I'm not sure if there would be a simple thin wrapper to execute C code from Java, but I expect it would still take some work).

The problem is that I cannot modify observations past 2^31-1. The only solution I can think of given the current limitations would be to create multiple separate datasets and append them at the end from Stata.

miserton commented 4 months ago

> @miserton can you share who provided that answer to you? @mcaceresb do you have any interest in porting things over to Java or any idea how much different the performance would be in Java compared to C? Also, do you know if C allows overloading function signatures (thinking that might make it even easier since multiple dispatch would prevent breaking any existing code while allowing the same function names to be used with updated types).

I got the answer from Stata technical support (tech-support@stata.com).

wbuchanan commented 4 months ago

@mcaceresb that is what I tried the most recent time and there were still too many observations. There is something called the Java Native Interface, but I've not used that at all. However, even though it's been a while I'm still pretty familiar with the Java API for Stata. The only other thing that might be feasible would be trying to figure out how to implement things in a Cythonic way and using the Python API for the entry point, but I'm not sure how much performance degradation that would cause.

mcaceresb commented 4 months ago

@wbuchanan

> that is what I tried the most recent time and there were still too many observations.

Wait, if k is the maximum number of variables that match a given stub, you can reshape long by looping ceil(_N * k / (2^31-1)) chunks of roughly equal size, no?

wbuchanan commented 4 months ago

Turns out there was a bug in part of the code that was intended to reduce N by a decent amount. That said, given the size of N and the size of k it would almost definitely be more efficient to port that part of the process to a different language that wouldn't have the restriction on the size of N. (It is a fairly large data set with several hundred variables that need to go from wide to long).

miserton commented 4 months ago

> Turns out there was a bug in part of the code that was intended to reduce N by a decent amount. That said, given the size of N and the size of k it would almost definitely be more efficient to port that part of the process to a different language that wouldn't have the restriction on the size of N. (It is a fairly large data set with several hundred variables that need to go from wide to long).

I was grappling with a similar issue when I ran into the observation limit, needing to reshape a very large dataset, and greshape was the only thing I thought might work. What I ultimately did was save the data as a CSV and then use some Bash code to write the equivalent of a reshape. I was going from long to wide in that example, but you can do the same thing going from wide to long:

https://www.statalist.org/forums/forum/general-stata-discussion/general/1689382-how-to-identify-duplicate-combinations-of-observations-in-long-form-without-reshaping

mcaceresb commented 4 months ago

@wbuchanan I had a reshape like this: 8 stubs, 144 variables per stub (monthly data over 12 years), a bit under 100M observations. The main thing in my favor was that a TON of the wide data was missing, so the long version was considerably smaller than the theoretical max of ~14B (having to do this is why I added the dropmiss option).

Anyway, I processed it by chunk without much issue, but you're right it should be much slower than doing it at once from another language. This suggests R's data.table is extremely fast, but it also shows pandas might give enough of a speed boost while having an easier Stata interface.

wbuchanan commented 4 months ago

@mcaceresb I think there might be an older version of gtools in the FSRDCs computing environment since there isn't anything mentioning dropmiss in the documentation. That said, by chunking columns and dropping records missing all of the variables to be reshaped manually prior to the reshape I was able to get things to work well. If you wouldn't mind sending the current version of gtools to one of the FSRDC staff I can get you connected; they have an SOP that requires the author of a package to send the source to them.

mcaceresb commented 4 months ago

Feel free to email me and I can send them the code in whatever format they like. @wbuchanan