jeffgortmaker / pyblp

BLP Demand Estimation with Python
https://pyblp.readthedocs.io
MIT License

Memory issues with pyblp.differentiation_instruments #141

Closed jakob-ditfurth closed 12 months ago

jakob-ditfurth commented 1 year ago

Hi Jeff,

I am running into an issue with the memory usage of pyblp.differentiation_instruments. The dataset is about 40GB; when I try to create the differentiation instruments, memory usage climbs to 500GB (all I allocated) and then the job crashes because it needs more.

Is there a rule of thumb for how much memory I need given a dataset's size? Or would you suggest splitting the dataset and running an array job over the markets?

Best, Jakob

jeffgortmaker commented 1 year ago

Differentiation instruments are functions of J_t x J_t matrices, where J_t is the number of products in market t. If J_t is large for any market, this will use a lot of memory.

The function isn't very complicated, so it's worth taking a look at how it's done. It already works market-by-market, but it computes instruments for all characteristics at once, and I believe it holds multiple J_t x J_t matrices in memory at a time for each characteristic.

There's probably a way to write your own custom function, based on the above, that holds only one or two J_t x J_t matrices in memory at a time, probably at the cost of more CPU time. In that case, the rule of thumb is how much memory it takes to store a single J_t x J_t matrix. With 64-bit precision, each float takes 8 bytes. Since 500GB is 536,870,912,000 bytes, you can store 67,108,864,000 numbers, the square root of which gives a maximum of around 259,054 products per market, if I did my math right.
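For illustration, here's a minimal sketch of that idea using a simple "sum of squared differences" instrument for each characteristic (the function name, the exact instrument definition, and the column names are my assumptions here, not the package's actual API), holding only one J_t x J_t matrix in memory at a time:

```python
import numpy as np
import pandas as pd

def low_memory_quadratic_instruments(product_data, characteristics, market_col='market_ids'):
    """Sketch of market-by-market instrument construction that holds only
    one J_t x J_t matrix in memory at a time.

    product_data: pandas DataFrame with one row per product.
    characteristics: list of characteristic column names.
    """
    instruments = np.zeros((len(product_data), len(characteristics)))
    for _, rows in product_data.groupby(market_col).indices.items():
        for k, characteristic in enumerate(characteristics):
            x = product_data[characteristic].to_numpy()[rows]

            # The only J_t x J_t allocation: pairwise differences for this
            # market and this characteristic. It's released before the next
            # one is built.
            differences = x[:, None] - x[None, :]

            # A simple "quadratic" instrument: each product's sum of squared
            # differences from the other products in its market (the own
            # difference is zero, so including it is harmless).
            instruments[rows, k] = (differences ** 2).sum(axis=1)

    return instruments
```

You could cut memory further by computing each product's instrument row-by-row instead of materializing the full difference matrix, at the cost of a slower Python-level loop over products.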

In general, I can't think of many instances in which you want that many products per market, so if you write your own low-memory code, you should be fine with 500GB.

jakob-ditfurth commented 1 year ago

Hi Jeff,

Thanks for your quick response. That is odd; I double-checked, and the largest market has 12,000 products. I wonder if it has anything to do with reticulate from R. I don't have the same problem when I run the same code on the same data in Python directly.
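For reference, a single 12,000 x 12,000 matrix of 64-bit floats should only be around a gigabyte, so even several of them per characteristic shouldn't come close to 500GB:

```python
J_t = 12_000
print(J_t ** 2 * 8 / 1024 ** 3)  # ~1.07 GiB per 64-bit J_t x J_t matrix
```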

I will try to test some more, but in any case, I can run it in Python.

Thanks! Jakob

jeffgortmaker commented 1 year ago

To be clear, the rule of thumb above is for the “best case” scenario in which only one big matrix is stored at a time. The code currently doesn’t do this, so I would expect it to do much worse in terms of memory (not CPU) than this “best case.”

But if you’re not getting the problem with raw Python, then yeah, it's probably something to do with reticulate. If you figure out the source of the problem, let me know!

jeffgortmaker commented 12 months ago

I'm going to close this issue for now, but please do let me know if you learn anything more about what's going on with reticulate.