Closed: JohannaJoy closed this issue 1 year ago
Thanks for this!
I'm surprised that you're getting that error instead of `Variables in agent_data that contribute to demographics should not have NaNs or infinities.`
Getting the latter error is an unexpected side-effect of a recent fix to try to raise issues with NaNs etc in arrays. I've just reverted it for product-specific demographics in df7e67c01537f51b58421054e36a892773f41e8e. With the dev code, if you're still getting an error when using NaNs with product-specific demographics, mind posting a minimum working example with the full traceback? Would be really helpful.
I agree! Documentation is really lacking there. I added a warning and note following your advice in 974ae41d09d65c5f948f629ddbf3a8a1d1aa75fd. Let me know if you'd add anything else -- also happy to just accept pull requests if you want to add anything to the docs yourself that you think would be helpful. The note is about using `numpy.einsum`, which I've found is an easy way to define micro moment weights/values when dealing with multidimensional arrays. When I find the time, I'll try to add examples of how to do this in the micro moments tutorial.
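As a minimal numpy-only sketch of the einsum idea (the shapes, weights, and probabilities here are illustrative placeholders, not pyblp's API): a micro moment value like a weighted average over agents and products can be written as a single contraction.

```python
import numpy as np

# Hypothetical market shapes: I agents, J inside products.
I, J = 4, 3
rng = np.random.default_rng(0)

weights = rng.dirichlet(np.ones(I))            # agent weights w_i, summing to 1
probabilities = np.full((I, J), 1 / (J + 1))   # placeholder choice probabilities s_ij
demographics = rng.lognormal(size=(I, J))      # a product-specific demographic d_ij

# sum_i w_i * sum_j s_ij * d_ij as one einsum contraction:
value = np.einsum('i,ij,ij->', weights, probabilities, demographics)

# Equivalent explicit broadcasting, for comparison:
check = (weights[:, None] * probabilities * demographics).sum()
assert np.isclose(value, check)
```

The advantage over manual broadcasting is that the index string makes the summed dimensions explicit, which scales better once arrays gain a second-choice dimension.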
Thanks for your quick response!
```python
import pyblp
import numpy as np
import pandas as pd

product_data = pd.read_csv(pyblp.data.PETRIN_PRODUCTS_LOCATION)
agent_data = pd.read_csv(pyblp.data.PETRIN_AGENTS_LOCATION)

# keep only a subset to make error checking easier
stop_yr = 1985
product_data = product_data[product_data['market_ids'] < stop_yr]
agent_data = agent_data[agent_data['market_ids'] < stop_yr]
product_data = product_data[product_data['clustering_ids'] < 20]

# create the product-specific demographic
for t in range(1981, stop_yr):
    I_t = agent_data.loc[agent_data.market_ids == t].shape[0]
    prods_t = np.sort(product_data.loc[product_data.market_ids == t, 'clustering_ids'])
    for p in range(prods_t.shape[0]):
        agent_data.loc[agent_data.market_ids == t, "distance" + str(p)] = np.random.lognormal(size=I_t)

product_formulations = (
    pyblp.Formulation('1 + prices'),
    pyblp.Formulation('1')
)
agent_formulation = pyblp.Formulation('1 + distance')
problem = pyblp.Problem(product_formulations, product_data, agent_formulation, agent_data)
```
`numpy.einsum`. I think that's sufficient for the API documentation.

Thanks for the minimum working example. The issue is with the underlying patsy package raising an error with NAs (see the exception from which the lowest exception in the traceback was raised).
Let me know if the fix I just pushed works -- it just turns off that behavior for product-specific demographics. It seems to let me initialize the problem in your example.
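For anyone debugging similar tracebacks later, the underlying patsy behavior can be reproduced in isolation (a minimal sketch independent of pyblp; the column name is made up):

```python
import numpy as np
import pandas as pd
import patsy

data = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# With NA_action="raise", patsy errors out as soon as it sees a missing value:
raised = False
try:
    patsy.dmatrix("x", data, NA_action="raise")
except patsy.PatsyError:
    raised = True

# The default NA_action="drop" silently drops the offending row instead:
matrix = patsy.dmatrix("x", data)
```

This is the error that surfaces at the bottom of the exception chain when NaN-filled demographic columns reach patsy.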
Yes, it works. Thanks!
Hi, thank you so much for this amazing resource! I had two questions about the use of product-specific demographics and micro moments (in the development version). I think I figured out how they work, but I would like to suggest adding this to the documentation, or to be corrected if I misunderstood something.
1) In the API documentation on `agent_data` product-specific demographics, it says "For markets with fewer products than this maximum number, latter columns will be ignored." If I "fill" these latter columns with `NaN` or `NA` in the corresponding markets, I get the error message `Each demographic must either be a single column or have a column for each of the maximum of [...] products. There is at least one missing demographic for product index [...]`. If I fill them with numerical values, those do indeed seem to be ignored (at least, changing a value by a lot does not affect my results). Ideally, the code would allow missing values in markets where products are unavailable, so that the above error message is still raised when missing values are put in the wrong column. But if that is tricky, it would be great to simply mention this in the documentation.

2) It could be helpful to add information on how micro moments need to be adapted when product-specific demographics are included, for example in the micro moments tutorial. Assume, for example, that I add "distance" as a product-specific demographic in the `agent_data` and `agent_formulation` of your micro moments tutorial. I then need to adapt the compute-values function for non-product-specific demographics as follows (here for `agent_mi_part`):
since the inclusion of product-specific demographics turns `a.demographics[:, 5]` into an ($I_t \times J_t$) matrix in each market $t$, where the same column of demographics is repeated $J_t$ times. But if I want to include a micro moment for the product-specific demographic, for example $E[\mathrm{distance}_{ij} \cdot 1\{j>0\}]$, I need all $J_t$ columns of `a.demographics[:, 9]`. The following seems to work. But please let me know if you think something else might be better given the inner workings of the package. (I have this doubt because I still get some convergence issues when setting `initial_pi` to non-zero values for the demographics that should be identified by those micro moments, but those are more likely due to other problems that I am still investigating.)
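For later readers, the distinction between the two cases can be sketched in plain numpy (the shapes, column positions, and the leading outside-option column here are illustrative assumptions, not pyblp's exact `compute_values` signature):

```python
import numpy as np

# Illustrative shapes: I_t agents, J_t inside products in market t.
I_t, J_t = 5, 3

# Suppose agent demographics stack a product-specific "distance" variable into
# consecutive columns (column positions here are made up for illustration).
rng = np.random.default_rng(0)
demographics = rng.lognormal(size=(I_t, 2 + J_t))
distance_start = 2  # hypothetical index of the first "distance" column

# A non-product-specific demographic is one column, repeated across products
# (the leading zero column corresponds to the outside option j = 0):
income = demographics[:, 0]
income_values = np.c_[np.zeros(I_t), np.tile(income[:, None], (1, J_t))]

# A product-specific demographic instead needs all J_t of its own columns:
distance = demographics[:, distance_start:distance_start + J_t]
distance_values = np.c_[np.zeros(I_t), distance]

# Both cases produce the same (I_t, 1 + J_t) shape of micro moment values.
assert income_values.shape == distance_values.shape == (I_t, 1 + J_t)
```

The key point from the thread is only the slicing: one repeated column for an ordinary demographic versus the full block of $J_t$ columns for a product-specific one.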