Product-specific demographics and micro-moments

JohannaJoy commented 1 year ago

Hi, thank you so much for this amazing resource! I have two questions about the use of product-specific demographics and micro-moments (in the development version). I think I figured out how it works, but I would like to suggest adding this to the documentation, or to be corrected if I misunderstood something.

1) In the API documentation on agent_data (product-specific demographics), it says "For markets with fewer products than this maximum number, latter columns will be ignored." If I fill these latter columns with NaN or NA in the corresponding markets, I get the error message "Each demographic must either be a single column or have a column for each of the maximum of [...] products. There is at least one missing demographic for product index [...]." If I fill them with numerical values, those do indeed seem to be ignored (at least, changing the values by a lot does not affect my results). Ideally, the code would allow missing values in markets where the products are unavailable, while still raising the above error when missing values are put in the wrong column. But if this is tricky, it would be great to simply mention it in the documentation.

2) It could be helpful to add information on how micro-moments need to be adapted when product-specific demographics are included, for example in the micro-moments tutorial. Suppose, for example, that I add "distance" as a product-specific demographic to the agent_data and agent_formulation of your micro-moments tutorial. I then need to adapt the compute_values function for non-product-specific demographics as follows (here for agent_mi_part):

    compute_values=lambda t, p, a: np.outer(np.take(a.demographics[:, 5],0,1), np.r_[0, p.X2[:, 7]]),

since the inclusion of product-specific demographics turns a.demographics[:, 5] into an ($I_t \times J_t$) matrix in each market $t$, where the same column of demographics is repeated $J_t$ times. But if I want to include a micro-moment for the product-specific demographic, for example $E[\text{distance}_{ij} \cdot 1\{j>0\}]$, I need all $J_t$ columns of a.demographics[:, 9]. The following seems to work:

    distance_inside_part = pyblp.MicroPart(
        name="E[distance_ij * 1{j>0}]",
        dataset=micro_dataset,
        compute_values=lambda t, p, a: np.multiply(
            np.column_stack((np.repeat(0, len(a.demographics[:, 9])), a.demographics[:, 9])),
            np.r_[0, p.X2[:, 0]],
        ),
    )

But please let me know if you think something else might be better given the inner workings of the package. (I ask because I still have some convergence issues when setting initial_pi to non-zero values for demographics whose coefficients should be identified by those micro-moments. But these are more likely due to other issues that I am still investigating.)
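The shape logic behind the two value functions above can be checked with mock arrays. This is only a sketch: it makes no pyblp calls, the sizes and the names `income`, `distance`, `x`, and `demographics` are hypothetical, and it assumes (as described above) that each demographic becomes an agents-by-products slice once a product-specific demographic is present.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, D = 4, 3, 2  # hypothetical numbers of agents, inside products, demographics

# Mock stand-in for a.demographics: one (I x J) slice per demographic.
income = rng.lognormal(size=I)          # not product-specific
distance = rng.lognormal(size=(I, J))   # product-specific
demographics = np.empty((I, D, J))
demographics[:, 0] = income[:, None]    # same column repeated J times
demographics[:, 1] = distance

# Mock stand-ins for columns of p.X2 (a characteristic and a constant).
x = rng.normal(size=J)
ones = np.ones(J)

# Non-product-specific case: keep one copy of the repeated column, then take
# the outer product with the characteristic (0 for the outside good).
income_values = np.outer(np.take(demographics[:, 0], 0, 1), np.r_[0, x])

# Product-specific case: prepend a zero column for the outside good, then
# multiply each column by the inside-good indicator.
distance_values = np.multiply(
    np.column_stack((np.zeros(I), demographics[:, 1])),
    np.r_[0, ones],
)

print(income_values.shape, distance_values.shape)
```

Both results have one row per agent and one column per choice, with the outside good in the first column.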

jeffgortmaker commented 1 year ago

Thanks for this!

1) I'm surprised that you're getting that error instead of "Variables in agent_data that contribute to demographics should not have NaNs or infinities." Getting the latter error is an unexpected side-effect of a recent fix to try to raise issues with NaNs etc. in arrays. I've just reverted it for product-specific demographics in df7e67c01537f51b58421054e36a892773f41e8e. With the dev code, if you're still getting an error when using NaNs with product-specific demographics, mind posting a minimum working example with the full traceback? Would be really helpful.

2) I agree! Documentation is really lacking there. I added a warning and note following your advice in 974ae41d09d65c5f948f629ddbf3a8a1d1aa75fd. Let me know if you'd add anything else -- also happy to just accept pull requests if you want to add anything to the docs yourself that you think would be helpful. The note is about using numpy.einsum, which I've found is an easy way to define micro moment weights/values when dealing with multidimensional arrays. When I find the time, I'll try to add examples of how to do this in the micro moments tutorial.
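As a rough illustration of the numpy.einsum idea, the sketch below builds agent-by-choice value matrices from mock arrays. It is not taken from the pyblp documentation: the array names and sizes are hypothetical, and the arrays stand in for the pieces that a compute_values function would pull out of the agent and product data.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 4, 3  # hypothetical numbers of agents and inside products

income = rng.lognormal(size=I)          # one value per agent
distance = rng.lognormal(size=(I, J))   # one value per agent and product
x = rng.normal(size=J)                  # one product characteristic

# Indicator 1{j > 0} over choices, with the outside good first.
inside = np.r_[0, np.ones(J)]

# 'i,j->ij' is an outer product: agent value times choice value.
income_values = np.einsum('i,j->ij', income, np.r_[0, x])

# 'ij,j->ij' scales each choice column of an agent-by-choice matrix.
padded_distance = np.column_stack((np.zeros(I), distance))
distance_values = np.einsum('ij,j->ij', padded_distance, inside)
```

The subscript strings spell out which axes are agents and which are choices, which can be easier to audit than chained np.outer and np.multiply calls once arrays have more than two dimensions.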

JohannaJoy commented 1 year ago

Thanks for your quick response!

I still get the same error, despite having pulled all commits. Here is a minimum working example and below the traceback:

    import pyblp
    import numpy as np
    import pandas as pd

    product_data = pd.read_csv(pyblp.data.PETRIN_PRODUCTS_LOCATION)
    agent_data = pd.read_csv(pyblp.data.PETRIN_AGENTS_LOCATION)

    # keep only subset to make error checking easier
    stop_yr = 1985
    product_data = product_data[product_data['market_ids'] < stop_yr]
    agent_data = agent_data[agent_data['market_ids'] < stop_yr]
    product_data = product_data[product_data['clustering_ids'] < 20]

    # create the product-specific demographic
    for t in range(1981, stop_yr):
        I_t = agent_data.loc[agent_data.market_ids == t].shape[0]
        prods_t = np.sort(product_data.loc[product_data.market_ids == t, 'clustering_ids'])
        for p in range(0, prods_t.shape[0]):
            agent_data.loc[agent_data.market_ids == t, "distance" + str(p)] = np.random.lognormal(size=I_t)

    product_formulations = (
        pyblp.Formulation('1 + prices'),
        pyblp.Formulation('1')
    )
    agent_formulation = pyblp.Formulation('1 + distance')

    problem = pyblp.Problem(product_formulations, product_data, agent_formulation, agent_data)

Traceback

```python-traceback
Traceback (most recent call last):
  File "C:\Users\johan\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3629, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'distance'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\johan\anaconda3\lib\site-packages\pyblp-0.13.0-py3.9.egg\pyblp\configurations\formulation.py", line 180, in _build_matrix
    data_mapping[name] = np.asarray(data[name]).flatten()
  File "C:\Users\johan\anaconda3\lib\site-packages\pandas\core\frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\johan\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3631, in get_loc
    raise KeyError(key) from err
KeyError: 'distance'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\johan\anaconda3\lib\site-packages\pyblp-0.13.0-py3.9.egg\pyblp\primitives.py", line 509, in build_demographics
    demographics, demographics_formulations, _ = agent_formulation._build_matrix(data)
  File "C:\Users\johan\anaconda3\lib\site-packages\pyblp-0.13.0-py3.9.egg\pyblp\configurations\formulation.py", line 194, in _build_matrix
    raise patsy.PatsyError(message, origin) from exception
PatsyError: Failed to load data for 'distance' because of the above exception.
    1 + distance
    ^^^^^^^^^^^^

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\johan\anaconda3\lib\site-packages\pyblp-0.13.0-py3.9.egg\pyblp\primitives.py", line 515, in build_demographics
    demographics_j, demographics_formulations, _ = agent_formulation._build_matrix(
  File "C:\Users\johan\anaconda3\lib\site-packages\pyblp-0.13.0-py3.9.egg\pyblp\configurations\formulation.py", line 234, in _build_matrix
    matrix = build_matrix(matrix_design, data_mapping)
  File "C:\Users\johan\anaconda3\lib\site-packages\pyblp-0.13.0-py3.9.egg\pyblp\configurations\formulation.py", line 474, in build_matrix
    matrix = patsy.build.build_design_matrices([design], data, NA_action='raise')[0].base
  File "C:\Users\johan\anaconda3\lib\site-packages\patsy\build.py", line 921, in build_design_matrices
    new_values = NA_action.handle_NA(values, is_NAs, origins)
  File "C:\Users\johan\anaconda3\lib\site-packages\patsy\missing.py", line 163, in handle_NA
    return self._handle_NA_raise(values, is_NAs, origins)
  File "C:\Users\johan\anaconda3\lib\site-packages\patsy\missing.py", line 172, in _handle_NA_raise
    raise PatsyError("factor contains missing values", origin)
PatsyError: factor contains missing values
    1 + distance
        ^^^^^^^^

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\johan\AppData\Local\Temp\ipykernel_50316\2496078814.py", line 27, in
    problem = pyblp.Problem(product_formulations, product_data, agent_formulation, agent_data)
  File "C:\Users\johan\anaconda3\lib\site-packages\pyblp-0.13.0-py3.9.egg\pyblp\economies\problem.py", line 1528, in __init__
    agents = Agents(products, agent_formulation, agent_data, integration)
  File "C:\Users\johan\anaconda3\lib\site-packages\pyblp-0.13.0-py3.9.egg\pyblp\primitives.py", line 294, in __new__
    demographics, demographics_formulations = build_demographics(products, agent_data, agent_formulation)
  File "C:\Users\johan\anaconda3\lib\site-packages\pyblp-0.13.0-py3.9.egg\pyblp\primitives.py", line 525, in build_demographics
    raise ValueError(message) from exception_j
ValueError: Each demographic must either be a single column or have a column for each of the maximum of 20 products. There is at least one missing demographic for product index 5.
```

Thanks for adding the warning and the hint about numpy.einsum. I think that's sufficient for the API documentation.

jeffgortmaker commented 1 year ago

Thanks for the minimum working example. The issue is with the underlying patsy package raising an error on NAs (see the exception from which the lowest exception was raised).

Let me know if the fix I just pushed works -- it just turns off that behavior for product-specific demographics. It seems to let me initialize the problem in your example.
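For anyone on a version without this fix, the workaround mentioned in the first comment (filling the unused columns with numerical values, which were reported there to be ignored) might look like the following. This is a minimal sketch on a hypothetical agent data frame; the column names, values, and the `PLACEHOLDER` constant are made up for illustration, and a placeholder gives up the check that missing values only appear in columns for unavailable products.

```python
import numpy as np
import pandas as pd

# Hypothetical agent data: market 1 has three products, market 2 only two,
# so "distance2" is missing for the agents in market 2.
agent_data = pd.DataFrame({
    "market_ids": [1, 1, 2, 2],
    "distance0": [0.5, 1.2, 0.7, 2.0],
    "distance1": [1.1, 0.3, 0.9, 1.4],
    "distance2": [0.8, 1.9, np.nan, np.nan],
})

# Replace missing entries in the product-specific columns with a finite value.
PLACEHOLDER = 0.0
distance_columns = [c for c in agent_data.columns if c.startswith("distance")]
agent_data[distance_columns] = agent_data[distance_columns].fillna(PLACEHOLDER)
```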

JohannaJoy commented 1 year ago

Yes, it works. Thanks!

jeffgortmaker / pyblp

Product-specific demographics and micro-moments #132
