jeffgortmaker / pyblp

BLP Demand Estimation with Python
https://pyblp.readthedocs.io
MIT License

Problems replicating Best Practices results #126

Closed AvyG closed 1 year ago

AvyG commented 1 year ago

Hi,

I want to start using the package, so I began by trying to replicate the results from your paper, but I am having trouble obtaining the same numbers, specifically for the best practices part.

Following the methods described in the paper and the tutorial, I can reproduce the replication-column results for the fake cereal and automobile data, but not those in the best practices column.

Since I am new to the method, I am not sure whether I am skipping some steps or doing something else wrong. Here is a repo (https://github.com/AvyG/BLP-replicationproject) with my notebook.

Thanks in advance!

jeffgortmaker commented 1 year ago

Great, thanks for doing this!

For the Nevo problem, you should be able to get the same numbers as the original paper if you use the same starting values initial_sigma and initial_pi that you used for the first step, before computing optimal instruments (right now you're using the estimates from the first stage). That said, I actually prefer your numbers, since they achieve the essentially zero objective that we'd expect under a just-identified model with demand-only optimal instruments. We probably should have noticed this in the original paper.
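
In case it's useful, roughly what I mean is something like this (just a sketch, with first_results standing in for your first-step results and any other solve options left as whatever you're already using):

# Build the optimal-instruments problem from the first-step results.
updated_problem = first_results.compute_optimal_instruments().to_problem()

# Re-solve from the original starting values, not from the first-stage estimates.
updated_results = updated_problem.solve(sigma=initial_sigma, pi=initial_pi)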

For the BLP problem, our best practices column also includes using 10,000 scrambled Halton draws in each market, while in your code you're using the same importance sampling draws that allow you to get the same results as the replication column. We didn't include these Halton draws in the package because they'd take up a lot of space, but you should be able to construct them with something like this:

import numpy as np
import pyblp

# Build 10,000 scrambled Halton draws per market; means and sd are defined below.
market_ids = []
weights = []
nodes = []
income = []
for index, (_, row) in enumerate(means.iterrows()):
    # Discard a different leading block of the sequence and scramble with a different seed in each market.
    integration = pyblp.Integration('halton', 10_000, {
        'discard': 1_000 + index * 10_000,
        'seed': index
    })
    untransformed_agents = pyblp.build_integration(integration, 6)

    # Tag the draws with their market/year (e.g., 71 -> 1971).
    market_ids.append(np.repeat(1900 + int(row['year']), untransformed_agents.weights.size))
    weights.append(untransformed_agents.weights)

    # Keep the first five dimensions as nodes and map the last one into log-normal income.
    nodes.append(untransformed_agents.nodes[:, :-1])
    income.append(np.exp(row['mean'] + float(sd) * untransformed_agents.nodes[:, -1]))

blp_agent_data = {
    'market_ids': np.concatenate(market_ids),
    'weights': np.concatenate(weights),
    'nodes': np.vstack(nodes),
    'income': np.concatenate(income),
}

Here means is a dataframe with income means by year:

year,mean
71,2.011560000000
72,2.065260000000
73,2.078430000000
74,2.057750000000
75,2.029150000000
76,2.053460000000
77,2.067450000000
78,2.098050000000
79,2.104040000000
80,2.072080000000
81,2.060190000000
82,2.065610000000
83,2.076720000000
84,2.104370000000
85,2.126080000000
86,2.164260000000
87,2.180710000000
88,2.188560000000
89,2.212500000000
90,2.183770000000

and sd is a common standard deviation of 1.72.
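
If it helps, here's a rough sketch of how this plugs in (the CSV filename, product_formulations, blp_product_data, and agent_formulation are placeholders for whatever you already have set up from the tutorial):

import pandas as pd
import pyblp

# Read the income means listed above (assumed to be saved as income_means.csv).
means = pd.read_csv('income_means.csv')
sd = 1.72

# ... build blp_agent_data with the loop above ...

# Pass the constructed draws to the problem alongside the usual product data.
blp_problem = pyblp.Problem(
    product_formulations,
    blp_product_data,
    agent_formulation,
    blp_agent_data,
)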

The other thing is that for that final column we computed optimal instruments after only one GMM step, not two. So to get numbers that exactly match, you'll need to compute optimal instruments from the first step of the replication column, not the second.
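
In code, the ordering is roughly (again a sketch with placeholder names; pass whatever other solve options you're already using):

# One GMM step only ...
first_results = blp_problem.solve(sigma=initial_sigma, pi=initial_pi, method='1s')

# ... then optimal instruments come from those first-step results.
optimal_iv_problem = first_results.compute_optimal_instruments().to_problem()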

Some general advice that I've built up over the years, which I've been meaning to put into a tutorial, is:

  1. Always use something like 3-5 random starting values (say, within bounds on your parameters) to be sure that you're consistently getting to the same global optimum; there's a rough sketch after this list. You can parallelize this across jobs on a grid if optimization takes a long time.
  2. Keep increasing the number of draws per market until you do consistently get to the same estimates across runs; otherwise the objective is probably too "choppy."
  3. The standard operating procedure is: (1) first GMM step with 3-5 starting values, (2) compute optimal weighting matrix and optimal IVs if desired, (3) second GMM step with another random 3-5 starting values.
  4. These days I generally recommend SciPy's trust-constr algorithm for optimization. Or Knitro if you have it.
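
To make 1-4 concrete, here's a rough sketch (problem, the bounds, and the rest of the solve options are placeholders, not code from the paper):

import numpy as np
import pyblp

# Placeholder bounds on the nonlinear parameters; adjust to your specification.
sigma_lb, sigma_ub = np.zeros((4, 4)), 2 * np.eye(4)

best_results = None
for seed in range(5):
    # Draw a random diagonal starting sigma within the bounds.
    state = np.random.RandomState(seed)
    initial_sigma = np.diag(state.uniform(np.diag(sigma_lb), np.diag(sigma_ub)))

    # First GMM step from this starting point, using SciPy's trust-constr.
    results = problem.solve(
        sigma=initial_sigma,
        sigma_bounds=(sigma_lb, sigma_ub),
        method='1s',
        optimization=pyblp.Optimization('trust-constr'),
    )

    # Keep the run with the lowest objective; ideally all runs agree.
    if best_results is None or float(results.objective) < float(best_results.objective):
        best_results = results

# Compute the optimal weighting matrix / optimal IVs from the best first-step run,
# then repeat the random-start loop for the second step on the updated problem.
updated_problem = best_results.compute_optimal_instruments().to_problem()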

Hope this helps.

AvyG commented 1 year ago

Hi,

Thanks for your quick and detailed response. With your advice, I was able to replicate the results for the fake cereal data. I am still working on the automobile data.

Thank you so much!

jeffgortmaker commented 1 year ago

Great! I'm going to close this issue for now, but please feel free to re-open / keep commenting if you have trouble with the automobile data, or anything else.