jeffgortmaker / pyblp

BLP Demand Estimation with Python
https://pyblp.readthedocs.io
MIT License

Questions on implementing optimal instruments #97

Closed kevindanray closed 3 years ago

kevindanray commented 3 years ago

I noticed in the tutorial when optimal instruments were implemented, the procedure was:

  1. Initial estimate with 1s GMM
  2. Compute optimal instruments
  3. Updated results with 1s GMM

I am curious, could something be gained by using 2s GMM in either of the steps? Also, what would happen (good/bad) if one used "initial_update = True" for the updated results?
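
For reference, a minimal sketch of what that tutorial procedure looks like in code (placeholder names for the formulations, data, and starting values; the pyblp calls themselves follow the tutorial):

import pyblp

# 1. Initial estimate with 1-step GMM.
problem = pyblp.Problem(product_formulations, product_data, agent_formulation, agent_data)
first_results = problem.solve(initial_sigma, initial_pi, method='1s')

# 2. Compute (approximate) optimal instruments at the first-step estimates.
instrument_results = first_results.compute_optimal_instruments(method='approximate')

# 3. Updated results, again with 1-step GMM.
updated_problem = instrument_results.to_problem()
updated_results = updated_problem.solve(first_results.sigma, first_results.pi, method='1s')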

jeffgortmaker commented 3 years ago

Asymptotically, I don't think there should be a difference. This relates to discussions about two-step vs. iterated GMM, where there are sometimes finite sample improvements (or not) for more GMM steps.

The main thing initial_update=True does is update the weighting matrix at the specified initial parameter values. So this is fine if you're using your first-stage estimates as your second-stage starting values, since you're then updating the weighting matrix at a first-step consistent estimate.

But in practice, if you're not approximating the optimal instruments, I recommend using ProblemResults.updated_W. If you are approximating the optimal instruments, I recommend passing optimization=pyblp.Optimization('return') and your consistent first-stage estimates to Problem.solve, and using the resulting ProblemResults.updated_W. This way, you can try out multiple different starting values (e.g., +/- 50% around your first-stage estimates) under the same (efficient) weighting matrix, to verify that you're reaching the same (global) minimum during the second step.
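
A sketch of that workflow (hypothetical object names; the key pieces, pyblp.Optimization('return') and ProblemResults.updated_W, are the same ones used in the code later in this thread):

import pyblp

# First-stage results and approximated optimal instruments are assumed to exist already.
updated_problem = instrument_results.to_problem()

# Evaluate once at the consistent first-stage estimates just to recover updated_W.
returned = updated_problem.solve(
    first_results.sigma,
    first_results.pi,
    optimization=pyblp.Optimization('return'),
    method='1s'
)

# Re-solve from several starting values (e.g., +/- 50%) under the same efficient weighting matrix.
candidate_results = [
    updated_problem.solve(
        scale * first_results.sigma,
        scale * first_results.pi,
        W=returned.updated_W,
        method='1s'
    )
    for scale in (0.5, 1.0, 1.5)
]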

chrisconlon commented 3 years ago

I should also add that the optimal IV problem (with just the demand side) is just-identified. In this case, the weighting matrix should not be relevant.

With demand + supply there is an overidentifying restriction and weights might matter, but any initial consistent estimate of theta should work (as Jeff notes).

We discuss this at some length in the RAND paper (https://chrisconlon.github.io/site/pyblp.pdf).


kevindanray commented 3 years ago

I had been replying by email, but contrary to how that is supposed to work, those replies have not been added to the thread here...

I've been working through the issues for the last couple days and feel like I fully understand what both of you have said here, so thank you. However, I seem to have identified unexpected behaviors in the program.

Even though the problem is "just identified", I was winding up with objective function values on the order of 1E+02 and 1E+03, while my understanding is that the just-identified problem should be minimized at approximately zero. I did some checking and found that this behavior is linked (separately) to two distinct features of my model specifications.

In the first case, omitting the constant term from theta2 in favor of a mutually exclusive and exhaustive set of dummies causes the optimal instruments problem to behave as if it were overidentified when these dummies are also included in X1. Note that the minimum objective value achieved is 2.34, far above machine precision.

import numpy as np
import pyblp

# No constant in X1 or X2: mushy enters as a mutually exclusive pair of dummies instead.
product_formulations_noconstant = (
    pyblp.Formulation('0 + prices + sugar + C(mushy)'),
    pyblp.Formulation('0 + prices + sugar + C(mushy)')
)

problem_noconstant = pyblp.Problem(product_formulations_noconstant, product_data, agent_formulation, agent_data)

initial_sigma_noc = np.diag([2.4526, 0.0163, 0.3302, 0.2441])
initial_pi_noc = [
    [15.8935, -1.2000, 0,       2.6342],
    [-0.2506,  0,      0.0511,  0     ],
    [ 5.4819,  0,      0.2037,  0     ],
    [ 1.2650,  0,     -0.8091,  0     ]
]

# First stage: 1-step GMM at the initial values.
results_noconstant = problem_noconstant.solve(
    initial_sigma_noc,
    initial_pi_noc,
    optimization=pyblp.Optimization('bfgs', {'gtol': 1e-5}),
    method='1s'
)

# Approximate the optimal instruments at the first-stage estimates and re-solve.
instrument_results_noc = results_noconstant.compute_optimal_instruments(method='approximate')
updated_problem_noc = instrument_results_noc.to_problem()
updated_results_noc = updated_problem_noc.solve(
    results_noconstant.sigma,
    results_noconstant.pi,
    delta=results_noconstant.delta,
    optimization=pyblp.Optimization('bfgs', {'gtol': 1e-5}),
    method='1s'
)

Problem Results Summary:
========================================================================================================
GMM   Objective  Gradient       Hessian         Hessian     Clipped  Weighting Matrix  Covariance Matrix
Step    Value      Norm     Min Eigenvalue  Max Eigenvalue  Shares   Condition Number  Condition Number 
----  ---------  ---------  --------------  --------------  -------  ----------------  -----------------
 1    +2.34E+00  +5.18E-06    +1.97E-04       +2.22E+04        0        +2.93E+10          +1.83E+18    
========================================================================================================

The other feature of my specification that also produces this issue is the inclusion of non-zero off-diagonal elements of Sigma, which I have also successfully replicated with the Nevo pseudo-data.

Do you have any insight on this?

jeffgortmaker commented 3 years ago

One way to rephrase your question is that you're having trouble finding a solution (zero objective) to a nonlinear system of n equations (moments) in n unknowns (parameters). I can think of at least two contributing problems:

  1. If the system were linear, we'd require a standard rank condition for there to exist a solution. But the problem here is nonlinear, so you'd need some generalized rank condition that is going to be difficult to establish.

  2. Even if there exists a solution, you need to find it with a numerical optimizer. The problem here is generally non-convex, so finding a solution can be difficult.

These two problems can also interact. If some rank condition is satisfied, but just barely, this may make it more difficult for an optimizer to find the solution.

It makes sense to me that these problems will start cropping up when you add more random coefficients or add correlations between these coefficients. Intuitively, you're (1) increasing the dimension of your system, and (2) increasing its degree of nonlinearity. Both of these will contribute to the above problems.

I have some suggestions that might help you get a sense of what's going on with your specific problem:

  1. See whether some pair of approximated optimal instruments (or moment conditions in general) are nearly identical (see the sketch after this list). If so, this intuitively suggests that you have nearly n - 1 equations in n unknowns, so it may be difficult (or impossible) to find a solution that sets the objective to zero. (Your updated weighting matrix has a condition number on the order of 1E10, so this may be the case.) Checking which parameters these nearly-collinear instruments correspond to may help give a sense of what's going on.

  2. Try my suggestions above about using different starting values, say, 5 sets +/- 50% around your first-stage estimates. This is standard practice when trying to find the global optimum of a non-convex problem. You may also want to try different optimizers and termination tolerances.

  3. Starting with a problem where you can find a solution, minimally add to the problem to see what's leading to non-existence. For example, you could add just one more column instead of two dummies, and use the same estimated parameter values as before, except for a new one on the extra column.
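
A rough sketch of checks along the lines of (1) and (2), assuming the updated_problem_noc and results_noconstant objects from the code above (and assuming the demand-side instrument matrix is accessible as Problem.products.ZD):

import numpy as np

# (1) Look for nearly identical approximated optimal instruments: a very large condition
# number or near-unit pairwise correlations suggest an almost-deficient rank condition.
Z = np.asarray(updated_problem_noc.products.ZD, dtype=float)
print('instrument condition number:', np.linalg.cond(Z))
corr = np.corrcoef(Z, rowvar=False)
np.fill_diagonal(corr, 0)
print('max |pairwise correlation|:', np.nanmax(np.abs(corr)))

# (2) Re-solve from several randomly perturbed starting values (within +/- 50%) under the
# same settings, to check whether the optimizer keeps finding the same minimum.
rng = np.random.default_rng(0)
for draw in range(5):
    sigma0 = results_noconstant.sigma * (1 + rng.uniform(-0.5, 0.5, results_noconstant.sigma.shape))
    pi0 = results_noconstant.pi * (1 + rng.uniform(-0.5, 0.5, results_noconstant.pi.shape))
    perturbed = updated_problem_noc.solve(sigma0, pi0, method='1s')
    print(draw, perturbed.objective)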

kevindanray commented 3 years ago

Thank you again for the helpful suggestions. It sent me down a path of investigation that hopefully answers our questions.

First of all, I was reporting results for the Nevo problem...where indeed there is strong multicollinearity as a result of the income_squared term. Dropping that term reduces the weighting matrix condition number and leads to better behavior in general.

However, with both the Nevo data and my data I was continuing to get non-zero objective values with certain formulations. My theory is that certain formulations impose constraints such that you effectively have more than N equations. This occurred when certain off-diagonal elements were included in Sigma (in the Nevo pseudo-data, specifically interactions of the constant with a continuous variable), and when mushy[0] and mushy[1] were both included in Pi. For instance, in addition to the usual moment conditions, you would have mushy[0] + mushy[1] = 1 as a constraint on the system, so it may actually be more like N equations in N-1 unknowns. Put more directly in terms of the system of moments: because of the linkages between the variables, the structural error is not sufficiently free to adjust with theta2 to achieve the zero objective value. It's less intuitively clear why the constant interacted with continuous variables leads to the same behavior; my best guess is that there is something like a "constant = 1" constraint, whereas with dummy variable interactions this scale constraint is unnecessary.

I recognize that I am describing constraints on the data and not the parameters, but there should be an analogous constraint on the parameters. Does this sound like a plausible explanation for the observed patterns?

jeffgortmaker commented 3 years ago

Honestly, I'm not sure. It's hard to say much more without a minimum working example (MWE) where adding just one small thing (e.g. a single new variable in X2) and holding everything else constant (parameter estimates, instruments, etc.) results in a consistently (across multiple starting values, etc.) nonzero objective, whereas previously it was zero.

An even better MWE would then drop other parts of the problem (reduce the number of markets, cut out other variables, etc.), until you're left with the smallest possible problem, and the smallest possible tweak to this problem, that delivers a zero -> nonzero objective.

I agree that mushy[0] + mushy[1] = 1 is a constraint on the system, but I don't think it's much different from including a constant in your formulation. So it's not clear to me why this would be violating some rank condition.

kevindanray commented 3 years ago

It's not exactly "minimal", but I reduced the size of the example with the Nevo data as far as I could. To replicate the issue, it is necessary to include three terms (1 + prices + mushy), allow for off-diagonal sigma terms in the first column, include an income interaction on the constant, and formulate X1 without product fixed effects. I have tested starting values 50% greater and 50% lower than the first-stage results, and they all converge to the same solution at an objective value of 6.85E-02 (BFGS tolerance 1e-5). Nelder-Mead reaches the same objective value at convergence.

Changing any one element of this problem seems to resolve the issue:


product_formulations_mwe = (
    pyblp.Formulation('1 + prices + sugar + mushy'),
    pyblp.Formulation('1 + prices + mushy')
)
problem_mwe = pyblp.Problem(product_formulations_mwe, product_data, pyblp.Formulation('0 + income'), agent_data)

# First stage: off-diagonal sigma terms in the first column and an income interaction on the constant.
results_mwe = problem_mwe.solve(
    sigma=[
        [0.3, 0,   0   ],
        [1,   2.5, 0   ],
        [1,   0,   0.01]
    ],
    pi=[
        [1],
        [0],
        [0]
    ],
    optimization=pyblp.Optimization('bfgs', {'gtol': 1e-5}),
    method='1s'
)

# Approximate the optimal instruments at the first-stage estimates.
instrument_results_mwe = results_mwe.compute_optimal_instruments(method='approximate')

updated_problem_mwe = instrument_results_mwe.to_problem()

# Evaluate once at the first-stage estimates to recover the updated weighting matrix.
w_mwe = updated_problem_mwe.solve(
    results_mwe.sigma,
    results_mwe.pi,
    delta=results_mwe.delta,
    optimization=pyblp.Optimization('return'),
    method='1s'
)

# Second stage under the fixed (efficient) weighting matrix.
updated_results_mwe = updated_problem_mwe.solve(
    results_mwe.sigma,
    results_mwe.pi,
    delta=results_mwe.delta,
    W=w_mwe.updated_W,
    optimization=pyblp.Optimization('bfgs', {'gtol': 1e-5}),
    method='1s'
)

jeffgortmaker commented 3 years ago

Thanks for this. I'm not seeing any clear pattern, so my guess is this is a numerical issue, due to some combination of points (1) and (2) above, and probably others that I'm not accounting for.

In particular, one catch-all answer to "my estimator is having numerical issues" is: try increasing the number of simulation draws. Intuitively, this will help "smooth out" any rank condition in point (1), and will make the objective less "choppy," helping the optimizer in point (2).

It turns out that this may in fact be helpful in this particular case. I replaced agent_data with mwe_agent_data when initializing the problem and added the following code before the start of yours:

import numpy as np
import pandas as pd

# Stack `scale` copies of the original agent data, re-draw standard normal nodes,
# and re-scale the weights so they still sum to one within each market.
scale = 1
np.random.seed(0)
mwe_agent_data = pd.concat(scale * [agent_data])
mwe_agent_data['nodes0'] = np.random.normal(size=len(mwe_agent_data))
mwe_agent_data['nodes1'] = np.random.normal(size=len(mwe_agent_data))
mwe_agent_data['nodes2'] = np.random.normal(size=len(mwe_agent_data))
mwe_agent_data['weights'] /= scale

If you change the seed a number of times, you'll find that you often get a nonzero objective, but sometimes get a near-zero one. When I increased scale to something like 10, I found that I got a near-zero objective more often than not.
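
A sketch of that seed experiment, re-using the MWE setup and solve settings from above (the loop just repeats the first-stage, optimal-instrument, and fixed-W steps for each new set of draws):

# Hedged sketch: repeat the exercise across seeds and record the second-stage objective.
second_stage_objectives = {}
for seed in range(5):
    np.random.seed(seed)
    draws = pd.concat(scale * [agent_data])
    for k in range(3):
        draws[f'nodes{k}'] = np.random.normal(size=len(draws))
    draws['weights'] /= scale

    problem = pyblp.Problem(product_formulations_mwe, product_data, pyblp.Formulation('0 + income'), draws)
    first = problem.solve(
        sigma=[[0.3, 0, 0], [1, 2.5, 0], [1, 0, 0.01]],
        pi=[[1], [0], [0]],
        optimization=pyblp.Optimization('bfgs', {'gtol': 1e-5}),
        method='1s'
    )
    updated = first.compute_optimal_instruments(method='approximate').to_problem()
    w = updated.solve(first.sigma, first.pi, delta=first.delta,
                      optimization=pyblp.Optimization('return'), method='1s')
    second = updated.solve(first.sigma, first.pi, delta=first.delta, W=w.updated_W,
                           optimization=pyblp.Optimization('bfgs', {'gtol': 1e-5}), method='1s')
    second_stage_objectives[seed] = second.objective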

This suggests to me that asymptotically (in the number of Monte Carlo draws), some rank condition is satisfied, but in finite samples when we only have a few draws, it is sometimes not satisfied. In this problem, the sparsity of mushy and the weakness of instruments are probably contributing factors.

All this said, I don't have great intuition for what precisely is causing the system to not have a solution, or making any solution difficult to find. There's just a lot going on.

kevindanray commented 3 years ago

Independently today, I started playing around with the problem and the integration method, and I think that it is clearly related, but perhaps not as simple as "you need lots of draws," which is what your solution appears to be doing.

So my new minimum working example is MUCH more minimal: I have removed the off-diagonal elements and the demographic interactions.

product_formulations_mwe = (
    pyblp.Formulation('1 + prices + mushy'),
    pyblp.Formulation('1 + prices + mushy')
)

# 'method' and options stand in for whichever integration configuration is being tested
# (monte_carlo, halton, lhs, mlhs, product, grid, nested_grid, ...).
problem_mwe = pyblp.Problem(product_formulations_mwe, product_data,
                            integration=pyblp.Integration('method', options))

# First stage: diagonal sigma only, no demographic interactions.
results_mwe = problem_mwe.solve(
    sigma=np.diag([0.3, 2.5, 0.01]),
    optimization=pyblp.Optimization('bfgs', {'gtol': 1e-6}),
    method='1s'
)

# Approximate the optimal instruments and recover the updated weighting matrix.
instrument_results_mwe = results_mwe.compute_optimal_instruments(method='approximate')

updated_problem_mwe = instrument_results_mwe.to_problem()

w_mwe = updated_problem_mwe.solve(
    results_mwe.sigma,
    delta=results_mwe.delta,
    optimization=pyblp.Optimization('return'),
    method='1s'
)

# Second stage under the fixed weighting matrix.
updated_results_mwe = updated_problem_mwe.solve(
    results_mwe.sigma,
    results_mwe.pi,
    delta=results_mwe.delta,
    W=w_mwe.updated_W,
    optimization=pyblp.Optimization('bfgs', {'gtol': 1e-6}),
    method='1s'
)

What I found fascinating is that the updated results achieve the near-zero objective with Monte Carlo (500 draws and 2500 draws, seed 110), Product, and Grid/Nested Grid. But with Halton, LHS, and MLHS, it does not...even with 5000 draws. Even more odd, when I tested Monte Carlo with the 5000 draws, it ceased to achieve the near-zero objective. I tested it again with 1500 and still did not achieve near-zero...and then I tested MC with 500 draws and seed 10, and again it did not achieve near-zero.

I've run a number of tests along these lines, and I find the following:

jeffgortmaker commented 3 years ago

Interesting! Honestly I don't have the time right now to play too much with this problem, but I did notice that sometimes the first-stage optimizer fails to converge, and that weighting matrix condition numbers can be somewhat large.

So it might be prudent to fix the approximated optimal instruments at some converged first-stage estimate that generates a reasonably small weighting matrix condition number (indicating that the instruments aren't nearly collinear), and then play around with the integration configuration during the second stage, to see if this impacts anything. The idea is just to hold as many things fixed as possible.

kevindanray commented 3 years ago

Using my actual research data, I observed much the same phenomenon, although for a low number of draws (250 and 300) I was able to get Halton draws to produce near-zero objective values. But so long as things work as expected with Monte Carlo draws, I believe the gains from the approximation to the optimal instruments outweigh the efficiency loss of moving from quasi-random draws to pseudo-random draws. Worst case scenario, you just need to throw more draws at the problem and let it run for a while to convince the reviewers/editors.

In order to play around with the integration during the second stage only, I would need to be able to pass in agent_data or a new integration, which does not appear to be allowed with the .to_problem() function. I could be missing something in the documentation though.

chrisconlon commented 3 years ago

Sorry -- but I lost track of this thread. I think the setup is:

  1. We estimate demand-only to get $\hat{\theta}$
  2. We compute optimal IV using the approximation $E[p \mid z]$ and $\hat{\theta}$

The issue is that with the optimal IV we have K moment restrictions and K unknown parameters, so we say the system is "just identified" -- why, then, can't we find a GMM objective value that is close to zero?

I think the answer is quite simple: $E[\xi_{jt} \mid z_{jt}] = 0$ only if the model is correctly specified. In a misspecified model the conditional moment restriction need not hold, and the unconditional moments derived from it (including the "optimal IV" moments) need not be satisfied either.

Put more simply, without the correct fixed effects, and nonlinear parameters, why would one expect the Nevo problem to satisfy the moment restrictions? You're getting a nonzero objective and "rejecting the model" which is exactly what should be happening here!
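
To put the just-identification point in symbols (a rough sketch, writing $\xi_{jt}(\theta)$ for the structural error and $z_{jt}$ for the $K$-vector of instruments): the sample moments and GMM objective are

$$\bar{g}(\theta) = \frac{1}{N} \sum_{j,t} \xi_{jt}(\theta) z_{jt}, \qquad Q(\theta) = \bar{g}(\theta)' W \bar{g}(\theta),$$

and for any positive definite $W$, $\min_\theta Q(\theta) = 0$ exactly when some $\theta$ solves the $K$ nonlinear equations $\bar{g}(\theta) = 0$. Under misspecification $E[\xi_{jt} \mid z_{jt}] \neq 0$, so such a solution need not exist even though the system is nominally just identified.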

kevindanray commented 3 years ago

Chris,

I wanted to dig into your hypothesis that the problem is failing because of the specification changes I used to create the minimum working example. I took the original Nevo agent data, dropped the integration nodes, and generated new ones, borrowing from Jeff's code to scale the data up and get more draws as well.

To be clear, I have the full product formulation from the original problem, and the only change I have made to the agent formulation is the removal of the income_squared term, which you have established works. Once again, the optimal instruments model produces objective function values around 4.3 with Halton and MLHS, but near zero with Monte Carlo.

The issue seems to be tied to the integration methods. I generated Sobol and Halton draws in R and loaded them into Python. With just 20 draws, as in the original, the approximation to the optimal instruments model converged to a near-zero objective value with Sobol draws, while Halton achieved about 1E-2. Consistent with Jeff's scaling up to 10, I went to 200 draws, and in this case Sobol reached 1.6E-2 and Halton reached 2.2E+1. Scaling up to 50, the problem persists for both Sobol and Halton draws. But importing MC draws from R, the objective function reaches near zero again.

Now, with the MWE I was able to just use the standard pyblp.Integration configuration in the problem, so that suggests the issue lies there. However, when I am merging with the demographic data, I have been using the code below to build the integration nodes, which I then merge with the demographic draws. If I have made an error here, let me know:

import pandas as pd
import pyblp

def gen_int_nodes(df, N, dimensions):
    """Build N Monte Carlo integration nodes per market, with a different seed in each market."""
    # Extract all market ids
    markets = df.market_ids.unique()

    # Create a blank dictionary to populate with draws for each market
    d = {}

    # Loop over markets, creating a table of integration nodes for each
    for num, m in enumerate(markets, start=1):
        # seed=num ensures that you do not get the exact same set of draws in every market
        integration = pyblp.Integration('monte_carlo', size=N, specification_options={'seed': num})
        int_nodes = pd.DataFrame(pyblp.data_to_dict(pyblp.build_integration(integration, dimensions)))
        int_nodes.insert(0, 'market_ids', m)
        int_nodes['draw_num'] = int_nodes.index
        d[m] = int_nodes

    return pd.concat(d.values())
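
For example, a hypothetical call (assuming four integration dimensions, matching the full Nevo X2), whose output would then be merged with the demographic draws on market_ids and draw_num:

mc_nodes = gen_int_nodes(product_data, N=200, dimensions=4)
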
kevindanray commented 3 years ago

Jeff, I finally figured out that if I used optimization=pyblp.Optimization('return'), then I could use one set of integration draws with the instruments computed from the results of another set of integration draws. I proceeded to test the behavior/performance of the optimal instrument objective value in several different ways. I have attached the notebook with the initial configuration settings.

First, I tried changing the seed for the MC and Halton draws, and this revealed significant instability. The first run had seed=num, the second run had seed=num+5, and the third run had seed=num+num*num. The rows represent the results used to compute the optimal instruments, while the columns represent the integration draws used to solve. Across the three seeds, there is not a single cell that is consistently near-zero aside from the original data, which does not depend on the seed. However, it is clear that some combination of the results from the original data or the draws from the original data is frequently associated with achieving the near-zero objective value.


Run 1 (seed=num):

|                | Original | MC       | Halton   |
| -------------- | -------- | -------- | -------- |
| Original Draws | 3.42E-14 | 9.57E-02 | 8.23E-01 |
| MC Draws       | 2.55E-15 | 2.55E-14 | 8.30E+00 |
| Halton Draws   | 3.59E-14 | 5.55E-01 | 1.54E-01 |

Run 2 (seed=num+5):

|                | Original | MC       | Halton   |
| -------------- | -------- | -------- | -------- |
| Original Draws | 3.42E-14 | 2.82E-16 | 3.90E-15 |
| MC Draws       | 3.88E-14 | 3.12E-14 | 1.63E+00 |
| Halton Draws   | 4.96E-01 | 6.68E-01 | 5.79E-02 |

Run 3 (seed=num+num*num):

|                | Original | MC       | Halton   |
| -------------- | -------- | -------- | -------- |
| Original Draws | 3.42E-14 | 1.13E-15 | 7.34E-17 |
| MC Draws       | 3.59E-01 | 1.90E+00 | 8.94E+00 |
| Halton Draws   | 1.60E+00 | 2.04E-15 | 7.62E+00 |

Increasing the number of draws to 1000 (scale = 50), every new result/integration node combination failed to approach zero. Unfortunately I don't think this clears up what is happening, but it does expand the toolkit for identifying the issue.

I also attempted to replicate the issue with the simulation methods, and was able to do so...however, I lost track of which combination of configurations produced the non-zero objective values and have not been able to recreate it since.

Optimal Instruments Troubleshooting.pdf

jeffgortmaker commented 3 years ago

Right, so when working with a small number of draws/nodes (for a four-dimensional integral, a few hundred is quite small, especially when there are correlations), I'd expect a good amount of instability.

I'm not sure how to interpret optimal IVs being approximated with one set of draws and the problem being solved with another set. Asymptotically this shouldn't matter, but here the number of draws is again not too large.

Again, I'm not even sure that failing to solve a nonlinear system of n equations in n unknowns is an "issue" for the reasons discussed above. This is particularly the case when we haven't thought much about the instruments being used for the problem, and where identification is coming from in the first place with this specific dataset.

kevindanray commented 3 years ago

I agree, there's not much to go on, and it may not be a problem. I think my takeaway from this is that when the zero objective value is not obtained, there is cause for using 2s GMM even with the just-identified, demand-side-only problem with optimal instruments.
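
For completeness, a minimal sketch of that takeaway in code (same objects as the MWE above; method='2s' simply adds a second GMM step with an updated weighting matrix):

# Hedged sketch: re-solve the optimal-instruments problem with 2-step GMM instead of
# fixing W by hand, so the weighting matrix is updated after the first step.
updated_results_2s = updated_problem_mwe.solve(
    results_mwe.sigma,
    delta=results_mwe.delta,
    optimization=pyblp.Optimization('bfgs', {'gtol': 1e-6}),
    method='2s'
)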