jeffgortmaker / pyblp

BLP Demand Estimation with Python
https://pyblp.readthedocs.io
MIT License

Simulated Counterfactual Calculations Error #119

Closed llmafrica closed 1 year ago

llmafrica commented 1 year ago

Hi Jeff,

I have problem results for a random coefficients logit model, along with measures of CS, markups, and MC, for 4 different "market types". I am attempting to simulate a counterfactual where, for example, parameters from market types 2, 3, and 4 are used with data from market type 1 in order to compare overall measures of CS, markups, and MC. I have the following code for one of my calculations:

market1dat = data.loc[data['market'] == 'market1']
market1agents = agent_data.loc[agent_data['market'] == 'market1']
simulation = pyblp.Simulation(
    product_formulations=(
        pyblp.Formulation('1 + p + x1 + x2'),
        pyblp.Formulation('0 + p'),
        pyblp.Formulation('1 + x1 + x2 + x3')
    ),
    beta=[float(res['market2'].beta[0]), ...],  # etc.
    sigma=float(res['market2'].sigma[0]),
    gamma=[float(res['market2'].gamma[0]), ...],  # etc.
    product_data=market1dat,
    agent_data=market1agents
)

If I understand correctly, I then use simulation.replace_endogenous to calculate the new equilibrium prices and shares given market 1 data with market 2 parameters, and then calculate CS, markups, and MC via the SimulationResults methods.
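
For reference, the last step I have in mind looks roughly like this (a sketch; I'm assuming these are the right SimulationResults methods for each measure):

results = simulation.replace_endogenous()    # solve for new equilibrium prices and shares
cs = results.compute_consumer_surpluses()    # CS under market 2 parameters, market 1 data
markups = results.compute_markups()
mc = results.compute_costs()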

However, when I do the above I get the following error:

The fixed point computation of synthetic prices failed to converge. This problem can sometimes be mitigated by increasing the maximum number of fixed point iterations, increasing the fixed point tolerance, configuring other iteration settings, or making sure the specified parameters are reasonable. For example, the parameters on prices should generally imply a downward sloping demand curve.
Encountered a numerical error when computing synthetic prices. This problem is often due to prior problems or overflow and can sometimes be mitigated by making sure that the specified parameters are reasonable. For example, the parameters on prices should generally imply a downward sloping demand curve. Errors encountered: divide by zero, invalid value.

I've checked that the demand curve is downward sloping in prices. I've tried setting error_behavior = 'warn', but the results produced seem unreasonable (for example, CS = 2.0e+6 as a lower estimate). I've altered the tolerance, but only seem to get semi-reasonable results if the tolerance is 1e-1. I have also increased the maximum evaluations and changed the starting values to the observed prices in the market 1 data, but I continue to get this error or unreasonable results.

Am I missing something here? Do you have any suggestions for other things to try out?

jeffgortmaker commented 1 year ago

I think it'd help to first get a sense of precisely where the division by zero is happening. Can you set pyblp.options.verbose_tracebacks = True and see if that gives us a sense of where the issue is coming from? Setting universal_display=True in your pyblp.Iteration configuration also helps to figure out at which point during iteration you're getting the error.

The next step will typically be defining a custom iteration configuration (there's documentation for that at the bottom of the iteration docs linked above) so that you can see exactly what's happening.
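
Concretely, something like this (the 'simple' method here is just an example; use whatever configuration you're currently using):

import pyblp

pyblp.options.verbose_tracebacks = True  # include full stack traces in error messages

# pass an explicit iteration configuration so each fixed point step is displayed
iteration = pyblp.Iteration('simple', {'atol': 1e-12}, universal_display=True)
simulation.replace_endogenous(iteration=iteration)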

llmafrica commented 1 year ago

Sorry, the full traceback is enormous and I'm having trouble interpreting it. But if I understand correctly, this should be the important part, where the error is occurring. So disabling the profit Hessian calculations might help, is that right?

File ... \lib\site-packages\pyblp\economies\simulation.py:695, in Simulation.replace_endogenous(self, costs, prices, iteration, constant_costs, compute_gradients, compute_hessians, error_behavior)
    692                 errors.append(exceptions.ProfitHessianEigenvaluesError(profit_hessian))
    694 # structure the results
--> 695 self._handle_errors(errors, error_behavior)
    696 results = SimulationResults(
    697     self, data_override, true_delta, true_costs, start_time, time.time(), iteration_stats, profit_gradients,
    698     profit_gradient_norms, profit_hessians, profit_hessian_eigenvalues
    699 )
    700 output(f"Replaced prices and shares after {format_seconds(results.computation_time)}.")

File  ... \lib\site-packages\pyblp\economies\economy.py:233, in Economy._handle_errors(errors, error_behavior)
    231 if errors:
    232     if error_behavior == 'raise':
--> 233         raise exceptions.MultipleErrors(errors)
    234     output("")
    235     output(exceptions.MultipleErrors(errors))

jeffgortmaker commented 1 year ago

No, this is just pinpointing the part of the code that collects error objects (created earlier) and raises them as an exception. It just happens to be after line 692 that has to do with profit Hessians -- nothing related to your issue.

What you're looking for is the part of the traceback that comes just above "Errors encountered: divide by zero, invalid value." For example, if I manually edit the pyblp code to divide by zero when solving for equilibrium prices by adding an updated_x /= 0 after line 788 in markets/market.py, this is the bottom of my traceback with pyblp.options.verbose_tracebacks = True on:

  File "...\pyblp\configurations\iteration.py", line 273, in contraction_wrapper
    values, weights, jacobian = contraction(values, iterations, evaluations)
  File "...\pyblp\markets\market.py", line 789, in contraction
    updated_x /= 0
  File "...\pyblp\utilities\basics.py", line 661, in __call__
    self.detected = self.error()
  File "...\pyblp\utilities\basics.py", line 565, in __init__
    super().__init__()
  File "...\pyblp\utilities\basics.py", line 498, in __init__
    self.stack = ''.join(traceback.format_stack())

 Errors encountered: divide by zero.

You should also try setting universal_display=True to see if you're getting the error in the first iteration or in some subsequent iteration.

llmafrica commented 1 year ago

I see. Thank you for clarifying. I'm working in VSCode with the IPython Notebook, so there is a rather large traceback reported, but I think this is the relevant bit if I'm reading it correctly:

MultipleErrors: The fixed point computation of synthetic prices failed to converge...Traceback:

  File "...\pyblp\economies\simulation.py", line 663, in replace_endogenous
    for t, (prices_t, shares_t, delta_t, costs_t, stats_t, gradients_t, hessians_t, errors_t) in generator:
  File "...\pyblp\utilities\basics.py", line 269, in output_progress
    for index, iterated in enumerate(iterable):
  File "...\pyblp\utilities\basics.py", line 134, in <genexpr>
    return (generate_items_worker((k, factory(k), method)) for k in keys)
  File "...\pyblp\utilities\basics.py", line 147, in generate_items_worker
    return key, method(instance, *method_args)
  File "...\pyblp\markets\simulation_market.py", line 27, in compute_endogenous
    prices, stats, price_errors = self.safely_compute_equilibrium_prices(costs, iteration, constant_costs, prices)
  File "...\pyblp\utilities\basics.py", line 639, in wrapper
    returned = decorated(*args, **kwargs)
  File "...\pyblp\markets\simulation_market.py", line 82, in safely_compute_equilibrium_prices
    errors.append(exceptions.SyntheticPricesConvergenceError())
  File "...\pyblp\utilities\basics.py", line 498, in __init__
    self.stack = ''.join(traceback.format_stack())

And then

Encountered a numerical error when computing synthetic prices...Traceback:

File "...\pyblp\economies\simulation.py", line 663, in replace_endogenous
    for t, (prices_t, shares_t, delta_t, costs_t, stats_t, gradients_t, hessians_t, errors_t) in generator:
  File "...\pyblp\utilities\basics.py", line 269, in output_progress
    for index, iterated in enumerate(iterable):
  File "...\pyblp\utilities\basics.py", line 134, in <genexpr>
    return (generate_items_worker((k, factory(k), method)) for k in keys)
  File "...\pyblp\utilities\basics.py", line 147, in generate_items_worker
    return key, method(instance, *method_args)
  File "...\pyblp\markets\simulation_market.py", line 27, in compute_endogenous
    prices, stats, price_errors = self.safely_compute_equilibrium_prices(costs, iteration, constant_costs, prices)
  File "...\pyblp\utilities\basics.py", line 639, in wrapper
    returned = decorated(*args, **kwargs)
  File "...\pyblp\markets\simulation_market.py", line 80, in safely_compute_equilibrium_prices
    prices, stats = self.compute_equilibrium_prices(costs, iteration, constant_costs, prices)
  File "...\pyblp\markets\market.py", line 828, in compute_equilibrium_prices
    prices, stats = iteration._iterate(prices, contraction)
  File "...\pyblp\configurations\iteration.py", line 284, in _iterate
    raw_final, converged = self._iterator(
  File "...\pyblp\configurations\iteration.py", line 368, in simple_iterator
    x0, (x, weights) = x, contraction(x)[:2]
  File "...\pyblp\configurations\iteration.py", line 273, in contraction_wrapper
    values, weights, jacobian = contraction(values, iterations, evaluations)
  File "...\pyblp\markets\market.py", line 780, in contraction
    capital_lamda_inv = np.diag(1 / capital_lamda_diagonal)
  File "...\pyblp\utilities\basics.py", line 661, in __call__
    self.detected = self.error()
  File "...\pyblp\utilities\basics.py", line 565, in __init__
    super().__init__()
  File "...\pyblp\utilities\basics.py", line 498, in __init__
    self.stack = ''.join(traceback.format_stack())

 Errors encountered: divide by zero, invalid value.

I'm currently fighting with VSCode to get universal_display=True to work so I can see where in the iteration things fall apart, but hopefully the above is helpful.

jeffgortmaker commented 1 year ago

Great! So the issue is with inverting the diagonal Lambda matrix, which is a function of choice probabilities s_ijt and how utility changes with price, dU_ijt / dp_jt. It looks like one or more elements are zero, or close to zero.
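
Here's a minimal illustration of how that produces exactly the errors in your traceback (the numbers are hypothetical):

import numpy as np

# if some product's choice probabilities underflow, its Lambda diagonal entry can be exactly zero
capital_lamda_diagonal = np.array([0.5, 0.25, 0.0])
capital_lamda_inv = np.diag(1 / capital_lamda_diagonal)  # RuntimeWarning: divide by zero; entry becomes inf
# downstream arithmetic with the resulting inf then produces the "invalid value" errors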

In theory, if dU_ijt / dp_jt < 0 for all ijt (i.e. demand is always downward sloping), then Lambda is invertible because s_ijt > 0 for all ijt. In your specification demand may be upward sloping for some consumer types i with a particularly strong unobserved preference for price. Aside: this undesirable feature is one reason why people often prefer a lognormal random coefficient on price.

But some dU_ijt / dp_jt > 0's canceling with the negative ones seems like an unlikely edge-case. More likely is that some particularly disliked product j has choice probabilities s_ijt that are near zero or underflow to zero. It would be helpful to confirm this by figuring out at what prices you get the error, and then checking what choice probabilities look like at these prices (e.g. by setting prices equal to those prices, using iteration=pyblp.Iteration('return'), and inspecting choice probabilities computed with SimulationResults.compute_probabilities).
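
That check would look something like this sketch (problem_prices is a hypothetical array holding the prices at which you hit the error):

results = simulation.replace_endogenous(
    prices=problem_prices,                # hypothetical: the prices at which the error occurred
    iteration=pyblp.Iteration('return'),  # return these prices instead of solving the fixed point
)
probabilities = results.compute_probabilities()  # look for values that underflow to zero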

If you're currently using the default iteration configuration for Simulation.replace_endogenous, you should be able to get a universal display by passing iteration=pyblp.Iteration('simple', {'atol': 1e-12}, universal_display=True).

On my end, I just pushed a possible "fix" in dddd140f575f5f1fe49e1025182006e98d02b2a1. After inverting Lambda, I just replace invalid values (including the infinity you're getting from division by zero) with a large number. Previously, the way I had things coded up replaced invalid values in the next iteration's equilibrium prices with their last values, which was probably too extreme. There's no perfect solution here since we're in "numerical instability land," but hopefully this will allow iteration to continue a bit, while still letting you know that it's running into division by zero issues. But let me know how it looks. (You can use the latest dev code by cloning this GitHub repo and adding it to your PYTHONPATH environment variable.)

llmafrica commented 1 year ago

Maybe this is a silly question, but should I be seeing the output with universal_display = True in the terminal? I recall running this previously and seeing a table that gave the market and the current number of iterations as well as a few other columns. However now there doesn't appear to be any change in what is displayed whether this is True or False. All I am seeing is the traceback.

jeffgortmaker commented 1 year ago

Yes you should see iteration-by-iteration information with universal_display=True in the integration configuration you pass to Simulation.replace_endogenous. Try disabling verbose tracebacks when doing this, since they will clog up the display.

llmafrica commented 1 year ago

That's not the case for me anymore. Even with verbose tracebacks turned off, the universal display is only giving me the error message. I'll have to get back to you on your other suggestions after I figure this out.

jeffgortmaker commented 1 year ago

Sounds good. And huh strange. If you're using notebooks you may need to refresh some older code to get the universal display flag to work. But if you think it's a bug that the universal display isn't showing up, I'm also happy to look at the full code itself (or even better a minimal working example).

llmafrica commented 1 year ago

Regarding the universal display issue: I was getting interference from an earlier command that turned off outputs. Thanks for the advice.

Second, I think I successfully utilized the fix you mentioned above, but it unfortunately doesn't seem to have changed much. Traceback looks like it is the same as above, but I'm unsure if that would be expected or if there was an error on my part in implementing your fix.

I used error_behavior = "warn" to check the choice probabilities at those prices, and there are multiple instances of values that underflow to zero or are otherwise quite small (2.0e-19 and smaller).

jeffgortmaker commented 1 year ago

So if you implemented the possible fix (by uninstalling PyBLP installed with pip, pulling the most recent dev code, and adding it to your PYTHONPATH), you should still be seeing the "divide by zero" error, but I think you hopefully shouldn't be seeing the "invalid value" error any more? This is because the code now reverts any invalid values created by division by zero. Let me know if that's the case?

If so, I'll have to think about some other fix that helps keep iteration going.

Until then, two other suggestions for you:

  1. You're going to get underflow of Lambda when all choice probabilities for a product are less than around 1e-300. You should try to figure out what it is about these products that's causing this. Maybe there's an issue there.

  2. If the issue is just that during iteration these products are getting really high prices, you may want to consider trying different starting values for prices (ones that don't lead to underflow), or implementing a custom iteration routine that avoids very large prices, perhaps by manually bounding the values that prices can take on (see the sketch below).
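
Here's a sketch of what I mean by bounding, written as a custom routine in the form pyblp expects, method(initial, contraction, callback, **options) -> (final, converged); the bounds and tolerances are illustrative, not recommendations:

import numpy as np
import pyblp

def bounded_simple_iterator(initial, contraction, callback, max_evaluations, tol):
    x = initial
    converged = False
    for _ in range(max_evaluations):
        callback()  # lets pyblp track progress and update any universal display
        x0, (x, weights, _) = x, contraction(x)
        x = np.clip(x, 0.0, 1e3)  # keep candidate prices in a plausible range (illustrative bounds)
        if np.linalg.norm(x - x0) < tol:
            converged = True
            break
    return x, converged

iteration = pyblp.Iteration(bounded_simple_iterator, {'max_evaluations': 1000, 'tol': 1e-12})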

llmafrica commented 1 year ago

Yes, okay. I did implement it correctly then, as now it's only a "divide by zero" error. Thank you for the suggestions. Hopefully one of them will help sort out the problem.

jeffgortmaker commented 1 year ago

Great, let me know.

llmafrica commented 1 year ago

I really appreciate all of the advice thus far, but I had a few more questions moving forward.

First, should I expect the iteration routine to give negative values for the simulated prices? With the standard iteration options provided, in addition to some really huge prices, I am getting a handful of simulated prices < 0. Is that indicative of something wrong?

Second, do you have any resources for how to design a custom iteration routine? This is something that I'm not very familiar with.

jeffgortmaker commented 1 year ago

If you expect prices to be positive in your setting, then it could be an indication of something wrong. But:

  1. Nothing guarantees positive prices, and the level of prices themselves may not be particularly meaningful since you've normalized the utility of the outside option to zero and may have fixed effects.

  2. I'd also take a look at marginal costs to see if they seem reasonable in your setting -- sometimes there's a good reason for them to be negative, for example for loss leading goods.

  3. Negative values could also be the result of sampling error -- I'd consider doing a bootstrap procedure to get confidence intervals for your counterfactual.

You should take a look at the iteration docs. At the bottom there's a link to a tutorial that provides an example of how to design a custom iteration routine. You're just iterating on a vector.

llmafrica commented 1 year ago

I've bootstrapped my estimates for CS, MC, markups, and elasticity. Most of the estimates look reasonable, though MC for two of my market types appears negative. The confidence intervals for elasticity look reasonable, but for many estimates the bounds are pretty big, with a few as large in magnitude as -2.1e+12. This is only with 100 draws, though, due to limitations on my machine. I am not sure what you mean by the last line of your first paragraph. None of the counterfactuals converge. Do you mean to use error_behavior = 'warn' for bootstrapping the counterfactual?

Secondly, I am pretty lost with setting up the iteration. I'm fairly new to Python coding and some of the callables are unfamiliar to me. Do you have or know of any resources that I could use to read up on contraction and callback()? I attempted to use the example provided in your iteration tutorial, but I kept getting missing positional argument errors for these two.

jeffgortmaker commented 1 year ago

I'd expect you to get numerical errors once in a while when doing a bootstrap -- so it's often important to look into what about those particular draws led to large numbers. But ultimately standard practice is to just look at the middle 95% of draws for your confidence interval -- this will hopefully drop draws that lead to numerical problems. Hopefully there won't be so many that it'll bias your confidence intervals too much.

The first thing I'd do is try to get your counterfactuals to converge. (This will probably involve following the above advice, perhaps inspecting stuff iteration-by-iteration with a custom iteration configuration.) After that, my suggestion in the last line of my first paragraph is basically to follow the note in the blue box on the bootstrap page, where you'd loop over individual bootstrap draws, and for each, re-create your Simulation and re-run your counterfactual. Does that make sense?
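
In code, that loop would be something like this sketch (the draw containers like beta_draws are hypothetical and depend on how you store your bootstrap draws):

import numpy as np
import pyblp

n_draws = 100  # e.g., however many bootstrap replications your machine can afford
cs_draws = []
for d in range(n_draws):
    sim_d = pyblp.Simulation(
        product_formulations=product_formulations,
        beta=beta_draws[d],      # hypothetical: parameter draws from bootstrap replication d
        sigma=sigma_draws[d],
        gamma=gamma_draws[d],
        product_data=market1dat,
        agent_data=market1agents,
    )
    results_d = sim_d.replace_endogenous(error_behavior='warn')
    cs_draws.append(results_d.compute_consumer_surpluses())

# middle 95% of draws as a confidence interval
lower, upper = np.percentile(np.asarray(cs_draws), [2.5, 97.5], axis=0)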

The official Python docs for functions are here. It's difficult for me to help beyond pointing you to this -- I'm not sure why you're getting missing positional argument errors without a minimal working example that generates them!

jeffgortmaker commented 1 year ago

I'm going to close this for now, but please feel free to re-open or keep commenting if you make more progress or have more questions.

llmafrica commented 1 year ago

Hi Jeff,

I am still working on the above problem. After going back and fixing some data issues I tried everything above, except for bounding the iteration routine.

I copied your code from the tutorial to try to understand what was going on so that I could properly write my own. I realized that I was going wrong when implementing your custom routine because I wasn't providing all of the options. After I provided the following, everything seemed to work out (at least with a toy example):

sim.replace_endogenous(iteration=pyblp.Iteration(custom_method, {'norm': np.linalg.norm, 'max_evaluations': 500, 'tol': 8e-14}))

However, I'm still not understanding a few things. In your iteration tutorial, as well as in your code for the simple_iterator and squarem_iterator functions in iteration.py, I see contraction(x), but I am having trouble understanding what it does or where I can find more information about it. Is this information available in the functions docs you provided above? Does this matter for coding my own routine?

jeffgortmaker commented 1 year ago

Right, so contraction is a function object that is passed to your custom fixed point iteration routine. For computing equilibrium prices, it accepts the current values of prices x, and returns a tuple (explained in the iteration docs) whose most important element is the first one, the "next" values of equilibrium prices. When this vector of next prices is very close to your current vector, you'd want to terminate iteration by returning a (final, converged) tuple from your custom function, where final is the final vector of equilibrium prices and converged is True, indicating that you converged.

The actual contraction is referenced in the replace_endogenous docs (https://pyblp.readthedocs.io/en/stable/_api/pyblp.Simulation.replace_endogenous.html). We use Morrow-Skerlos's "zeta-markup" contraction to solve for equilibrium prices. There's more info in background (https://pyblp.readthedocs.io/en/stable/background.html#equation-zeta-contraction) and our paper if you're interested, but the important thing is that iterating on it tends to work well. It doesn't particularly matter how it works for coding your own routine.

llmafrica commented 1 year ago

Hi Jeff,

Thanks for the explanation and all of the help thus far. You've been a great help as I try to get a grasp on what is going on in my problem.

I asked how contraction works because I was hoping to figure out a way to control "where" the function looks for prices. However, from what I've looked through, it doesn't look like the function supports any kind of bounds. I now have a custom iteration method that adds the following to your example code for simple iteration:

redo = 0
while True:
    if np.amax(x) < 5000 and np.amin(x) > -5000:
        redo = 0
        break
    else:
        x0, (x, weights, _) = x, contraction(x)
        redo += 1  # without this increment the cap below is never reached
        if redo == 1000:
            break

The issue is that this only really extends the total number of iterations. Is it possible to supply the contraction function with bounds?

jeffgortmaker commented 1 year ago

I don't have any references for how to add bounds to essentially a multidimensional root-finding algorithm, and you're right that SciPy's methods don't seem to support bounds.

One option would be to "reset" x if it goes outside the bounds by re-drawing it uniformly from within your bounds. Then hope that the contraction converges from this new point. This would really be equivalent to just computing equilibrium prices many times at different starting points, which is something that I recommend anyways when solving a nonlinear optimization problem with potentially many local optima.
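
A sketch of that reset idea, under the same custom-method protocol as before (the bounds match your snippet above; the seed is illustrative):

import numpy as np
import pyblp

rng = np.random.default_rng(0)

def resetting_iterator(initial, contraction, callback, max_evaluations, tol, lb=-5e3, ub=5e3):
    x = initial
    for _ in range(max_evaluations):
        callback()
        x0, (x, weights, _) = x, contraction(x)
        if np.amin(x) < lb or np.amax(x) > ub:
            x = rng.uniform(lb, ub, size=x.shape)  # re-draw uniformly within the bounds
            continue
        if np.linalg.norm(x - x0) < tol:
            return x, True
    return x, False

iteration = pyblp.Iteration(resetting_iterator, {'max_evaluations': 5000, 'tol': 1e-12})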

All this said, my advice is still to first try to figure out exactly why your counterfactuals aren't converging. So try to get the simplest possible example where you get convergence, and where adding one extra thing (e.g. one more parameter) breaks convergence. It's hard to give advice about how to deal with convergence issues when we don't know what the issue is in the first place.

llmafrica commented 1 year ago

I see, well thank you anyway. I'll see if I can figure something out.

Thank you so much! You were immensely helpful throughout all of this and I really appreciate your time and patience.

llmafrica commented 1 year ago

Hi Jeff,

I've been looking more closely into using log-normal random coefficients like you suggested some time ago. I experimented with them and by doing so I was able to get all of my markets to converge in counterfactual simulations. I want to make sure that I am doing this correctly.

My problem formulations are as follows:

x1 = pyblp.Formulation('1 + y1 + y2 + y3')
x2 = pyblp.Formulation('0 + I(-prices + subsidy)')
x3 = pyblp.Formulation('1 + y1 + y2 + y3 + y4 + y5')
agent_formulation = pyblp.Formulation('1')
pform = (x1, x2, x3)

Here the subsidy is exogenous and only changes the prices that consumers pay; the firms still receive the full price. Because x2 is formulated as above with negative prices, I should initialize pi such that initial_pi > 0 and, if necessary, bound pi from below by 0, correct?

Additionally, when interpreting the output below, for example, is this log(sigma) = 0.01 and log(pi) = 0.84?

Nonlinear Coefficient Estimates (Robust SEs in Parentheses):
===================================================================================
Types:  |      Sigma:        subsidy - prices   |       Pi:                 1
------  |  ----------------  ------------------  |  ----------------  --------------
 Log    |  subsidy - prices     +1.00000E-02     |  subsidy - prices   +8.40000E-01
        |                      (+1.90039E-02)    |                    (+7.54019E-02)
===================================================================================

jeffgortmaker commented 1 year ago

Great, glad they're converging.

Your formulations look right, but I think your interpretation of the parameters is a bit off. What enters into utility here is

u_ij = ... + (subsidy_j - p_j) * exp(pi + sigma * nu_i) + ...

So you should interpret the output as the parameters pi and sigma of the lognormal distribution of alpha_i,

log alpha_i ~ N(pi, sigma^2) if nu_i ~ N(0, 1)

Note that the mean parameter pi can be negative or positive, and you'll still have alpha_i > 0, i.e. downward sloping demand (which is probably what's helping your counterfactuals converge). So you probably don't want to bound it from below.
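
To make this concrete, here's a quick sketch of what your estimates imply for the distribution of alpha_i (numbers taken from your table above):

import numpy as np

pi, sigma = 0.84, 0.01
nu = np.random.default_rng(0).standard_normal(100_000)
alpha = np.exp(pi + sigma * nu)  # log alpha_i ~ N(pi, sigma^2), so alpha_i > 0 always

print(alpha.mean())  # close to exp(pi + sigma**2 / 2), about 2.32
print(alpha.min())   # strictly positive: demand slopes down for every consumer type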