tlamadon / pytwoway

Two way models in python
MIT License

BLM variance decomposition - overestimated worker effects for low values of real alpha #55

Open MartinHaus1993 opened 9 months ago

MartinHaus1993 commented 9 months ago

Hi there,

I am currently trying to use the BLM estimator and have noticed that I get non-zero values for worker effects where I expect none. I went back to the BLM example and used it to simulate datasets over a range of values for both alpha and psi. The code below simulates each dataset and runs a variance decomposition; I then plot the estimated psi and alpha values against the real ones.


import numpy as np
import bipartitepandas as bpd
import pytwoway as tw
from pytwoway import constraints as cons
from matplotlib import pyplot as plt

nl = 3 # Number of worker types
nk = 3 # Number of firm types
blm_params = tw.blm_params({
    'nl': nl, 'nk': nk,
    'a1_mu': 0, 'a1_sig': 1.5, 'a2_mu': 0, 'a2_sig': 1.5,
    's1_low': 0.5, 's1_high': 1.5, 's2_low': 0.5, 's2_high': 1.5, 'verbose': 0
})
cluster_params = bpd.cluster_params({
    'measures': bpd.measures.CDFs(),
    'grouping': bpd.grouping.KMeans(n_clusters=nk),
    'is_sorted': True,
    'copy': False
})
clean_params = bpd.clean_params({
    'drop_returns': 'returners',
    'copy': False
})

# Containers for the true and estimated variance components
real_values_a = []
real_values_psi = []
real_values_cov = []
estimated_a = []
estimated_psi = []
estimated_cov = []
monotonic_mover = []
monotonic_stayer = []

# Grid of true alpha_sig and psi_sig values to loop over
values = np.arange(0, 3, 0.05)

for a in values:
    for p in values:
        sim_params = bpd.sim_params({
            'nl': nl, 'nk': nk,
            'alpha_sig': a, 'psi_sig': p, 'w_sig': 0.6,
            'c_sort': 0, 'c_netw': 0, 'c_sig': 1
        })

        sim_data = bpd.SimBipartite(sim_params).simulate()

        # Convert into BipartitePandas DataFrame
        sim_data = bpd.BipartiteDataFrame(sim_data)
        # Clean and collapse
        sim_data = sim_data.clean(clean_params).collapse(is_sorted=True, copy=False)
        # Cluster
        sim_data = sim_data.cluster(cluster_params)
        # Convert to event study format
        sim_data = sim_data.to_eventstudy(is_sorted=True, copy=False)

        sim_data_observed = sim_data.loc[:, ['i', 'j1', 'j2', 'y1', 'y2', 't11', 't12', 't21', 't22', 'g1', 'g2', 'w1', 'w2', 'm']]
        sim_data_unobserved = sim_data.loc[:, ['alpha1', 'alpha2', 'k1', 'k2', 'l1', 'l2', 'psi1', 'psi2']]

        movers = sim_data_observed.get_worker_m(is_sorted=True)
        jdata = sim_data_observed.loc[movers, :]
        sdata = sim_data_observed.loc[~movers, :]

        # Initialize BLM estimator
        blm_fit = tw.BLMEstimator(blm_params)
        # Fit BLM estimator
        blm_fit.fit(jdata=jdata, sdata=sdata, n_init=40, n_best=5, ncore=16)
        # Store best model
        blm_model = blm_fit.model

        # Per-iteration log-likelihoods for movers and stayers
        liks1 = blm_model.liks1
        print('Log-likelihoods monotonic (movers):', np.min(np.diff(liks1)) >= 0)
        monotonic_mover.append(np.min(np.diff(liks1)) >= 0)
        liks0 = blm_model.liks0
        print('Log-likelihoods monotonic (stayers):', np.min(np.diff(liks0)) >= 0)
        monotonic_stayer.append(np.min(np.diff(liks0)) >= 0)

        # Initialize BLM variance decomposition estimator
        blm_fit = tw.BLMVarianceDecomposition(blm_params)
        # Fit BLM estimator
        blm_fit.fit(
            jdata=jdata, sdata=sdata,
            blm_model=blm_model,
            n_samples=20,
            Q_var=[
                tw.Q.VarAlpha(),
                tw.Q.VarPsi(),
                tw.Q.VarPsiPlusAlpha()
            ],
            ncore=16
        )

        for k, v in blm_fit.res.items():
            print(f'{k!r}: {v.mean()}')

        # Record the true moments (from the unobserved simulation columns)
        # and the mean estimated moments across the decomposition samples
        real_values_a.append(np.var(sim_data_unobserved['alpha1']))
        real_values_psi.append(np.var(sim_data_unobserved['psi1']))
        real_values_cov.append(np.cov(sim_data_unobserved['psi1'], sim_data_unobserved['alpha1'], ddof=0)[0, 1])
        estimated_a.append(np.mean(blm_fit.res['var(alpha)']))
        estimated_psi.append(np.mean(blm_fit.res['var(psi)']))
        estimated_cov.append(np.mean(blm_fit.res['cov(psi, alpha)']))

# Psi: estimated vs. real, coloured by the real alpha variance
points = plt.scatter(real_values_psi, estimated_psi, c=real_values_a)
cbar = plt.colorbar(points)
cbar.set_label("real alpha")
plt.xlabel("Real")
plt.ylabel("Estimated")
plt.title("Psi")
hi = np.max([np.max(real_values_psi), np.max(estimated_psi)])
plt.plot([0, hi], [0, hi])  # 45-degree line
plt.savefig("./Figures/Jan24/BLM_psi_plot.png")
plt.show()

# Alpha: estimated vs. real, coloured by the real psi variance
points = plt.scatter(real_values_a, estimated_a, c=real_values_psi)
cbar = plt.colorbar(points)
cbar.set_label("real psi")
plt.xlabel("Real")
plt.ylabel("Estimated")
plt.title("Alpha")
hi = np.max([np.max(real_values_a), np.max(estimated_a)])
plt.plot([0, hi], [0, hi])  # 45-degree line
plt.savefig("./Figures/Jan24/BLM_alpha_plot.png")
plt.show()

# Cov(psi, alpha): estimated vs. real
points = plt.scatter(real_values_cov, estimated_cov)
plt.xlabel("Real")
plt.ylabel("Estimated")
plt.title("Cov(psi, alpha)")
lo = np.min([np.min(real_values_cov), np.min(estimated_cov)])
hi = np.max([np.max(real_values_cov), np.max(estimated_cov)])
plt.plot([lo, hi], [lo, hi])  # 45-degree line
plt.savefig("./Figures/Jan24/BLM_cov_plot.png")
plt.show()

import pandas as pd

data = pd.DataFrame({
    'real_a': real_values_a,
    'estimated_a': estimated_a,
    'real_psi': real_values_psi
})

# Zoom in on low true alpha variances
selected = data.loc[data['real_a'] <= 0.5]

points = plt.scatter(selected['real_a'], selected['estimated_a'], c=selected['real_psi'])
cbar = plt.colorbar(points)
cbar.set_label("real psi")
plt.xlabel("Real")
plt.ylabel("Estimated")
plt.title("Alpha")
hi = np.max([np.max(selected['real_a']), np.max(selected['estimated_a'])])
plt.plot([0, hi], [0, hi])  # 45-degree line
plt.savefig("./Figures/Jan24/BLM_alpha_zoom_plot.png")
plt.show()

print("Monotonic movers: " + str(monotonic_mover.count(True)))
print("Non-monotonic movers: " + str(monotonic_mover.count(False)))

print("Monotonic stayers: " + str(monotonic_stayer.count(True)))
print("Non-monotonic stayers: " + str(monotonic_stayer.count(False)))

While the firm effects appear to be recovered well, the worker effects are frequently overestimated when the true alpha variance is low.

Psi looks good overall: BLM_psi_plot

Alpha is also doing well for higher values: BLM_alpha_plot

But when zooming in on lower values: BLM_alpha_zoom_plot

Lower psi values look fine: BLM_psi_zoom_plot

I suspect that I am missing some parameter I should change for the variance decomposition to work as expected. Could you point me to which one might control this behaviour?

I also found that around a third of the log-likelihood sequences for movers are not monotonic, and none are monotonic for stayers. Do you have any hints on why this might be the case?

Thank you!

Best, Martin

adamoppenheimer commented 9 months ago

Hi Martin,

Thanks for spending so much time with PyTwoWay!

I don't have time right now to look into this deeply, but I think one potential issue is your BLM parameters: a1_sig and a2_sig are pretty high if the true variation is actually near zero. They will make the initial guesses for the group-level means relatively spread out, even though your simulation keeps them very close together.

Unfortunately the BLM estimator is not guaranteed to converge to the global optimum, so you have to be very careful about how you choose your initial guesses. Can you please let me know what happens if you lower those values?
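
For example, something along these lines (just a sketch reusing your parameter dictionary; the 0.1 is illustrative, not a recommendation):

# Tighter initial dispersion for the group-level means (illustrative values)
blm_params = tw.blm_params({
    'nl': nl, 'nk': nk,
    'a1_mu': 0, 'a1_sig': 0.1, 'a2_mu': 0, 'a2_sig': 0.1,
    's1_low': 0.5, 's1_high': 1.5, 's2_low': 0.5, 's2_high': 1.5, 'verbose': 0
})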

Also, regarding the log-likelihoods, it's fine if they're not completely monotonic as long as the overall trend is increasing. There is a correction to the likelihoods in the code that can break monotonicity, but the non-monotonic dips should be quite small compared to the overall increase over time.
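
If it helps, a quick way to eyeball the overall trend (this just plots the liks1/liks0 arrays your script already reads off the fitted model):

# Inspect the likelihood paths visually rather than via np.diff alone
plt.plot(blm_model.liks1, label='movers')
plt.plot(blm_model.liks0, label='stayers')
plt.xlabel('EM iteration')
plt.ylabel('log-likelihood')
plt.legend()
plt.show()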

Best, Adam

MartinHaus1993 commented 9 months ago

Hi Adam,

Thanks a lot for this quick response!

I have tried lower values of a1_sig and a2_sig, but the behaviour is similar. Below are the results with both set to 0.05; they look very much like the results with 1.5.

Alpha: BLM_alpha_plot

Zooming in on lower values for alpha: BLM_alpha_zoom_plot

Psi seems fine: BLM_psi_zoom_plot-2

Some alpha values are really well recovered while others are not. Is there anything that might explain this behaviour?

Regarding monotonicity, a graphical inspection of the iterations indicates that the overall trend holds, so this seems to be fine.

Thanks again!

Best, Martin

adamoppenheimer commented 9 months ago

Hi Martin,

Thanks for following up on this again!

One additional idea is to see what happens if you make the variance of the simulated noise quite low; in particular, you could try something like this:

sim_params = bpd.sim_params({
    'nl': nl, 'nk': nk,
    'alpha_sig': a, 'psi_sig': p, 'w_sig': 0.05,  # noise lowered from 0.6
    'c_sort': 0, 'c_netw': 0, 'c_sig': 1
})

Since this will essentially eliminate any noise, it should make the estimates much more precise. Then we can diagnose whether the issue is that the estimator isn't estimating the alphas precisely enough under higher-variance noise, or that the variance decomposition has an issue even when the alphas are precisely estimated.

Best, Adam

MartinHaus1993 commented 9 months ago

Hi Adam,

Thanks so much for looking into this. You are right! I tried those simulation parameters and now even the low alphas are estimated correctly:

BLM_alpha_zoom_plot-2

Unfortunately, this imprecise estimation of the alphas also occurs in the real data I work with. Do you have any ideas how this could be improved?

Thank you!

Best wishes, Martin

adamoppenheimer commented 9 months ago

Hi Martin,

This is surprising; the estimator seems to be biased in this case. Maybe a bias correction can be derived; I'll reach out to Professor Lamadon.

I know this is mostly just cheating and isn't a proper solution, but what happens if you simulate with nl=3 but estimate with nl=2?

Maybe part of the issue is that because the noise has higher variance than the alphas, the clusters are moving further apart to better fit the noise. Then if you reduce the number of worker types, it could force the clusters to be closer to each other.
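
Concretely, something like this (a sketch: the simulation keeps nl = 3, only the estimator's nl changes):

# Estimate with fewer worker types than were simulated (nl=2 vs. simulated nl=3)
blm_params_small = tw.blm_params({
    'nl': 2, 'nk': nk,
    'a1_mu': 0, 'a1_sig': 1.5, 'a2_mu': 0, 'a2_sig': 1.5,
    's1_low': 0.5, 's1_high': 1.5, 's2_low': 0.5, 's2_high': 1.5, 'verbose': 0
})
blm_fit = tw.BLMEstimator(blm_params_small)
blm_fit.fit(jdata=jdata, sdata=sdata, n_init=40, n_best=5, ncore=16)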

But this probably just mechanically reduces the estimated var(alpha), so it doesn't actually address the problem.

Best, Adam

MartinHaus1993 commented 9 months ago

Hi Adam,

Thanks for this! Yes, I think you are right. At the other end of the range, though, this leads to an underestimation of var(alpha) when the true variance is not close to 0 (and an overestimation when it is slightly above). See the results below:

Alpha: BLM_alpha_plot-2

And zoomed in on lower values: BLM_alpha_zoom_plot-3

Thanks again for following up on this so promptly, much appreciated.

Best, Martin

MartinHaus1993 commented 9 months ago

Just adding this in case it is useful: if I increase the noise further (say, to 1.6), this is the result (using the same nl for simulation and estimation):

BLM_alpha_plot-3

tlamadon commented 9 months ago

Hello, and thank you for looking into this! The fact that it seems to work without noise is comforting. Am I correct that you are using the non-linear BLM estimator? It also seems that you are simulating a model without sorting (c_sort) or network effects (c_netw). I wonder whether that means the identifying assumptions of the paper are not satisfied. It might still work because of the parametric form (normal in the simulated data and normal in the model), but it could generate issues. I am not exactly sure what mobility looks like here, but I worry it is close to random (all workers moving the same way across firms).

Another thing that could be useful here is to use the true clusters in estimation and see if that resolves the issue.
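
For instance, something roughly like this in place of the cluster step in your script (a sketch: it copies the simulated firm type k into the cluster column g; I have not checked every detail against the bipartitepandas internals, so treat it as a diagnostic hack):

# Use the true simulated firm types as clusters instead of k-means groups
sim_data = sim_data.clean(clean_params).collapse(is_sorted=True, copy=False)
sim_data['g'] = sim_data['k']  # replaces sim_data.cluster(cluster_params)
sim_data = sim_data.to_eventstudy(is_sorted=True, copy=False)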

Note that it is possible for the estimator to have a bit of bias in finite samples, since it is based on MLE.

MartinHaus1993 commented 9 months ago

Thanks so much for looking into this. Yes, I am using the static BLM estimator. This is all based on simulated data using the BLM example parameters as a baseline (first post).

I think your point might indeed explain things. Going back to the paper, I think assumption 3 indeed fails if var(alpha) is zero or close to it, since that forces cov(psi, alpha) to be zero: no "sorting" can be present. I guess increasing c_sort would not help here, because it only has an effect if there is some variation in alpha. Do I understand this correctly?

I am just wondering what this implies for applied research. If one wants to test whether workers matter/are heterogeneous, one should probably be careful with the BLM estimator, as it is only identified when both var(alpha) ≠ 0 and some kind of sorting hold. So, while the BLM estimator is useful IF one can be sure that workers are heterogeneous and there is sorting, it will have an upward bias for var(alpha) and is hence less useful for testing var(alpha) == 0?

I have also just run the above with

sim_params = bpd.sim_params({
    'nl': nl, 'nk': nk,
    'alpha_sig': a, 'psi_sig': p, 'w_sig': 1.6,
    'c_sort': 1, 'c_netw': 0.5, 'c_sig': 1
})

and here are the results:

Alpha: BLM_alpha_plot-5

Psi: BLM_psi_plot-3

Cov: BLM_cov_plot-2

tlamadon commented 9 months ago

OK, we might need to look at this last case, even though noise at 1.6 might be pretty severe.

About the assumptions, note that they are sufficient conditions for identification with interactions, which is clearly a bit demanding. I guess my question would be whether interactions are something you want to allow for in your analysis, or whether you are willing to assume that wages are log-linear. If you are OK with log-linear, then you can impose this in estimation, or even use one of the other estimators.
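
For example, if you want to stay within BLM but shut down the interactions, something like the following might work (a rough sketch: the 'cons_a' option and cons.Linear() are written from memory and should be checked against the pytwoway documentation; your script already imports pytwoway.constraints as cons):

# Hypothetical sketch: impose additive (log-linear) means via a constraint;
# verify the option name ('cons_a') and the constraint class against the docs
blm_params_linear = tw.blm_params({
    'nl': nl, 'nk': nk,
    'cons_a': cons.Linear(),
    'a1_mu': 0, 'a1_sig': 1.5, 'a2_mu': 0, 'a2_sig': 1.5,
    's1_low': 0.5, 's1_high': 1.5, 's2_low': 0.5, 's2_high': 1.5, 'verbose': 0
})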

Another point here is that getting the variance of alpha actually requires making some assumption on the serial correlation of the error over time. For movers, the static version of our model assumes independence, and then assumes that it is the same for movers and stayers. That is strong, but with these assumptions it should still work here. A fixed-effect model would likely overstate the variance of alpha, particularly among stayers, since empirically the serial correlation within a spell is quite large.

In my experience the variance of worker effect is very large in employer-employee matched data. It could be different in other applications.

I am going to run some simulations using your code and will get back to you with what I think is happening! I will be doing this in the debugging branch at https://github.com/tlamadon/pytwoway/tree/debugging

Thank you for your interest!