Closed MartinFriedrich93 closed 7 months ago
Hello again,
Today, I tried to work with a larger number of draws for the he-approximation as mentioned in the documentation. Setting ndraw_trace_he
to 100 and ndraw_lev_he
to 1000 did not resolve the issue though unfortunately. The he-estimate of var(xcont) is even more off from the true value if I try these options (see output).
Any help with this issue would be highly appreciated.
Thank you so much, Martin
Output:
results of plug-in (AKM) estimation:
var total share
var_total 0.834411 0.834411 1.000000
var_alpha 0.372120 0.834411 0.445967
var_psi 0.191529 0.834411 0.229538
var_xcont 0.094334 0.834411 0.113054
var_res 0.002773 0.834411 0.003323
2cov_psi_alpha 0.118499 0.834411 0.142015
2cov_alpha_xcont 0.004128 0.834411 0.004947
2cov_psi_xcont 0.051694 0.834411 0.061952
sum of the variance shares (should be ~1) and is:
1.0007964596391798
results of he-robust (KSS) estimation:
var total share
var_total 0.834411 0.834411 1.000000
var_alpha 0.296473 0.834411 0.355308
var_psi 0.147322 0.834411 0.176558
var_xcont 0.390887 0.834411 0.468459
var_res 0.097691 0.834411 0.117078
2cov_psi_alpha 0.204301 0.834411 0.244845
2cov_alpha_xcont 0.004160 0.834411 0.004986
2cov_psi_xcont 0.051657 0.834411 0.061909
sum of the variance shares (should be ~1) and is:
1.4291427903907292
compare true values to estimates:
index true plugin he
0 var_total 0.897533 0.834411 0.834411
1 var_alpha 0.298951 0.372120 0.296473
2 var_psi 0.148539 0.191529 0.147322
3 var_xcont 0.063122 0.094334 0.390887
4 var_res 0.099872 0.002773 0.097691
5 2cov_psi_alpha 0.214297 0.118499 0.204301
6 2cov_alpha_xcont 0.031172 0.004128 0.004160
7 2cov_psi_xcont 0.042653 0.051694 0.051657
Total Elapsed Time: 153.37 minutes
Updated code with ndraw_trace_he
set to 100 and ndraw_lev_he
set to 1000:
###############################################################################
### simulate some data where you know the true variance of wage components
###############################################################################
import numpy as np
import pandas as pd
import pytwoway as tw
import bipartitepandas as bpd
import multiprocess
import openpyxl
import os
import matplotlib.pyplot as plt
import time
start_time = time.time()
###############################################################################
### system req.
###############################################################################
# The code was last run on a 10-core 2.5GHz Intel processor with 256 GB RAM,
# the minimum required ammount of RAM is round about 8-10GB.
# The estimation finished in ~ 16 minutes.
###############################################################################
### parameters
###############################################################################
np.random.seed(43532)
##### gen data frame with workers (N=w_obs) and f_obs (N=f_obs) firms
w_obs = 500000
f_obs = 20000
# w_obs = 50000
# f_obs = 2000
# influence limited mobility bias via probability of moving firm in t=2 and t=3
move_prob = 0.05
### set amount of sorting
# this can range from 0 (perfect sorting) to 1 (no sorting at all)
sorting = 0.5
###############################################################################
### simulate data
###############################################################################
# gen worker IDs and 3-year panel
data1 = {'i': range(1,w_obs+1), 't':1}
data1 = pd.DataFrame(data1)
data2 = {'i': range(1,w_obs+1), 't':2}
data2 = pd.DataFrame(data2)
data3 = {'i': range(1,w_obs+1), 't':3}
data3 = pd.DataFrame(data3)
df = pd.concat([data1,data2]).sort_values('i')
df = pd.concat([df,data3]).sort_values('i')
# gen worker FE
data4 = {'i': range(1,w_obs+1), 'iFE': np.random.normal(1,np.sqrt(0.3),w_obs)}
data4 = pd.DataFrame(data4)
df = pd.merge(df, data4, on='i')
del data1
del data2
del data3
del data4
# gen firm IDs and introduce sorting
df['iFE_dec'] = pd.qcut(df['iFE'], q=f_obs, labels=False) + 1
df['j_range_min'] = df['iFE_dec'] - int(sorting*f_obs)
df['j_range_min'] = np.where(df['j_range_min']<1, 1, df['j_range_min']).astype('int')
df['j_range_max'] = df['iFE_dec'] + int(sorting*f_obs)
df['j_range_max'] = np.where(df['j_range_max']>f_obs, f_obs, df['j_range_max']).astype('int')
df['j'] = np.random.randint(low=df['j_range_min'], high=df['j_range_max'] + 1, size=len(df))
del df['iFE_dec']
# look at firm size distribution
# fsize = df.groupby(['j','t']).size()
# print("firm size distribution:")
# print(fsize.describe())
# plt.hist(fsize, bins=100, edgecolor='black')
# plt.show()
#increase limited mobility bias artifically:
df = df.sort_values(by=['i','t'])
df['j_tmin1'] = df.groupby('i')['j'].shift(1)
df_help = df[df['t']>1]
df_help['move_help'] = np.random.choice([0,1],size=len(df_help),p=[1-move_prob,move_prob])
df_help['move'] = np.where(df_help['move_help']==1,1,0)
df_help = df_help[['i','t','move']]
df=pd.merge(df,df_help,on=['i','t'],how='outer')
df.loc[df['move']==0, 'j'] = df['j_tmin1']
df_fail = df[(df['move']==1) & (df['j']==df['j_tmin1'])]
def generate_j_new(row):
new_value = row['j']
while new_value == row['j']:
new_value = np.random.randint(row['j_range_min'],row['j_range_max']+1)
return new_value
df_fail['j_new'] = df_fail.apply(generate_j_new, axis=1)
df_fail = df_fail[['i','t','j_new']]
df=pd.merge(df,df_fail,on=['i','t'],how='outer')
df['j'] = np.where(~df['j_new'].isna(), df['j_new'], df['j'])
df['j_tmin1'] = df.groupby('i')['j'].shift(1)
df.loc[df['move']==0, 'j'] = df['j_tmin1']
del df['j_new']
del df['j_tmin1']
del df['move']
del df['j_range_min']
del df['j_range_max']
del df_fail
del df_help
# measure probability to move in any year (to check that it worked)
# df_help = df
# df_help['j_tmin1'] = df_help.groupby('i')['j'].shift(1)
# df_help = df[df['t'] > 1]
# df_help['move'] = np.where(df_help['j']==df_help['j_tmin1'],0,1)
# print(df_help.head(9))
# print(df_help['move'].describe())
# gen firm FE
data2 = {'j': range(1,f_obs+1)}
df2 = pd.DataFrame(data2)
df2['jFE'] = np.random.normal(1,np.sqrt(0.2),f_obs)
df2 = df2.sort_values('jFE').reset_index()
df2['j_new'] = df2.index + 1
del df2['j']
df2['j'] = df2['j_new']
df2 = df2[['j', 'jFE']]
df = pd.merge(df, df2, on='j')
del df2
df = df.sort_values(by=['i','t'])
# gen xcont and introduce positive covariance with firm FE (and hence also worker FE)
df_help = df[['i','jFE']].groupby('i').agg('mean').reset_index()
df_help['xcont'] = np.random.normal(1000,np.sqrt(600),w_obs) + 15*df_help['jFE']
del df_help['jFE']
df = pd.merge(df,df_help,on='i')
del df_help
# gen xcont coefficient
df['xcont_coef'] = df['xcont'] * 0.01
# gen residual
df['res'] = np.random.normal(0,np.sqrt(0.1),len(df))
# gen wage
df['y'] = df['iFE'] + df['jFE'] + df['xcont_coef'] + df['res']
#aggregate true variances in full data set
# var_y = df['y'].agg('var')
# var_iFE = df['iFE'].agg('var')
# var_jFE = df['jFE'].agg('var')
# var_xcont = df['xcont_coef'].agg('var')
# var_res = df['res'].agg('var')
# cov_iFE_jFE = np.cov([df['iFE'],df['jFE']])[0][1]
# cov_iFE_xcont = np.cov([df['iFE'],df['xcont_coef']])[0][1]
# cov_jFE_xcont = np.cov([df['jFE'],df['xcont_coef']])[0][1]
# print(var_iFE + var_jFE + var_xcont + var_res + 2*cov_iFE_jFE + 2*cov_iFE_xcont+
# 2*cov_jFE_xcont)
# print(var_y)
###############################################################################
### isolate looc set
###############################################################################
df['worker_id'] = df['i']
df['firm_id'] = df['j']
clean_params = bpd.clean_params(
{
'connectedness': 'leave_out_spell',
'collapse_at_connectedness_measure': True,
'drop_single_stayers': False,
'drop_returns': False,
'copy': False
}
)
# Convert into BipartitePandas DataFrame and clean
bdf = bpd.BipartiteDataFrame(df).clean(clean_params)
###############################################################################
### measure variances and covariances in the looc set
###############################################################################
# isolate looc set and make sure #obs lines up
looc_workers = len(bdf[['worker_id']].drop_duplicates())
looc = bdf[['firm_id']].drop_duplicates()
df_looc = pd.merge(df,looc,on=['firm_id'],how='inner')
df_looc_workers = len(df_looc[['worker_id']].drop_duplicates())
print(looc_workers, df_looc_workers)
# aggregate true variances in looc data set
var_y = df_looc['y'].agg('var')
var_iFE = df_looc['iFE'].agg('var')
var_jFE = df_looc['jFE'].agg('var')
var_xcont = df_looc['xcont_coef'].agg('var')
var_res = df_looc['res'].agg('var')
cov_iFE_jFE = np.cov([df_looc['iFE'],df_looc['jFE']])[0][1]
cov_iFE_xcont = np.cov([df_looc['iFE'],df_looc['xcont_coef']])[0][1]
cov_jFE_xcont = np.cov([df_looc['jFE'],df_looc['xcont_coef']])[0][1]
# print(var_iFE + var_jFE + var_xcont + var_res + 2*cov_iFE_jFE + 2*cov_iFE_xcont+
# 2*cov_jFE_xcont)
# print(var_y)
###############################################################################
### decompose wages variance
###############################################################################
cts_control = ['xcont']
fecontrol_params = tw.fecontrol_params(
{
'he': True,
'ho': False,
'continuous_controls': cts_control,
'Q_var': [
tw.Q.VarCovariate('psi'),
tw.Q.VarCovariate('alpha'),
tw.Q.VarCovariate(cts_control)
],
'Q_cov': [
tw.Q.CovCovariate('psi', 'alpha'),
tw.Q.CovCovariate('alpha', cts_control),
tw.Q.CovCovariate('psi', cts_control),
],
'ncore': 10,
'ndraw_lev_he': 1000,
'ndraw_trace_he': 100
}
)
# Initialize FE estimator
fe_estimator = tw.FEControlEstimator(bdf, fecontrol_params)
# Fit FE estimator
fe_estimator.fit()
print(fe_estimator.summary)
###############################################################################
### report the results in a table
###############################################################################
### save decomp results
results = pd.DataFrame([fe_estimator.summary])
results_plugin = results.copy(deep=True)
results_he = results.copy(deep=True)
#print(results.columns)
# plugin
results_plugin['var_total'] = results.iloc[0,0]
results_plugin['var_res'] = results.iloc[0,1]
results_plugin['var_alpha'] = results.iloc[0,3]
results_plugin['var_psi'] = results.iloc[0,5]
results_plugin['var_xcont'] = results.iloc[0,7]
results_plugin['2cov_alpha_xcont'] = 2*results.iloc[0,9]
results_plugin['2cov_psi_alpha'] = 2*results.iloc[0,11]
results_plugin['2cov_psi_xcont'] = 2*results.iloc[0,13]
results_plugin = results_plugin[['var_total', 'var_alpha', 'var_psi',
'var_xcont', 'var_res', '2cov_psi_alpha',
'2cov_alpha_xcont', '2cov_psi_xcont']].transpose()
results_plugin = results_plugin.rename(columns={0:'var'})
results_plugin['total'] = results_plugin.iloc[0,0]
results_plugin['share'] = results_plugin['var'] / results_plugin['total']
print('results of plug-in (AKM) estimation:')
print(results_plugin)
print('sum of the variance shares (should be ~1) and is:')
# check that shares add up to 1
print(results_plugin.iloc[1,2] + results_plugin.iloc[2,2] + results_plugin.iloc[3,2] +
results_plugin.iloc[4,2] + results_plugin.iloc[5,2] + results_plugin.iloc[6,2] +
results_plugin.iloc[7,2])
# he
results_he['var_total'] = results.iloc[0,0]
results_he['var_res'] = results.iloc[0,2]
results_he['var_alpha'] = results.iloc[0,4]
results_he['var_psi'] = results.iloc[0,6]
results_he['var_xcont'] = results.iloc[0,8]
results_he['2cov_alpha_xcont'] = 2*results.iloc[0,10]
results_he['2cov_psi_alpha'] = 2*results.iloc[0,12]
results_he['2cov_psi_xcont'] = 2*results.iloc[0,14]
results_he = results_he[['var_total', 'var_alpha', 'var_psi',
'var_xcont', 'var_res', '2cov_psi_alpha',
'2cov_alpha_xcont', '2cov_psi_xcont']].transpose()
results_he = results_he.rename(columns={0:'var'})
results_he['total'] = results_he.iloc[0,0]
results_he['share'] = results_he['var'] / results_he['total']
print('results of he-robust (KSS) estimation:')
print(results_he)
print('sum of the variance shares (should be ~1) and is:')
# check that shares add up to 1
print(results_he.iloc[1,2] + results_he.iloc[2,2] + results_he.iloc[3,2] +
results_he.iloc[4,2] + results_he.iloc[5,2] + results_he.iloc[6,2] +
results_he.iloc[7,2])
### produce table on true and estmated variance components
data_table = {'var_total': var_y,'var_alpha':var_iFE, 'var_psi':var_jFE,
'var_xcont':var_xcont, 'var_res': var_res,
'2cov_psi_alpha':2*cov_iFE_jFE,
'2cov_alpha_xcont':2*cov_iFE_xcont,
'2cov_psi_xcont':2*cov_jFE_xcont}
table_true_vs_estimate = pd.DataFrame(data_table, index=[0]).transpose()
table_true_vs_estimate['true']=table_true_vs_estimate[0]
del table_true_vs_estimate[0]
table_true_vs_estimate = pd.merge(table_true_vs_estimate,results_plugin,
left_index=True, right_index=True)
table_true_vs_estimate['plugin'] = table_true_vs_estimate['var']
table_true_vs_estimate = table_true_vs_estimate[['true', 'plugin']]
table_true_vs_estimate = pd.merge(table_true_vs_estimate,results_he,
left_index=True, right_index=True)
table_true_vs_estimate['he'] = table_true_vs_estimate['var']
table_true_vs_estimate = table_true_vs_estimate[['true', 'plugin', 'he']].reset_index()
print('compare true values to estimates:')
print(table_true_vs_estimate)
# table_true_vs_estimate.to_excel('True_vs_plugin_he_w{}_f{}_mprob{}_sort{}.xlsx'
# .format(w_obs,f_obs,move_prob,sorting),
# index=False)
end_time = time.time()
elapsed_time = (end_time - start_time)/60
print(f"Total Elapsed Time: {elapsed_time:.2f} minutes")
Hello,
Just a quick update on my (limited) progress: I have tried the above code as well with bpd option 'collapse_at_connectedness_measure': False
. This solved the issue that the estimated total variance differed from the true observed total variance. I read in another thread that this is because observations are weighted by spell length if collapsed. I am working now with tw options 'ndraw_trace_he': 10'
and 'ndraw_lev_he': 400
(double the default values). However, the main problem remains that the estimated var_xcont, 2cov_psi_alpha, and 2cov_psi_xcont differ significantly from the true values.
Best, Martin
Updated output:
results of plug-in (AKM) estimation:
var total share
var_total 0.897532 0.897532 1.000000
var_alpha 0.372120 0.897532 0.414603
var_psi 0.191529 0.897532 0.213395
var_xcont 0.094334 0.897532 0.105103
var_res 0.065230 0.897532 0.072677
2cov_psi_alpha 0.118499 0.897532 0.132027
2cov_alpha_xcont 0.004128 0.897532 0.004599
2cov_psi_xcont 0.051694 0.897532 0.057595
sum of the variance shares (should be ~1) and is:
1.0000000046743283
results of he-robust (KSS) estimation:
var total share
var_total 0.897532 0.897532 1.000000
var_alpha 0.295728 0.897532 0.329490
var_psi 0.146728 0.897532 0.163479
var_xcont -1.175594 0.897532 -1.309807
var_res 0.099172 0.897532 0.110495
2cov_psi_alpha 0.205649 0.897532 0.229128
2cov_alpha_xcont 0.004118 0.897532 0.004588
2cov_psi_xcont 0.051596 0.897532 0.057486
sum of the variance shares (should be ~1) and is:
-0.41514117628984376
compare true values to estimates:
index true plugin he
0 var_total 0.897533 0.897532 0.897532
1 var_alpha 0.298951 0.372120 0.295728
2 var_psi 0.148539 0.191529 0.146728
3 var_xcont 0.063122 0.094334 -1.175594
4 var_res 0.099872 0.065230 0.099172
5 2cov_psi_alpha 0.214297 0.118499 0.205649
6 2cov_alpha_xcont 0.031172 0.004128 0.004118
7 2cov_psi_xcont 0.042653 0.051694 0.051596
total elapsed time: 32.99 minutes
Hi Martin,
I am sorry for not getting back, I've been busy the past few days.
I tried out your code and am not sure what the issue is, thank you for pointing this out.
Like you mentioned, I tried it without collapsing and also with collapsing but without weighting, and both cases fixed the var(y)
but didn't fix the other variances.
I apologize for this issue, once I get time I am going to try to figure out what is causing it.
Best, Adam
Hi Adam,
No worries. If you find time to look into the issue at some point, I would appreciate it a lot. Meanwhile, the package still does an amazing job at estimating var(alpha), var(psi) and cov(alpha, psi). So, thank you guys again so much for developing it!
Best Martin
Hi Martin,
I think this is an identification issue. We should try to understand how to overcome this, since this seems like it could be an obstacle to identification whenever you want to include continuous controls in these types of regressions.
Here's what I tried:
A'A
matrix becomes feasible):
w_obs = 5000
f_obs = 200
bdf['j'] = bdf['j'].astype(int)
after you clean the data, since it wasn't converting into an integer (I need to fix this in BipartitePandas)from scipy.sparse import csc_matrix, hstack
nn = len(bdf)
nf = bdf.n_firms()
nw = bdf.n_workers()
idx = np.arange(nn)
J = csc_matrix((np.ones(nn), (idx, bdf.loc[:, 'j'].to_numpy())), shape=(nn, nf))
J = J[:, 1:]
W = csc_matrix((np.ones(nn), (idx, bdf.loc[:, 'i'].to_numpy())), shape=(nn, nw))
A = hstack([J, W, bdf.loc[:, 'xcont_coef'].to_numpy()[:, None]])
AA = np.array((A.T @ A).todense())
Y = bdf.loc[:, 'y'].to_numpy()
gamma_hat = np.linalg.inv(AA) @ A.T @ Y
e_hat = Y - A @ gamma_hat
Notice that if you investigate the continuous coefficient (which you can check by running gamma_hat[-1]
), it is biased. In fact, if you change which firm you normalize (e.g. J[:, :-1]
instead of J[:, 1:]
), this changes the coefficients, and also changes e_hat
.
Unfortunately, I don't know how to proceed, do you have any thoughts?
Best, Adam
Hi Martin,
Please ignore the previous post, I figured out the issue.
I ran np.linalg.matrix_rank(AA)
and was (unsurprisingly) finding it was lower rank than the dimensions of the matrix.
Then I looked at how you simulate the continuous control, and it is identical at the worker level. This is why the regression is not identified - the continuous control is indistinguishable from an individual fixed effect.
I changed the code to simulate it to
df['xcont'] = np.random.normal(1000,np.sqrt(600),len(df)) + 15*df['jFE']
and now the code works.
Please let me know if it works for you after making this change!
Best, Adam
Hi Adam,
Thank you so much for looking into it! The code does indeed work now. I should have spotted the identification problem myself. Sorry to have consumed your time on this issue. Thank you so much again!
Best Martin
Hello,
I am interested in estimating the contributions of var(xcont), cov(alpha, xcont) and cov(psi,xcont) to the total var(y), where xcont is a single continuous control variable. I simulated some worker-firm data with limited mobility bias and positive var(xcont), cov(alpha, xcont) and cov(psi,xcont). I then used the fecontrol-estimator with the he-correction. The he-corrected estimates of var(alpha), var(psi) and cov(alpha, psi) look almost perfect. However, the estimates of var(xcont), cov(alpha, xcont) and cov(psi,xcont) are significantly off from the true values. Also, the estimated y(var) is significantly smaller than the true var(y). It looks like y is residualized in the process so any variation due to xcont is taken out, but I am not sure if this is what is happening. Last, the he-corrected variance estimates of the single variance components do not add up to the total variance.
Can you please help me to understand these apparent inconsistencies? Did I specify something wrong in my code? Or was the package actually not built to estimate var(xcont), cov(alpha, xcont) and cov(psi,xcont) but merely to take out any variation in y that is due to xcont?
Thank you for the great package and any help in advance!
Best Martin
Code:
Output: