tlamadon / pytwoway

Two way models in python
MIT License

Retrieving the firm and worker identifiers (Pytwoway 0.1.14.) #5

Closed: k-segiet closed this issue 3 years ago

k-segiet commented 3 years ago

My goal is to estimate the firm and worker fixed effects (psi_hat and alpha_hat) and merge them back onto the original population using the firm and worker identifiers (j and i). However, running the prep_data() function of the TwoWay class changes the identifiers: firm identifiers j are renumbered from 0 to J - 1 (where J is the number of firms) and worker identifiers i from 0 to N - 1 (where N is the number of workers).

How can I modify the code so that the original firm and worker identifiers are preserved? That would let me merge the estimated psi_hat and alpha_hat onto the original population by j and i.

Thank you for your help.

k-segiet commented 3 years ago

Adam's answer:

Thank you for reaching out!

I wrote up some example code to illustrate how this can be done. It takes advantage of the include_id_reference_dict option in BipartitePandas. Unfortunately, this means the data cleaning must be done manually, but that takes only a few extra lines of code.
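To make the idea of an id reference concrete, here is a minimal sketch in plain pandas. It is not how BipartitePandas implements include_id_reference_dict; it only illustrates the general technique of renumbering ids contiguously while keeping a lookup back to the originals. The toy values and the names i_ref, j_ref, clean and restored are hypothetical.

```python
import pandas as pd

# Toy data with non-contiguous original identifiers (hypothetical values)
df = pd.DataFrame({
    'i': [1001, 1001, 2050, 3999],   # original worker ids
    'j': ['A7', 'B2', 'B2', 'A7'],   # original firm ids
    'y': [1.0, 1.2, 0.8, 1.1],       # outcome, e.g. log wage
})

# pd.factorize returns contiguous integer codes (starting at 0)
# together with the unique original values, in order of first appearance
i_codes, i_uniques = pd.factorize(df['i'])
j_codes, j_uniques = pd.factorize(df['j'])

# "Cleaned" data that uses contiguous ids, as an estimator typically expects
clean = df.assign(i=i_codes, j=j_codes)

# Reference dictionaries: contiguous id -> original id
i_ref = dict(enumerate(i_uniques))
j_ref = dict(enumerate(j_uniques))

# Map the contiguous ids back to the original ones
restored = clean.assign(i=clean['i'].map(i_ref), j=clean['j'].map(j_ref))
```

As long as the lookup is saved at the moment the ids are renumbered, the original identifiers can be recovered after estimation.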

To run this on your own data, replace sim_data with your own data and delete the line that takes the subset i < 100.

Also note that I used some options I added after this issue was raised on GitHub; they make the estimator compute only the fixed effects and skip the variance/covariance estimates.

Best, Adam

import bipartitepandas as bpd
import pytwoway as tw
import pandas as pd

# Simulate data
sim_data = bpd.SimBipartite({'nk': 50, 'num_time': 2, 'num_ind': 1000}).sim_network()

# Manually clean data
# Set include_id_reference_dict=True to save the original ids
bdf = bpd.BipartiteLong(sim_data, include_id_reference_dict=True)
# Take a subset of the data so that the largest connected set is a subset of all firms
bdf = bdf[bdf['i'] < 100]
bdf = bdf.clean_data()
bdf.gen_m()

# Create TwoWay object
# bdf.original_ids() creates a dataframe with columns that give the original ids
tw_net = tw.TwoWay(bdf.original_ids())
# Skip the data cleaning step in the TwoWay object, but mark the data as clean
tw_net.clean = True

fe_params = {
    'ncore': 1, # Number of cores to use
    'batch': 1, # Batch size to send in parallel
    'ndraw_pii': 50, # Number of draws to use in approximation for leverages
    'levfile': '', # File to load precomputed leverages
    'ndraw_tr': 5, # Number of draws to use in approximation for traces
    'he': False, # If True, compute heteroskedastic correction
    'out': 'res_fe.json', # Output file where results are saved
    'statsonly': False, # If True, return only basic statistics
    'feonly': True, # If True, compute only fixed effects and not variances
    'Q': 'cov(alpha, psi)' # Which Q matrix to consider. Options include 'cov(alpha, psi)' and 'cov(psi_t, psi_{t+1})'
}

# Since we set 'feonly': True, we run the estimator as usual and it estimates only the fixed effects, which saves time
tw_net.fit_fe(fe_params)

# Now look at the data
new_data = tw_net.data
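
Once the estimates sit next to the original ids, merging them onto the original population is an ordinary pandas join. The sketch below uses a hand-built stand-in for the estimation output rather than real pytwoway results. The column names original_i, original_j, alpha_hat and psi_hat are assumptions for illustration, so check the actual columns of new_data before adapting it.

```python
import pandas as pd

# Original population, keyed by the original worker (i) and firm (j) ids
population = pd.DataFrame({
    'i': [1001, 1001, 2050, 3999],
    'j': ['A7', 'B2', 'B2', 'A7'],
    'wage': [1.0, 1.2, 0.8, 1.1],
})

# Stand-in for the estimation output (hypothetical values).
# ASSUMPTION: the column names below are illustrative, not confirmed pytwoway output.
# Worker 3999 is absent, as if dropped when restricting to the connected set.
estimates = pd.DataFrame({
    'original_i': [1001, 1001, 2050],
    'original_j': ['A7', 'B2', 'B2'],
    'alpha_hat': [0.3, 0.3, -0.1],
    'psi_hat': [0.5, 0.2, 0.2],
})

# One row per worker and one row per firm, renamed to match the population keys
worker_fe = (estimates[['original_i', 'alpha_hat']]
             .drop_duplicates()
             .rename(columns={'original_i': 'i'}))
firm_fe = (estimates[['original_j', 'psi_hat']]
           .drop_duplicates()
           .rename(columns={'original_j': 'j'}))

# Left joins keep every row of the population; validate='m:1' raises an error
# if a worker or firm appears with more than one fixed effect
merged = (population
          .merge(worker_fe, on='i', how='left', validate='m:1')
          .merge(firm_fe, on='j', how='left', validate='m:1'))
```

Workers or firms that were dropped during cleaning end up with missing values in merged, which makes them easy to identify afterwards.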

I would also recommend setting the following for better performance:

bdf = bdf.clean_data({'data_validity': False})

Be careful, though: this approach isn't designed to work if you manipulate or reformat the data after cleaning (for instance, converting from long to event study format), so verify that it works properly in your case before committing to it.
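
One way to run such a check is to confirm that the contiguous ids and the original ids are still in one-to-one correspondence. The helper below is a hypothetical sketch in plain pandas, not part of pytwoway or BipartitePandas, and the column names j and original_j are assumptions.

```python
import pandas as pd

def check_id_mapping(df, clean_col, orig_col):
    """Return True if clean_col and orig_col are in one-to-one correspondence."""
    # Keep each distinct (clean id, original id) pair once
    pairs = df[[clean_col, orig_col]].drop_duplicates()
    # One-to-one means no clean id and no original id appears in two different pairs
    return bool(pairs[clean_col].is_unique and pairs[orig_col].is_unique)

# Consistent mapping: firm 0 <-> 'A7', firm 1 <-> 'B2'
good = pd.DataFrame({'j': [0, 1, 1, 0],
                     'original_j': ['A7', 'B2', 'B2', 'A7']})

# Broken mapping: firm 1 points to two different original ids
bad = pd.DataFrame({'j': [0, 1, 1, 0],
                    'original_j': ['A7', 'B2', 'C3', 'A7']})

good_ok = check_id_mapping(good, 'j', 'original_j')
bad_ok = check_id_mapping(bad, 'j', 'original_j')
```

Running the same check on both the worker and firm columns after each transformation step gives an early warning if the ids have drifted out of sync.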