tlamadon / pytwoway

Two way models in python
MIT License
23 stars 6 forks source link

Fitting the FE and CRE estimators fail on the simulated data example #1

Closed nreigl closed 3 years ago

nreigl commented 3 years ago

Hi,

in a fresh virtualenv with Python 3.9.6 installed I try to run the code for the simulated example provided in the online documentation.

import pytwoway as tw
from bipartitepandas import SimBipartite
# Create SimTwoWay object
sbp_net = SimBipartite()
# Generate data
sim_data = sbp_net.sim_network()
# Below is identical to first example, except we are now using the simulated data
# Create TwoWay object
tw_net = tw.TwoWay(sim_data)
# Fit the FE estimators:
tw_net.fit_fe()

Expected behaviour

On simulated data, I can fit the FE estimator without exceptions being raised.

Observed behaviour

The function tw_net.fit_fe() fails with the following error

Traceback (most recent call last):

 File "/Users/nicolasreigl/Documents/twoways/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'm'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nicolasreigl/Documents/twoways/.venv/lib/python3.9/site-packages/pytwoway/twoway.py", line 147, in fit_fe
    fe_solver.fit_1()
  File "/Users/nicolasreigl/Documents/twoways/.venv/lib/python3.9/site-packages/pytwoway/fe.py", line 193, in fit_1
    self.__prep_vars() # Prepare data
  File "/Users/nicolasreigl/Documents/twoways/.venv/lib/python3.9/site-packages/pytwoway/fe.py", line 237, in __prep_vars
    self.res['n_movers'] = len(np.unique(self.adata[self.adata['m'] == 1]['i']))
  File "/Users/nicolasreigl/Documents/twoways/.venv/lib/python3.9/site-packages/pandas/core/frame.py", line 3455, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/nicolasreigl/Documents/twoways/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'm'

A similar error is raised when trying to fit the CRE estimator with tw_net.fit_cre().

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>
  File "/Users/nicolasreigl/Documents/twoways/.venv/lib/python3.9/site-packages/pytwoway/twoway.py", line 177, in fit_cre
    cre_solver = tw.CREEstimator(self.data.get_es().get_cs(), user_cre)
  File "/Users/nicolasreigl/Documents/twoways/.venv/lib/python3.9/site-packages/bipartitepandas/bipartitelongbase.py", line 46, in get_es
    self.gen_m()
  File "/Users/nicolasreigl/Documents/twoways/.venv/lib/python3.9/site-packages/bipartitepandas/bipartitebase.py", line 875, in gen_m
    sorted_cols = sorted(frame.columns, key=col_order)
ValueError: 'k' is not in list

Requirements

attrs==21.2.0
bipartitepandas==0.1.10
ConfigArgParse==1.5.2
cycler==0.10.0
Cython==0.29.24
iniconfig==1.1.1
joblib==1.0.1
kiwisolver==1.3.1
matplotlib==3.4.3
networkx==2.6.2
numpy==1.21.2
packaging==21.0
pandas==1.3.2
patsy==0.5.1
Pillow==8.3.1
pluggy==0.13.1
py==1.10.0
pyamg==4.1.0
pyparsing==2.4.7
pytest==6.2.4
python-dateutil==2.8.2
pytwoway==0.1.9
pytz==2021.1
qpsolvers==1.6.1
quadprog==0.1.8
scikit-learn==0.24.2
scipy==1.7.1
six==1.16.0
statsmodels==0.12.2
threadpoolctl==2.2.0
toml==0.10.2
tqdm==4.62.1
tlamadon commented 3 years ago

Thanks for posting this, we are going to take a look. This is a bit surprising since the circle-ci tests are passing. We have not run the test with python 3.9 though. Did you install with pip?

nreigl commented 3 years ago

Yes, I installed with pip.

adamoppenheimer commented 3 years ago

Hi Nicolas, thank you for pointing out these issues!

To give a short summary of what went wrong, I made mistakes in the documentation and forgot to include the data cleaning step, and the clustering step for the CRE estimator. I also forgot to include code to generate the 'm' column indicating whether the observation belongs to a mover or a stayer (this only applies in the case where you are estimating on long-form non-collapsed data).

I fixed these issues, uploaded a new version to Pip, and updated the documentation. I included the updated code you should run below. Please let us know if this works!

Also note that this code automatically collapses the data (which the previous code didn't). There is really no need to be working with the uncollapsed data as the results should be (almost) numerically identical with lots of computational benefits. The documentation does explain how to run the estimators on non-collapsed data, and feel free to ask if you want more details on running code for that. The notebook here goes into much more detail on all the options you can include (like collapsing the data, etc.).

Below is a more detailed explanation:


The first issue is because the example code is running the estimators on non-collapsed long-format data (I also forgot to include the data cleaning step, although that wouldn't have caused any problems with this specific example since the simulated data is already clean).

The estimator requires the 'm' column to mark observations as belonging to movers or stayers. In previous versions of the package, the FE estimator used to require event study format data but I recently changed the code so it uses long-format data.

BipartitePandas automatically creates the 'm' column when (a) data is collapsed or (b) data is converted to event study, so in previous versions the 'm' column was automatically created no matter how you ran the code.

Now, since the data used in the example code is just the long form, non-collapsed data, there is no 'm' column. There is a function, .gen_m(), that will generate the 'm' column. I updated the code to automatically include this.


The second issue was an issue with the documentation, thank you for pointing that out. I forgot to include 2 steps: first, the data cleaning step, and second the clustering step.


I just uploaded a new version to Pip, version 1.10. Make sure to update it, and try running the following code:

import pytwoway as tw
from bipartitepandas import SimBipartite
# Create SimTwoWay object
sbp_net = SimBipartite()
# Generate data
sim_data = sbp_net.sim_network()
# Below is identical to first example, except we are now using the simulated data
# Create TwoWay object
tw_net = tw.TwoWay(sim_data)
# Clean data
tw_net.prep_data()
# Fit the FE estimators:
tw_net.fit_fe()

Then for CRE, add the following:

# Cluster to prepare for CRE
tw_net.cluster()
# Fit the CRE estimator
tw_net.fit_cre()

Again, thank you for pointing out these issues! Please let us know if it works now.

adamoppenheimer commented 3 years ago

I'll also leave this example code for running the heteroskedastic correction:

import pytwoway as tw
from bipartitepandas import SimBipartite
# Create SimTwoWay object
sbp_net = SimBipartite()
# Generate data
sim_data = sbp_net.sim_network()
# Below is identical to first example, except we are now using the simulated data
# Create TwoWay object
tw_net = tw.TwoWay(sim_data)
# Clean data
tw_net.prep_data(he=True)
# Fit the FE estimators:
tw_net.fit_fe({'he': True})
adamoppenheimer commented 3 years ago

I just uploaded some more small changes, the most recent version should now be version 1.11.

nreigl commented 3 years ago

I have updated the package via pip and tried to run the provided example code. Everything seems to work now.

I might have run into another problem for the Python code example but I need to test some things first and would open another issue separately.

Thank you for the fast response and fix.