Assertion Error while creating a model using statsmodel and pandas dataframe

shreya2110 commented 3 years ago

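CODE:

import statsmodels.api as sm
import statsmodels.formula.api as smf

formula = 'numeric_field ~ C(boolean_field_a):C(object_field) + C(boolean_field_b) + C(boolean_field_a):C(object_field):C(boolean_field_b)'

# test_df is the pandas DataFrame described in the NOTE below
model = smf.glm(formula=formula, data=test_df,
                family=sm.families.Poisson()).fit()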

ERROR:

  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/statsmodels/base/model.py", line 159, in from_formula
    missing = missing)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/statsmodels/formula/formulatools.py", line 65, in handle_formula_data
    NA_action = na_action)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/highlevel.py", line 62, in _try_incr_builders
    formula_like = ModelDesc.from_formula(formula_like)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/desc.py", line 164, in from_formula
    tree = parse_formula(tree_or_string)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/parse_formula.py", line 148, in parse_formula
    _atomic_token_types)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/infix_parser.py", line 210, in infix_parse
    for token in token_source:
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/parse_formula.py", line 94, in _tokenize_formula
    yield _read_python_expr(it, end_tokens)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/parse_formula.py", line 44, in _read_python_expr
    for pytype, token_string, origin in it:
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/util.py", line 332, in next
    return six.advance_iterator(self._it)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/tokens.py", line 35, in python_tokenize
    assert pytype not in (tokenize.NL, tokenize.NEWLINE)
AssertionError

NOTE: When I use a pandas DataFrame created from a CSV, the formula works fine and I get a model output (local data size: 50k records). When I use a pandas DataFrame created by running a Google BigQuery query, the model is not created and I get the above error (actual data size: 3.8 million records). object_field has 26 string values in it.

bashtage commented 3 years ago

What is the difference between your two DataFrames? You could use pandas' assert_frame_equal to check that the CSV load and the BQ load are the same, or to identify the differences.
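A minimal sketch of such a comparison, in case it helps. The frames csv_df and bq_df and the helper frame_difference are hypothetical names, the columns only mirror the formula above, and the dtype difference is an assumption made up for illustration, not something known about the real data:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical stand-ins for the CSV-loaded and BigQuery-loaded frames.
csv_df = pd.DataFrame({
    "numeric_field": [1, 2, 3],
    "boolean_field_a": [True, False, True],
    "object_field": ["a", "b", "c"],
})
bq_df = csv_df.copy()
# Assumed difference for illustration: same values, different dtype.
bq_df["numeric_field"] = bq_df["numeric_field"].astype("float64")


def frame_difference(left, right):
    """Return None if the frames match, else the assertion message."""
    try:
        assert_frame_equal(left, right)
    except AssertionError as exc:
        return str(exc)
    return None


# Columns whose dtypes disagree between the two frames.
dtype_mismatches = [
    col for col in csv_df.columns if csv_df[col].dtype != bq_df[col].dtype
]

message = frame_difference(csv_df, bq_df)
print(dtype_mismatches)
print(message)
```

assert_frame_equal checks dtypes by default, so it reports a mismatch here even though the values are numerically equal; the per-column dtype list narrows down which column is responsible.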

shreya2110 commented 3 years ago

@bashtage The test DataFrame is created from a CSV that is a sample extract of the same GBQ table that is loaded in the main project.

bashtage commented 3 years ago

If you extract the same rows from the BQ DataFrame as you have in the CSV and run assert_frame_equal on the two, are they equal?

shreya2110 commented 3 years ago

Issue resolved after updating patsy to 0.5.1.
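For anyone hitting the same traceback, a quick way to see which patsy version a given environment has is sketched below. It relies on the standard-library importlib.metadata, which needs Python 3.8 or later (newer than the 3.7 in the traceback), and installed_version is a hypothetical helper name:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


patsy_version = installed_version("patsy")
if patsy_version is None:
    print("patsy is not installed in this environment")
else:
    print("patsy version:", patsy_version)
```

Running this in each environment where the model is fitted shows whether they are all on the same patsy release.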

pydata / patsy

Assertion Error while creating a model using statsmodel and pandas dataframe #182