Closed shreya2110 closed 3 years ago
What is the difference between your two DataFrames? You could use pandas' assert_frame_equal to check that the CSV load and the BQ load are the same, or to identify differences.
@bashtage The test dataframe is created off a csv which is a sample extract of the same GBQ table that is being loaded in the main project
If you extract the same rows from the BQ DataFrame as you have in the CSV and run assert_frame_equal, are they equal?
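The comparison suggested above could be sketched roughly as follows. The data is made up, the `id` key column is a hypothetical way to pick "the same rows", and the two frames deliberately differ in one dtype so the report has something to show; the column names are borrowed from the formula in the issue.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical stand-ins: csv_df plays the CSV sample, bq_df the BigQuery result.
csv_df = pd.DataFrame({
    "id": [1, 2],
    "numeric_field": [0.5, 1.5],
    "boolean_field_a": [True, False],
    "object_field": ["x", "y"],
})
bq_df = pd.DataFrame({
    "id": [3, 1, 2],
    "numeric_field": [2.5, 0.5, 1.5],
    # Same values as the CSV, but stored as generic Python objects.
    "boolean_field_a": pd.Series([True, True, False], dtype=object),
    "object_field": ["z", "x", "y"],
})

# Keep only the BQ rows that also appear in the CSV sample, in the same order.
subset = (
    bq_df[bq_df["id"].isin(csv_df["id"])]
    .sort_values("id")
    .reset_index(drop=True)
)
expected = csv_df.sort_values("id").reset_index(drop=True)

try:
    assert_frame_equal(subset, expected)
    frames_match = True
except AssertionError as exc:
    frames_match = False
    print(exc)  # describes the first difference pandas found

# List every column whose dtype differs between the two frames.
dtype_diff = {
    col: (subset[col].dtype, expected[col].dtype)
    for col in expected.columns
    if subset[col].dtype != expected[col].dtype
}
print(dtype_diff)
```

If the frames only differ in dtype, that already narrows down where the two loads diverge.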
Issue resolved after updating patsy to 0.5.1.
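Since the fix turned out to be a package upgrade, it may help to print the installed versions in each environment (for example locally and on the cluster) before comparing data. This is only a sketch: `installed_version` is a hypothetical helper, and `importlib.metadata` assumes Python 3.8 or newer (on Python 3.7, as in the traceback, `patsy.__version__` gives the same information).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Run this in every environment where the model is fitted and compare.
for name in ("patsy", "statsmodels", "pandas"):
    print(name, installed_version(name))
```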
CODE: formula = 'numeric_field ~ C(boolean_field_a):C(object_field) + C(boolean_field_b) + C(boolean_field_a):C(object_field):C(boolean_field_b)'
ERROR (from the Dataproc job log):
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/statsmodels/base/model.py", line 159, in from_formula
    missing = missing)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/statsmodels/formula/formulatools.py", line 65, in handle_formula_data
    NA_action = na_action)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/highlevel.py", line 62, in _try_incr_builders
    formula_like = ModelDesc.from_formula(formula_like)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/desc.py", line 164, in from_formula
    tree = parse_formula(tree_or_string)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/parse_formula.py", line 148, in parse_formula
    _atomic_token_types)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/infix_parser.py", line 210, in infix_parse
    for token in token_source:
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/parse_formula.py", line 94, in _tokenize_formula
    yield _read_python_expr(it, end_tokens)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/parse_formula.py", line 44, in _read_python_expr
    for pytype, token_string, origin in it:
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/util.py", line 332, in next
    return six.advance_iterator(self._it)
  File "/tmp/tdc-run--my_project/lib/python3.7/site-packages/patsy/tokens.py", line 35, in python_tokenize
    assert pytype not in (tokenize.NL, tokenize.NEWLINE)
AssertionError
NOTE: When I use a pandas DataFrame created from a CSV, the formula works fine and I get a model output (local data size: 50k records). When I use a pandas DataFrame created by running a Google BigQuery query, the model is not created and I get the above error (actual data size: 3.8 million records). object_field has 26 string values in it.
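The assertion at the bottom of the traceback, `assert pytype not in (tokenize.NL, tokenize.NEWLINE)`, fails when the tokenizer hands patsy a NL or NEWLINE token while it reads the formula string. The sketch below uses only the standard library, not patsy's own code, to list which token types the running interpreter produces for the formula; `token_type_names` is a hypothetical helper, and whether a NEWLINE token shows up depends on the Python version, so no particular output is claimed here.

```python
import io
import tokenize

formula = (
    "numeric_field ~ C(boolean_field_a):C(object_field) + C(boolean_field_b)"
    " + C(boolean_field_a):C(object_field):C(boolean_field_b)"
)


def token_type_names(source):
    """Return the stdlib token type name for every token in `source`."""
    readline = io.StringIO(source).readline
    return [
        tokenize.tok_name[tok.type]
        for tok in tokenize.generate_tokens(readline)
    ]


names = token_type_names(formula)
# Token types that the assertion in the traceback would reject.
newline_like = [n for n in names if n in ("NL", "NEWLINE")]
print(names)
print(newline_like)
```

Running this in both environments shows whether the interpreters tokenize the same formula differently, which is independent of the data loaded from CSV or BigQuery.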