modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

Multiple groupby operations in succession on a dataframe generated in-memory fail with ValueError('Length Mismatch') #4287

Open suhailrehman opened 2 years ago

suhailrehman commented 2 years ago

### System information

```python
import modin.pandas as pd  # assumed import: the report concerns Modin's pandas API
from faker import Faker

# Column Label -> Faker Format Mapping
base_artifact_schema = {
    "Pqt4j__isbn10": "isbn10",
    "lrVJa__ean8": "ean8",
    "goWHd__ssn": "ssn",
    "HjYM3__ipv4_network_class": "ipv4_network_class",
    "wvamF__postalcode_plus4": "postalcode_plus4",
    "LqReR__last_name_female": "last_name_female",
    "fESWJ__pyint": "pyint",
    "oGXTu__pyint": "pyint",
    "l8lea__chrome": "chrome",
    "EtC7T__uri": "uri",
}


# Stripped down version of randomized table generator for bug reproduction
def generate_table(num_rows=100, schema=None):
    faker = Faker()

    series_list = []
    label_list = []

    # Build one Series per schema entry, filled with Faker-generated values
    for label, column in schema.items():
        series_list.append(pd.Series((faker.format(column) for _ in range(num_rows))))
        label_list.append(label)

    return pd.concat(series_list, axis=1, keys=label_list)


# The following code operates on the generated, in-memory version of the dataframe artifact_0
# The second groupby fails with "ValueError('Length mismatch: Expected axis has 0 elements, new values have 3 elements')"
artifact_0 = generate_table(num_rows=1000, schema=base_artifact_schema)
artifact_1 = artifact_0[["HjYM3__ipv4_network_class", "fESWJ__pyint", "oGXTu__pyint"]].groupby("HjYM3__ipv4_network_class").count().reset_index()
artifact_2 = artifact_1[["HjYM3__ipv4_network_class", "fESWJ__pyint", "oGXTu__pyint"]].groupby("HjYM3__ipv4_network_class").sum().reset_index()
print(artifact_2)
```


### Describe the problem
A dataframe generated in memory using `pd.concat()` with a list of `pd.Series` objects cannot perform these two groupby operations in succession, failing with a `ValueError('Length mismatch: Expected axis has 0 elements, new values have 3 elements')`.
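For reference, here is a minimal sketch of the expected behaviour of the same pattern in plain pandas rather than Modin. A small hand-written table stands in for the Faker output; the column names `net_class`, `x`, and `y` and all values are made up for illustration and are not part of the original report.

```python
import pandas as pd  # plain pandas, used only as a reference for the expected result

# Hand-written stand-in for the Faker-generated columns (illustrative values).
series_list = [
    pd.Series(["a", "b", "a", "c"]),
    pd.Series([1, 2, 3, 4]),
    pd.Series([10, 20, 30, 40]),
]
labels = ["net_class", "x", "y"]

# Same construction as in the report: concat a list of Series along axis=1.
frame = pd.concat(series_list, axis=1, keys=labels)

# First groupby: count rows per group, then restore the key as a column.
counts = frame[labels].groupby("net_class").count().reset_index()

# Second groupby on the result of the first: sum the counts per group.
sums = counts[labels].groupby("net_class").sum().reset_index()

print(sums)
```

In plain pandas both steps complete and the result keeps the three columns, which is the behaviour the Modin code would be expected to match.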

If the dataframe is written out to CSV and re-loaded, the operation succeeds:

```python
# The following version, which writes the dataframe to disk and reloads it, succeeds
artifact_0.to_csv('/tmp/artifact_0.csv')
artifact_0 = pd.read_csv('/tmp/artifact_0.csv', index_col=0)
artifact_1 = artifact_0[["HjYM3__ipv4_network_class", "fESWJ__pyint", "oGXTu__pyint"]].groupby("HjYM3__ipv4_network_class").count().reset_index()
artifact_2 = artifact_1[["HjYM3__ipv4_network_class", "fESWJ__pyint", "oGXTu__pyint"]].groupby("HjYM3__ipv4_network_class").sum().reset_index()
print(artifact_2)
```

### Source code / logs

Exact traceback attached: Traceback

Virtual Environment: pip freeze output

mvashishtha commented 2 years ago

@suhailrehman thank you for the detailed report. I'm unable to use your code to reproduce the bug on my computer, because I don't have `base_artifact_schema`. Could you please send a snippet to generate that as well?

suhailrehman commented 2 years ago

@mvashishtha sorry for the variable mismatch. It should be fixed now (edited in place).

mvashishtha commented 2 years ago

@suhailrehman thank you. I can reproduce the bug now on the latest Modin source. I'll do some triage now and if I can't make a quick fix, we'll try to fix the bug as soon as possible.

mvashishtha commented 2 years ago

@modin This looks like a bug with lazy metadata propagation. If I remove this condition so that `groupby_reduce` always operates on an updated index, the bug goes away. I guess we need to fix that condition.
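As a rough illustration of how a lazily cached index can go stale, here is a toy sketch in plain Python. It is not Modin's code and makes no claim about where the actual condition lives: `LazyFrame`, `_index_cache`, `filter_stale`, `filter_fresh`, and `labelled_rows` are hypothetical names invented for this example.

```python
class LazyFrame:
    """Toy container (hypothetical, not a Modin class) that caches its row labels."""

    def __init__(self, rows, index_cache=None):
        self._rows = list(rows)
        self._index_cache = index_cache  # None means "not computed yet"

    @property
    def index(self):
        # Compute the labels on first access only, then reuse the cached copy.
        if self._index_cache is None:
            self._index_cache = list(range(len(self._rows)))
        return self._index_cache

    def filter_stale(self, keep):
        # Buggy variant: forwards the parent's cached labels although rows changed.
        rows = [r for r in self._rows if keep(r)]
        return LazyFrame(rows, index_cache=self._index_cache)

    def filter_fresh(self, keep):
        # Fixed variant: drops the cache so labels are recomputed from the new rows.
        rows = [r for r in self._rows if keep(r)]
        return LazyFrame(rows, index_cache=None)

    def labelled_rows(self):
        # Pairing labels with data fails loudly when the cached labels are stale.
        if len(self.index) != len(self._rows):
            raise ValueError(
                f"Length mismatch: index has {len(self.index)} elements, "
                f"data has {len(self._rows)} elements"
            )
        return list(zip(self.index, self._rows))


def is_even(value):
    return value % 2 == 0


frame = LazyFrame([1, 2, 3, 4])
frame.index  # force the cache to be populated before filtering

fresh = frame.filter_fresh(is_even)
print(fresh.labelled_rows())

stale = frame.filter_stale(is_even)
try:
    stale.labelled_rows()
except ValueError as exc:
    print("stale cache:", exc)
```

In this toy, the mismatch only appears when the parent's labels were already cached before the operation ran. Always recomputing or refreshing the index before the operation, as in `filter_fresh`, avoids it at the cost of extra work.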