Open suhailrehman opened 2 years ago
@suhailrehman thank you for the detailed report. I'm unable to use your code to reproduce the bug on my computer, because I don't have base_artifact_schema. Could you please send a snippet to generate that as well?
@mvashishtha sorry for the variable mismatch. It should be fixed now (edited in place)
@suhailrehman thank you. I can reproduce the bug now on the latest Modin source. I'll do some triage now and if I can't make a quick fix, we'll try to fix the bug as soon as possible.
@modin This looks like a bug with lazy metadata propagation. If I remove this condition so that groupby_reduce always operates on an updated index, I can fix the bug. I guess we need to fix that condition.
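To illustrate the kind of failure described above, here is a toy sketch in plain Python. It is not Modin's code: ToyFrame, sync_labels, relabel, and the always_sync flag are hypothetical names invented for this example. It only shows how a cached set of axis labels that lags behind the data can produce this sort of length mismatch, and how refreshing the cache unconditionally avoids it.

```python
class ToyFrame:
    """Toy stand-in (hypothetical) for a frame whose axis labels are cached lazily."""

    def __init__(self, data):
        self.data = data          # column label -> list of values
        self.cached_labels = []   # stale cache: not yet synced with the data

    def sync_labels(self):
        """Recompute the cached labels from the actual data."""
        self.cached_labels = list(self.data)


def relabel(frame, new_labels, always_sync):
    """Assign new column labels, optionally refreshing the cache first."""
    if always_sync:
        frame.sync_labels()
    expected = len(frame.cached_labels)
    if expected != len(new_labels):
        raise ValueError(
            f"Length mismatch: Expected axis has {expected} elements, "
            f"new values have {len(new_labels)} elements"
        )
    frame.data = dict(zip(new_labels, frame.data.values()))
    frame.cached_labels = list(new_labels)
    return frame


def demo():
    """Run the relabel once with a stale cache and once with a forced sync."""
    labels = ["key", "x", "y"]
    stale = ToyFrame({"a": [1], "b": [2], "c": [3]})
    try:
        relabel(stale, labels, always_sync=False)
        error = None
    except ValueError as exc:
        error = str(exc)
    fixed = relabel(ToyFrame({"a": [1], "b": [2], "c": [3]}), labels, always_sync=True)
    return error, fixed.cached_labels


error, fixed_labels = demo()
print(error)
print(fixed_labels)
```

In this toy model the stale path raises because the cache still reports zero labels while the data has three columns. Whether Modin's actual condition behaves the same way would need to be checked against its source.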
System information
Modin version (modin.__version__): 0.13.2
Column Label -> Faker Format Mapping
base_artifact_schema = {
    "Pqt4jisbn10": "isbn10",
    "lrVJaean8": "ean8",
    "goWHdssn": "ssn",
    "HjYM3ipv4_network_class": "ipv4_network_class",
    "wvamFpostalcode_plus4": "postalcode_plus4",
    "LqReRlast_name_female": "last_name_female",
    "fESWJpyint": "pyint",
    "oGXTupyint": "pyint",
    "l8leachrome": "chrome",
    "EtC7Turi": "uri",
}
Stripped down version of randomized table generator for bug reproduction
def generate_table(num_rows=100, schema=None):
    faker = Faker()
The following code operates on the generated, in-memory version of the dataframe artifact_0
The second groupby fails with "ValueError('Length mismatch: Expected axis has 0 elements, new values have 3 elements')"
artifact_0 = generate_table(num_rows=1000, schema=base_artifact_schema)
artifact_1 = artifact_0[["HjYM3ipv4_network_class", "fESWJpyint", "oGXTupyint"]].groupby("HjYM3ipv4_network_class").count().reset_index()
artifact_2 = artifact_1[["HjYM3ipv4_network_class", "fESWJpyint", "oGXTupyint"]].groupby("HjYM3ipv4_network_class").sum().reset_index()
print(artifact_2)
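The full body of generate_table is not included here, so for anyone who wants input data of roughly the same shape, the following is a rough stand-in that uses only the standard library. It is an assumption-laden sketch, not the original generator: generate_columns and expected_counts are hypothetical helpers, only the three columns used by the groupby calls are produced, the network class is assumed to be one of "a", "b", or "c", and the integer columns are assumed to lie between 0 and 9999.

```python
import random
from collections import Counter


def generate_columns(num_rows=100, seed=0):
    """Hypothetical stand-in for generate_table: returns column label -> list of values."""
    rng = random.Random(seed)
    return {
        # Assumption: the network class is one of "a", "b", "c".
        "HjYM3ipv4_network_class": [rng.choice("abc") for _ in range(num_rows)],
        # Assumption: integer columns are drawn from 0..9999.
        "fESWJpyint": [rng.randint(0, 9999) for _ in range(num_rows)],
        "oGXTupyint": [rng.randint(0, 9999) for _ in range(num_rows)],
    }


def expected_counts(columns):
    """Rows per network class, which the first groupby().count() should report."""
    return Counter(columns["HjYM3ipv4_network_class"])


columns = generate_columns(num_rows=1000)
counts = expected_counts(columns)
print(dict(counts))
```

The returned dictionary can be passed to a DataFrame constructor (for example pandas.DataFrame(columns)) to get a frame to run the two groupby calls against. Because the first groupby leaves one row per network class, the sum in the second groupby should return those same per-class counts unchanged.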
Source code / logs
Exact traceback attached: Traceback
Virtual Environment: pip freeze output