Closed skrawcz closed 1 year ago
The "schema" of the DataStream after the aggregation is missing the mean columns and is thus incorrect.
Run the hello world example:
    from pyquokka.df import QuokkaContext

    qc = QuokkaContext()
    lineitem = qc.read_csv("lineitem.tbl.named", sep="|", has_header=True)
    d = lineitem.filter("l_shipdate <= date '1998-12-01' - interval '90' day")
    print(len(d.schema))
    d = d.with_column(
        "disc_price",
        lambda x: x["l_extendedprice"] * (1 - x["l_discount"]),
        required_columns={"l_extendedprice", "l_discount"},
    )
    print(len(d.schema))
    d = d.with_column(
        "charge",
        lambda x: x["l_extendedprice"] * (1 - x["l_discount"]) * (1 + x["l_tax"]),
        required_columns={"l_extendedprice", "l_discount", "l_tax"},
    )
    print(len(d.schema))
    f = d.groupby(
        ["l_returnflag", "l_linestatus"],
        orderby=["l_returnflag", "l_linestatus"],
    ).agg({
        "l_quantity": ["sum", "avg"],
        "l_extendedprice": ["sum", "avg"],
        "disc_price": "sum",
        "charge": "sum",
        "l_discount": "avg",
        "*": "count",
    })
    print(f.schema)
    print(len(f.schema))  # <--- this is incorrect and only has 8 columns
    df = f.collect()
    print(df.columns)
    print(len(df.columns))  # <--- this has 10 columns and is correct
I expect that when I get a DataStream, its "schema" reflects what it would actually produce.
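To make the expectation concrete, here is a rough sketch of how the expected column list could be worked out from the groupby keys and the agg spec alone, without running the query. The helper `expected_agg_columns` is hypothetical and not part of pyquokka, and the naming rule (`<column>_<agg>`, with `"*": "count"` becoming `count`) is an assumption for illustration; the real output names may differ. Only the column count is taken from the report above.

```python
def expected_agg_columns(keys, aggs):
    """Return the column names a grouped aggregation would be expected to produce.

    Assumption (not confirmed by pyquokka): each aggregate is named
    "<column>_<agg>", and the special "*" column yields just "<agg>".
    """
    columns = list(keys)  # the groupby keys come first
    for column, funcs in aggs.items():
        # A single aggregate may be given as a bare string; normalize to a list.
        if isinstance(funcs, str):
            funcs = [funcs]
        for func in funcs:
            columns.append(func if column == "*" else f"{column}_{func}")
    return columns


keys = ["l_returnflag", "l_linestatus"]
aggs = {
    "l_quantity": ["sum", "avg"],
    "l_extendedprice": ["sum", "avg"],
    "disc_price": "sum",
    "charge": "sum",
    "l_discount": "avg",
    "*": "count",
}

expected = expected_agg_columns(keys, aggs)
print(len(expected))  # 10, matching len(df.columns) after collect()
```

A check along the lines of `len(f.schema) == len(expected)` is the kind of assertion that would catch the missing avg columns.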
fixed with commit 73d4d3daa7fca2fdee6c4d82a638f95a7ee4581a