marsupialtail / quokka

Making data lake work for time series
https://marsupialtail.github.io/quokka/
Apache License 2.0
1.14k stars, 60 forks

exceptions.ShapeError: 17 column names provided for a dataframe of width 16 #48

Closed (Jobhdez closed 1 year ago)

Jobhdez commented 1 year ago

I am following this tutorial and I get the following exception:

   exceptions.ShapeError: 17 column names provided for a dataframe of width 16

Calling lineitem.count() throws the same exception:

   exceptions.ShapeError: 17 column names provided for a dataframe of width 16

and here is the code:

from pyquokka.df import * 
qc = QuokkaContext()

disk_path = "Downloads/demo-tpch/"
# the last column is called NULL, because the TPC-H data generator likes to put a | at the end of each row, making it appear as if there is a final column
# with no values. Don't worry, we can drop this column. 
lineitem_scheme = ["l_orderkey","l_partkey","l_suppkey","l_linenumber","l_quantity","l_extendedprice", "l_discount","l_tax","l_returnflag","l_linestatus","l_shipdate","l_commitdate","l_receiptdate","l_shipinstruct","l_shipmode","l_comment", "null"]
#lineitem = qc.read_csv(disk_path + "lineitem.tbl", sep="|", has_header=True)
lineitem = qc.read_csv(disk_path + "lineitem.tbl", lineitem_scheme, sep="|")
orders = qc.read_csv(disk_path + "orders.tbl", sep="|", has_header=True)
customer = qc.read_csv(disk_path + "customer.tbl", sep="|", has_header=True)
part = qc.read_csv(disk_path + "part.tbl", sep="|", has_header=True)
supplier = qc.read_csv(disk_path + "supplier.tbl", sep="|", has_header=True)
partsupp = qc.read_csv(disk_path + "partsupp.tbl", sep="|", has_header=True)
nation = qc.read_csv(disk_path + "nation.tbl", sep="|", has_header=True)
region = qc.read_csv(disk_path + "region.tbl", sep="|", has_header=True)

lineitem.count()

I am using Arch Linux and redis-7.0.10-1.
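
The error message reads as if each row of lineitem.tbl has 16 fields while lineitem_scheme lists 17 names. That would be the case if this copy of the file was written without the trailing | the code comment describes. This is an assumption; the thread does not say how the issue was resolved. A minimal sketch for checking it with only the standard library follows. The helper names count_fields and match_schema are hypothetical and not part of pyquokka.

```python
def count_fields(path, sep="|"):
    """Count the sep-delimited fields in the first line of a text file."""
    with open(path, "r", encoding="utf-8") as f:
        first_line = f.readline().rstrip("\r\n")
    # A trailing separator yields one extra, empty field: "a|b|" -> ["a", "b", ""]
    return len(first_line.split(sep))


def match_schema(schema, width, filler="null"):
    """Return a copy of schema, dropping a trailing filler name
    when the file is exactly one column narrower than the schema."""
    if len(schema) == width:
        return list(schema)
    if len(schema) == width + 1 and schema[-1] == filler:
        return list(schema[:-1])
    raise ValueError(f"{len(schema)} column names for a file of width {width}")


# Hypothetical usage with the names from the snippet above:
# width = count_fields(disk_path + "lineitem.tbl")
# lineitem = qc.read_csv(disk_path + "lineitem.tbl",
#                        match_schema(lineitem_scheme, width), sep="|")
```

If count_fields reports 16 for lineitem.tbl, the file has no empty final column, so the "null" entry in the schema has nothing to bind to.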

marsupialtail commented 1 year ago

Resolved