gregrahn / tpcds-kit

TPC-DS benchmark kit with some modifications/fixes
317 stars 200 forks source link

1 Column missed in s_web_returns #22

Closed eisenwang closed 5 years ago

eisenwang commented 7 years ago

Dear all, When I'm preparing the refresh data, I created the source table with tpcds_source.sql. In this sql file, there's 1 table named s_web_returns. The statement is -- create table s_web_returns ( wret_web_site_id char(16) , wret_order_id integer not null, wret_line_number integer not null, wret_item_id char(16) not null, wret_return_customer_id char(16) , wret_refund_customer_id char(16) , wret_return_date char(10) , wret_return_time char(10) , wret_return_qty integer , wret_return_amt numeric(7,2) , wret_return_tax numeric(7,2) , wret_return_fee numeric(7,2) , wret_return_ship_cost numeric(7,2) , wret_refunded_cash numeric(7,2) , wret_reversed_charge numeric(7,2) , wret_account_credit numeric(7,2) , wret_reason_id char(16) );

But while creating the fact view wrv on it, we can see an error -- Column 'wret_web_page_id' not found, and by check for wrv's statement --

CREATE VIEW wrv AS SELECT d_date_sk wr_return_date_sk ,t_time_sk wr_return_time_sk ,i_item_sk wr_item_sk ,c1.c_customer_sk wr_refunded_customer_sk ,c1.c_current_cdemo_sk wr_refunded_cdemo_sk ,c1.c_current_hdemo_sk wr_refunded_hdemo_sk ,c1.c_current_addr_sk wr_refunded_addr_sk ,c2.c_customer_sk wr_returning_customer_sk ,c2.c_current_cdemo_sk wr_returning_cdemo_sk ,c2.c_current_hdemo_sk wr_returning_hdemo_sk ,c2.c_current_addr_sk wr_returing_addr_sk ,wp_web_page_sk wr_web_page_sk ,r_reason_sk wr_reason_sk ,wret_order_id wr_order_number ,wret_return_qty wr_return_quantity ,wret_return_amt wr_return_amt ,wret_return_tax wr_return_tax ,wret_return_amt + wret_return_tax AS wr_return_amt_inc_tax ,wret_return_fee wr_fee ,wret_return_ship_cost wr_return_ship_cost ,wret_refunded_cash wr_refunded_cash ,wret_reversed_charge wr_reversed_charge ,wret_account_credit wr_account_credit ,wret_return_amt+wret_return_tax+wret_return_fee -wret_refunded_cash-wret_reversed_charge-wret_account_credit wr_net_loss FROM s_web_returns LEFT OUTER JOIN date_dim ON (cast(wret_return_date as date) = d_date) LEFT OUTER JOIN time_dim ON ((CAST(SUBSTR(wret_return_time,1,2) AS integer)3600 +CAST(SUBSTR(wret_return_time,4,2) AS integer)60+CAST(SUBSTR(wret_return_time,7,2) AS integer))=t_time) LEFT OUTER JOIN item ON (wret_item_id = i_item_id) LEFT OUTER JOIN customer c1 ON (wret_return_customer_id = c1.c_customer_id) LEFT OUTER JOIN customer c2 ON (wret_refund_customer_id = c2.c_customer_id) LEFT OUTER JOIN reason ON (wret_reason_id = r_reason_id) LEFT OUTER JOIN web_page ON (wret_web_page_id = WP_WEB_PAGE_id) WHERE i_rec_end_date IS NULL AND wp_rec_end_date IS NULL;

But there's no wret_web_page_id column defined in s_web_returns. So would you please have a look and also modify the refresh data generator to add this column?

Thanks in advance for your help.

Regards Eisen

eisenwang commented 7 years ago

And also puzzled on the difference between this refresh data list and the official document list -- In "TPC BENCHMARK ™ DS Standard Specification Version 2.3.0" Table 5-4, we can see -- only s_catalog_returns, s_catalog_sales, s_inventory, s_store_returns, s_purchase_lineitem, s_web_returns, s_web_order_lineitem -- 7 source tables. But in refresh data, there are 21 tables -- asiq160@bigDataIQ:/iqdata/dump/refresh> ls s*.dat s_call_center_1.dat s_customer_address_1.dat s_store_1.dat s_web_returns_1.dat s_catalog_order_1.dat s_inventory_1.dat s_store_returns_1.dat s_web_site_1.dat s_catalog_order_lineitem_1.dat s_item_1.dat s_warehouse_1.dat s_zip_to_gmt_1.dat s_catalog_page_1.dat s_promotion_1.dat s_web_order_1.dat s_catalog_returns_1.dat s_purchase_1.dat s_web_order_lineitem_1.dat s_customer_1.dat s_purchase_lineitem_1.dat s_web_page_1.dat

And in the refresh data, there's no s_catalog_sales data file either...

Would you please tell me what use for the other source tables and how to fix s_catalog_sales data file loss? Thanks

Regards Eisen

gregrahn commented 7 years ago

Looks like the issue is that tpcds_source.sql has the first column of s_web_returns as wret_web_site_id, but the spec has it as wret_web_page_id. See page 97 Table A-18: Column definition s_web_returns.

eisenwang commented 7 years ago

Dear Gregrahn,

Thanks a lot. Now the wrv is right. But still no s_catalog_sales refresh data generated... Please kind help. Thanks

Regards Eisen

gregrahn commented 7 years ago

It looks like Table 5-4 in the spec has an error in the name s_catalog_sales -- this should be s_catalog_order since the view csv contains the table s_catalog_order and Table 5-4 is the only occurrence of s_catalog_sales in the entire spec. Also see Appendix A. noting the cord_* columns referenced in the csv view belong to table s_catalog_order.

eisenwang commented 7 years ago

Oh... Got it. Thank you very much

gregrahn commented 5 years ago

Fixed via #42