sryza / aas

Code to accompany Advanced Analytics with Spark from O'Reilly Media

Chapter 1, page 46 #148

Closed tanthiamhuat closed 2 years ago

tanthiamhuat commented 2 years ago

```python
matchSummaryT.createOrReplaceTempView("match_desc")
missSummaryT.createOrReplaceTempView("miss_desc")
spark.sql("""
  SELECT a.field, a.count + b.count total, a.mean - b.mean delta
  FROM match_desc a INNER JOIN miss_desc b ON a.field = b.field
  WHERE a.field NOT IN ("id_1", "id_2")
  ORDER BY delta DESC, total DESC
""").show()
```

I tried to follow your code from the beginning of Chapter 1 up to page 46, and I get the error below:

AnalysisException: [MISSING_COLUMN] Column 'a.field' does not exist. Did you mean one of the following? [a.id_1, a.id_2, a.cmp_bd, b.id_1, b.id_2, b.cmp_bd, a.cmp_bm, a.cmp_by, a.cmp_plz, b.cmp_bm, b.cmp_by, b.cmp_plz, a.cmp_sex, a.summary, b.cmp_sex, b.summary, a.cmp_fname_c1, a.cmp_fname_c2, b.cmp_fname_c1, b.cmp_fname_c2, a.cmp_lname_c1, a.cmp_lname_c2, b.cmp_lname_c1, b.cmp_lname_c2]; line 3 pos 46; 'Sort ['delta DESC NULLS LAST, 'total DESC NULLS LAST], true +- 'Project [('a.count + 'b.count) AS total#9710, ('a.mean - 'b.mean) AS delta#9711] +- 'Filter NOT 'a.field IN (id_1,id_2) +- 'Join Inner, ('a.field = 'b.field) :- SubqueryAlias a : +- SubqueryAlias match_desc : +- View (match_desc, [summary#7343,id_1#7344,id_2#7345,cmp_fname_c1#7346,cmp_fname_c2#7347,cmp_lname_c1#7348,cmp_lname_c2#7349,cmp_sex#7350,cmp_bd#7351,cmp_bm#7352,cmp_by#7353,cmp_plz#7354]) : +- LocalRelation [summary#7343, id_1#7344, id_2#7345, cmp_fname_c1#7346, cmp_fname_c2#7347, cmp_lname_c1#7348, cmp_lname_c2#7349, cmp_sex#7350, cmp_bd#7351, cmp_bm#7352, cmp_by#7353, cmp_plz#7354] +- SubqueryAlias b +- SubqueryAlias miss_desc +- View (miss_desc, [summary#9441,id_1#9442,id_2#9443,cmp_fname_c1#9444,cmp_fname_c2#9445,cmp_lname_c1#9446,cmp_lname_c2#9447,cmp_sex#9448,cmp_bd#9449,cmp_bm#9450,cmp_by#9451,cmp_plz#9452]) +- LocalRelation [summary#9441, id_1#9442, id_2#9443, cmp_fname_c1#9444, cmp_fname_c2#9445, cmp_lname_c1#9446, cmp_lname_c2#9447, cmp_sex#9448, cmp_bd#9449, cmp_bm#9450, cmp_by#9451, cmp_plz#9452]

I'm not sure whether a.field is defined anywhere.
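For readers hitting the same error: the column list in the message (summary, id_1, id_2, cmp_...) suggests the views still have one row per statistic and one column per field, while the query expects one row per field with count and mean columns. Below is a minimal pure-Python sketch of that reshaping. The helper name `transpose_summary` and the numbers are hypothetical, chosen only to show the shape; this is not the book's implementation and not real output from the dataset.

```python
def transpose_summary(rows):
    """Turn summary-style rows (one per statistic) into one row per field.

    `rows` is a list of dicts that each have a "summary" key naming the
    statistic, plus one key per data column. Hypothetical helper.
    """
    fields = [key for key in rows[0] if key != "summary"]
    transposed = []
    for field in fields:
        record = {"field": field}
        for row in rows:
            # Each statistic becomes its own numeric column.
            record[row["summary"]] = float(row[field])
        transposed.append(record)
    return transposed


# Made-up values, only to illustrate the shape.
summary_rows = [
    {"summary": "count", "cmp_sex": "100", "cmp_plz": "90"},
    {"summary": "mean", "cmp_sex": "0.5", "cmp_plz": "0.25"},
]

summary_t = transpose_summary(summary_rows)
for record in summary_t:
    print(record)
```

After a reshaping like this, every row carries a `field` value, which is what `a.field`, `a.count`, and `a.mean` in the query rely on.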

srowen commented 2 years ago

What edition of the book? I don't see anything like this on page 46 of the 2nd or 3rd edition.

tanthiamhuat commented 2 years ago

[image: screenshot of the book page]

srowen commented 2 years ago

@analyticalmonk can you take a look? This is page 30 in the 3rd edition. It might be helpful to get your code merged into this repo.

tanthiamhuat commented 2 years ago

So is there a mistake somewhere, or am I missing some code?

analyticalmonk commented 2 years ago

@tanthiamhuat Can you please clarify the source of your screenshot? What is the name of the book chapter that it is from?

@srowen The above screenshot seems to be from the 2nd edition, since the variable naming in the 3rd edition is different.

[image: Screenshot from 2022-07-05 10-21-07]

I will nevertheless push the code to this repo in the next couple of days.

tanthiamhuat commented 2 years ago

Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark

tanthiamhuat commented 2 years ago

Or did I post the issue in the wrong place? I realized your GitHub repo is for "Advanced Analytics with Spark". If so, sorry, I will delete the issue.

analyticalmonk commented 2 years ago

@tanthiamhuat You posted at the right place. Sorry about the confusion. "Advanced Analytics with PySpark" is the new edition of "Advanced Analytics with Spark".

It also looks like you are not using the latest version of the book. Where did you get access to the book from? If you have access to the O'Reilly platform, can you please check the code in the latest version and try to run it?

This repo doesn't have the code for the "Advanced Analytics with PySpark" edition yet. I will push that within a day or two and let you know once that is done. You can then try to use the code from this repo if you don't have access to the latest book version.

tanthiamhuat commented 2 years ago

I got to your GitHub link from exactly here: https://www.oreilly.com/library/view/advanced-analytics-with/9781098103644/. I will wait for your updated code in this repo, then. Thanks.

srowen commented 2 years ago

Ah right, this is also in the 2nd edition.

The code works correctly for me. Are you sure you are executing it as in the listing? The table match_desc definitely has a column named field.
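One quick way to check this before running the SQL is to compare the columns the query needs with the columns the view actually has. The sketch below is pure Python with a hypothetical helper, `missing_columns`. The view column names are copied from the error message earlier in this thread, and the required names are the ones the query references.

```python
def missing_columns(available, required):
    """Return the required column names that are absent from `available`."""
    present = set(available)
    return [name for name in required if name not in present]


# Columns of the match_desc view as reported in the AnalysisException above.
view_columns = [
    "summary", "id_1", "id_2", "cmp_fname_c1", "cmp_fname_c2",
    "cmp_lname_c1", "cmp_lname_c2", "cmp_sex", "cmp_bd", "cmp_bm",
    "cmp_by", "cmp_plz",
]

# Columns the query reads from each view: a.field, a.count, a.mean.
required = ["field", "count", "mean"]

missing = missing_columns(view_columns, required)
print(missing)
```

In a PySpark session, the first argument could be `matchSummaryT.columns`, which is a plain list of column names. A non-empty result means the DataFrame registered as the view does not have the shape the query expects.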

tanthiamhuat commented 2 years ago

Can I have the ISBN-10 and ISBN-13 for the correct (latest) 3rd edition?

srowen commented 2 years ago

It's the one with PySpark in the title, not Spark.

tanthiamhuat commented 2 years ago

Solved. I realized it was a typo in my function. Apologies. :(