Closed tanthiamhuat closed 2 years ago
What edition of the book? I don't see anything like this on page 46 of 2nd or 3rd edition
@analyticalmonk can you take a look? this is page 30 in the 3rd edition. It might be helpful to get your code merged into this repo.
So is there a mistake somewhere, or am I missing some code?
@tanthiamhuat Can you please clarify the source of your screenshot? What is the name of the book chapter that it is from?
@srowen the above screenshot seems to be from the 2nd edition since the variable naming in the 3rd edition is different. I will nevertheless push the code to this repo in the next couple of days.
Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark
Or did I post the issue in the wrong place? I just realized your GitHub repo is for "Advanced Analytics with Spark". If so, sorry, I will delete the issue.
@tanthiamhuat You posted at the right place. Sorry about the confusion. "Advanced Analytics with PySpark" is the new edition of "Advanced Analytics with Spark".
It also looks like you are not using the latest version of the book. Where did you get access to the book from? If you have access to the O'Reilly platform, can you please check the code in the latest version and try to run it?
This repo doesn't have the code for the "Advanced Analytics with PySpark" edition yet. I will push it within a day or two and let you know once that is done. You can then use the code from this repo if you don't have access to the latest version of the book.
I got to your GitHub link exactly from here (https://www.oreilly.com/library/view/advanced-analytics-with/9781098103644/). I will wait for your updated code in this repo then. Thanks.
Ah right, this is also in the 2nd edition.
The code works correctly for me. Are you sure you are executing it as in the listing? The table `match_desc` does have a column `field` for sure.
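One quick way to check this before running the query is to compare the view's column names against the ones the SQL references. Below is a minimal sketch in plain Python, so it runs without a Spark session. `missing_columns` is a hypothetical helper, not something from the book or this repo, and the column list is copied from the AnalysisException reported in this issue.

```python
def missing_columns(available, required):
    """Return the required column names that are absent from `available`."""
    present = set(available)
    return [name for name in required if name not in present]


# Columns the failing view actually had, copied from the AnalysisException.
view_columns = [
    "summary", "id_1", "id_2", "cmp_fname_c1", "cmp_fname_c2",
    "cmp_lname_c1", "cmp_lname_c2", "cmp_sex", "cmp_bd",
    "cmp_bm", "cmp_by", "cmp_plz",
]

# Columns the query reads from each view (a.field, a.count, a.mean).
needed = ["field", "count", "mean"]

print(missing_columns(view_columns, needed))  # -> ['field', 'count', 'mean']
```

With a live session, the same check would presumably be `missing_columns(matchSummaryT.columns, needed)`; an empty list means the DataFrame has the shape the query expects.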
can I have the ISBN-10 and ISBN-13 for the correct or latest 3rd edition?
It's the one with PySpark in the title, not Spark.
Solved. I realized there was a typo in my function. Apologies. :(
```python
matchSummaryT.createOrReplaceTempView("match_desc")
missSummaryT.createOrReplaceTempView("miss_desc")
spark.sql("""
SELECT a.field, a.count + b.count total, a.mean - b.mean delta
FROM match_desc a INNER JOIN miss_desc b ON a.field = b.field
WHERE a.field NOT IN ("id_1", "id_2")
ORDER BY delta DESC, total DESC
""").show()
```
I followed your code from the beginning of Chapter 1 up to page 46, where it fails with the error below:
AnalysisException: [MISSING_COLUMN] Column 'a.field' does not exist. Did you mean one of the following? [a.id_1, a.id_2, a.cmp_bd, b.id_1, b.id_2, b.cmp_bd, a.cmp_bm, a.cmp_by, a.cmp_plz, b.cmp_bm, b.cmp_by, b.cmp_plz, a.cmp_sex, a.summary, b.cmp_sex, b.summary, a.cmp_fname_c1, a.cmp_fname_c2, b.cmp_fname_c1, b.cmp_fname_c2, a.cmp_lname_c1, a.cmp_lname_c2, b.cmp_lname_c1, b.cmp_lname_c2]; line 3 pos 46;
'Sort ['delta DESC NULLS LAST, 'total DESC NULLS LAST], true
+- 'Project [('a.count + 'b.count) AS total#9710, ('a.mean - 'b.mean) AS delta#9711]
   +- 'Filter NOT 'a.field IN (id_1,id_2)
      +- 'Join Inner, ('a.field = 'b.field)
         :- SubqueryAlias a
         :  +- SubqueryAlias match_desc
         :     +- View (match_desc, [summary#7343,id_1#7344,id_2#7345,cmp_fname_c1#7346,cmp_fname_c2#7347,cmp_lname_c1#7348,cmp_lname_c2#7349,cmp_sex#7350,cmp_bd#7351,cmp_bm#7352,cmp_by#7353,cmp_plz#7354])
         :        +- LocalRelation [summary#7343, id_1#7344, id_2#7345, cmp_fname_c1#7346, cmp_fname_c2#7347, cmp_lname_c1#7348, cmp_lname_c2#7349, cmp_sex#7350, cmp_bd#7351, cmp_bm#7352, cmp_by#7353, cmp_plz#7354]
         +- SubqueryAlias b
            +- SubqueryAlias miss_desc
               +- View (miss_desc, [summary#9441,id_1#9442,id_2#9443,cmp_fname_c1#9444,cmp_fname_c2#9445,cmp_lname_c1#9446,cmp_lname_c2#9447,cmp_sex#9448,cmp_bd#9449,cmp_bm#9450,cmp_by#9451,cmp_plz#9452])
                  +- LocalRelation [summary#9441, id_1#9442, id_2#9443, cmp_fname_c1#9444, cmp_fname_c2#9445, cmp_lname_c1#9446, cmp_lname_c2#9447, cmp_sex#9448, cmp_bd#9449, cmp_bm#9450, cmp_by#9451, cmp_plz#9452]
I'm not sure whether a.field is defined.
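The column list in the error (a `summary` column plus one column per field) looks like an untransposed summary table, whereas the query expects one row per field with `count` and `mean` columns. The following is a minimal plain-Python sketch of that reshaping, using made-up numbers and a hypothetical helper name `transpose_summary`. It only illustrates the expected shape under that assumption; it is not the book's code and does not use Spark.

```python
def transpose_summary(rows):
    """Turn wide summary rows (one per metric) into long rows (one per field)."""
    fields = [key for key in rows[0] if key != "summary"]
    out = []
    for field in fields:
        record = {"field": field}
        for row in rows:
            # Each wide row contributes one metric column, e.g. "count" or "mean".
            record[row["summary"]] = float(row[field])
        out.append(record)
    return out


# Made-up values for illustration only; they are stored as strings here,
# which is why the helper converts them with float().
wide = [
    {"summary": "count", "cmp_sex": "100", "cmp_plz": "98"},
    {"summary": "mean", "cmp_sex": "0.5", "cmp_plz": "0.25"},
]

long_rows = transpose_summary(wide)
print(long_rows[0])  # -> {'field': 'cmp_sex', 'count': 100.0, 'mean': 0.5}
```

Once the data has this long shape, `a.field`, `a.count`, and `a.mean` all resolve, so a missing `field` column suggests the transposition step did not run as intended.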