sijocherian / google-bigquery

Automatically exported from code.google.com/p/google-bigquery
0 stars 0 forks source link

Left join is duplicating rows in the lefthand table #99

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. LEFT JOIN EACH on two tables produces duplication of every row 
2. The query is reproduced here:

SELECT UID, Price, date_of_transfer, postcode, property_type, new, tenure_type, 
PAON, SAON, street, locality, town, district, county, record_status, postcode8, 
LSOA11CD, LSOA11NM, MSOA11CD, MSOA11NM, LAD11CD, LAD11NM, LAD11NMW, PCDOASPLT, 
CCG_codeDH, CCG_code, CCG_name
FROM [Census_and_geographic_data.house_price_transactions3] as transactions

LEFT JOIN EACH [Census_and_geographic_data.Postcode_to_LA_CCG_MSOA_LSOA01] as 
lookups
ON transactions.postcode = lookups.normal_postcode;

Job details:
Job ID: pa-nhs-prescription:job_JOxbxkNdVHHnhV3QyAX-x56ygfw
Start Time: 5:41pm, 25 May 2014
End Time: 5:42pm, 25 May 2014
Bytes Processed: 3.26 GB
Destination Table: 
pa-nhs-prescription:Census_and_geographic_data.House_transactions_with_lookups
Write Preference: Overwrite table
Allow Large Results: true
Priority: Interactive

3. The result should contain about 19m rows and actually contains about 38m. 
Examination shows every unique row (identified by UID) is reproduced twice.

What is the expected output? What do you see instead?
The result should contain one row for each row in the left table with the extra 
matching fields from the right table. But it now contains two copies of each 
row with the extra matching fields.

What version of the product are you using? On what operating system?
The problem was originally noticed while testing Tableau for the Mac version 
8.2 beta 4 but was reproduced by running a similar query inside BigQuery using 
the Web interface to BigQuery.

Please provide any additional information below.

Original issue reported on code.google.com by stephen....@paconsulting.com on 25 May 2014 at 5:04

GoogleCodeExporter commented 9 years ago
It looks like the "lookups" table has two entries for each join key, so I think 
the query engine is doing the right thing. Perhaps you can filter out the 
duplicate entries in the lookup table? You might also want to reload or 
regenerate that lookup table (if possible) to ensure that it contains the 
correct data.

I'm going to close this bug for housekeeping purposes, but feel free to reopen 
if you need more assistance!

Original comment by jcon...@google.com on 25 May 2014 at 9:43