moj-analytical-services / pydbtools

Python version of dbtools
https://moj-analytical-services.github.io/pydbtools/
10 stars 2 forks source link

Query caching issues - query returning results different to Athena GUI #41

Open RobinL opened 3 years ago

RobinL commented 3 years ago

I've run into an issue a few times where I run a query usin read_sql_query and get completely different results frmo the Athena GUI.

Restarting kernel and re-running fixes it.

I think this is likely to be a result of caching behaviour - possibly from awswrangler.

I haven't yet been able to create a fully reproducible example. The query I was running this morning was:

sql = """
select *
from data_linking_temp.nomis_csv_qa
limit 10
"""

df = pydb.read_sql_query(sql)

Note: This isn't simply a result of different rows being returned (which might be expected due to the distributed nature of the query). The query was returning the wrong columns and the wrong number of rows

isichei commented 3 years ago

Interesting, it must be the ctas_approach from wrangler but I dno if we might be causing it by setting a temp database?

RobinL commented 3 years ago

Some more notes:

I was running into this issue again today. This time I was getting the right columns but in the wrong order.

Again, running the query generated by pydbtools in Athena gave the right result:

select c.postcode, d.pc_nospace, d.count_distinct_person, c.codes from mojap_de_temp_1613136638555168000.pc_with_distinct_persons as d left join mojap_de_temp_1613136638555168000.pc_with_codes as c on d.pc_nospace = c.pc_nospace 
order by d.count_distinct_person desc;

Further inspection suggests I was getting all the right data but in the wrong order. (using ctas_approach =True)

This makes sense if using the arrow reader, because presumably it reads the parquet files into pandas in any order.

I also ran the query directly in wranger.athena.read_sql_query and got the same result i.e. this doesn't seem to be an issue with the manipulations pydbtools does.

RobinL commented 3 years ago

A good start to further troubleshoot this would be to put some logging in to wrangler at the DEBUG level which outputs 1.. the SQL statement executed and

  1. the args which are submitted to AWS wrangler