Open RobinL opened 3 years ago
Interesting, it must be the ctas_approach
from wrangler but I dno if we might be causing it by setting a temp database?
Some more notes:
I was running into this issue again today. This time I was getting the right columns but in the wrong order.
Again, running the query generated by pydbtools in Athena gave the right result:
select c.postcode, d.pc_nospace, d.count_distinct_person, c.codes from mojap_de_temp_1613136638555168000.pc_with_distinct_persons as d left join mojap_de_temp_1613136638555168000.pc_with_codes as c on d.pc_nospace = c.pc_nospace
order by d.count_distinct_person desc;
Further inspection suggests I was getting all the right data but in the wrong order. (using ctas_approach =True
)
This makes sense if using the arrow reader, because presumably it reads the parquet files into pandas in any order.
I also ran the query directly in wranger.athena.read_sql_query
and got the same result i.e. this doesn't seem to be an issue with the manipulations pydbtools
does.
A good start to further troubleshoot this would be to put some logging in to wrangler
at the DEBUG level which outputs
1.. the SQL statement executed and
I've run into an issue a few times where I run a query usin
read_sql_query
and get completely different results frmo the Athena GUI.Restarting kernel and re-running fixes it.
I think this is likely to be a result of caching behaviour - possibly from
awswrangler
.I haven't yet been able to create a fully reproducible example. The query I was running this morning was:
Note: This isn't simply a result of different rows being returned (which might be expected due to the distributed nature of the query). The query was returning the wrong columns and the wrong number of rows