moj-analytical-services / pydbtools

Python version of dbtools
https://moj-analytical-services.github.io/pydbtools/

Data types and speed #15

Open RobinL opened 4 years ago

RobinL commented 4 years ago

Data types and speed

The current implementation saves results out to CSV in S3 (Athena's default behaviour) and then reads them back in from S3.

However, it is possible to save results out to parquet instead, using a CREATE TABLE AS (CTAS) statement.
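As a rough sketch of what wrapping a query this way could look like: the function name `wrap_select_as_ctas` and the database, table and bucket names are all made up for illustration and are not part of pydbtools. The `WITH (format = 'PARQUET', external_location = ...)` clause is Athena's CTAS syntax.

```python
def wrap_select_as_ctas(select_sql, database, table, s3_location):
    """Wrap a SELECT statement in an Athena CTAS statement that writes parquet.

    Hypothetical helper for illustration only; not an existing pydbtools function.
    """
    # Drop surrounding whitespace and any trailing semicolon so the
    # query can be embedded after "AS".
    select_sql = select_sql.strip().rstrip(";").strip()
    return (
        f"CREATE TABLE {database}.{table}\n"
        f"WITH (format = 'PARQUET', external_location = '{s3_location}')\n"
        f"AS\n{select_sql}"
    )


# Example with placeholder names (assumed, not real resources)
ctas_sql = wrap_select_as_ctas(
    "SELECT * FROM my_table;",
    database="tmp_db",
    table="tmp_results",
    s3_location="s3://my-temp-bucket/tmp_results/",
)
print(ctas_sql)
```

The resulting statement would be submitted to Athena in place of the original query, and the parquet files under the external location read back afterwards. The temporary table would also need dropping once the data has been read.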

This has two benefits:

One potential issue with this approach is that the user must submit a SELECT statement (not, e.g., a delete table statement), because CREATE TABLE AS can only wrap a query. So, if we're worried about this, we would need to parse the SQL statement somehow to make sure it's a SELECT statement.

I previously had a go at this here, which does work in most situations, but it's very rough and ready.
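For reference, a minimal sketch of the kind of check described above. This is not the earlier attempt linked above; `is_select_statement` is a hypothetical name. It only strips SQL comments and looks at the first keyword, so it will not catch everything (for example, several statements separated by semicolons, or comment markers inside string literals).

```python
import re

# Matches "-- line comments" and "/* block comments */"
_COMMENT_RE = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)
# A query starts with SELECT, or with WITH for a common table expression
_QUERY_START_RE = re.compile(r"(select|with)\b", re.IGNORECASE)


def is_select_statement(sql):
    """Return True if the statement looks like a SELECT query.

    Heuristic only: removes comments, skips leading whitespace and
    opening parentheses, then checks the first keyword.
    """
    stripped = _COMMENT_RE.sub(" ", sql).lstrip("( \t\r\n")
    return _QUERY_START_RE.match(stripped) is not None


print(is_select_statement("SELECT * FROM my_table"))
print(is_select_statement("DROP TABLE my_table"))
```

A proper SQL parser would be more robust than this, at the cost of an extra dependency.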

Once we've done this, we should probably deprecate the python_athena_tools repo.

isichei commented 4 years ago

Additional notes after chat:

isichei commented 4 years ago

Should fix #17