schireson / pytest-mock-resources

Pytest Fixtures that let you actually test against external resource (Postgres, Mongo, Redshift...) dependent code.
https://pytest-mock-resources.readthedocs.io/en/latest/quickstart.html
MIT License
179 stars 19 forks source link

DISTKEY not supported #99

Open erikalfthan opened 3 years ago

erikalfthan commented 3 years ago

Describe the bug One of the main changes in redshift vs postgres is how index/keys are used. More specifically, one can specify a DISTKEY and SORTKEY for a table like

CREATE TABLE myschema.mytable ( col1 INT ,col2 VARCHAR(20) ) DISTKEY(col1) SORTKEY(col1, col2);

This can improve performance by helping redshift to keep joins on a single node, skip partitions etc

For my tests, I don't care about the performance, but I need the production SQL to look like this, and I want my tests to test actual code, not a similar but not the same variation.

It would make this mock library more complete and should probably be handled like UNLOAD etc. NB: I am a newbie using this library, so have haven't checked how this is handled.

Environment

(Also in requirements.txt before trying pytest-mock-resources SQLAlchemy psycopg2 sqlalchemy-redshift pandas

To Reproduce Steps to reproduce the behavior:

  1. Create a test case executing sql

redshift = create_redshift_fixture() def test_redshift_with_distkey(redshift): with redshift.connect() as conn: conn.execute('create table table1 ( col1 int, col2 varchar(20)) distkey(col1) sortkey(col1, col2);')

  1. Run pytest
  2. See error Error is logged and test is aborted (psycopg2.errors.SyntaxError) syntax error at or near "DISTKEY"

Expected behavior Since DISTKEY and SORTKEY probably arent relevant in test cases, ignore / remove this part of SQL before executing in docker postgres

Actual Behavior Described above - can't disclose much more anyways.

Additional context Running simpler SQL works in my current setup

Seems related to #82

DanCardin commented 3 years ago

If it helps, the reason we haven't run into this ourselves is because we use sqlalchemy-redshift + alembic + sqlalchemy tables/models to model our databases. Alembic (with sqlalchemy-redshift) will produce the distkey/sortkeys if you put them in sqlalchemy's __table_args__ (on a model), but pytest-mock-resources will directly use the models to create the tables and ignores them.

If that's not an acceptable workflow, the problem with this request is that you're executing raw SQL. While I'm not sure that i.e. a CreateTable command using sqlalchemy-redshift (with distkey set) would perform any better currently, it's at least something we could likely add robust support for because it's a structured single-purpose command that the redshift dialect supports at the sqlalchemy level.

With raw SQL, my impression is that sqlalchemy just sends your text verbatim, so there's much less we can do in a 100% correct way. We could, attempt somewhat hacky regex checks through the queries to parse out distkey/sortkey and either translate or remove them, but i wouldn't expect it to be particularly robust (for example if DISTKEY resided in some string)

erikalfthan commented 3 years ago

The performance is not gained on the CREATE TABLE statement, but can be huge on a big cluster if it ensures that subsequent JOINs never leaves a particular node.

My reason for trying this library was that it seems that no one are able to mock redshift :)

The project Im working on actually still uses Pentaho PDI with JDBC to drive most of the ETL, even if almost all functional code has left PDI, so unfortunately I can't use anything else than raw SQL and wanted to avoid the hacky regex solution. I didnt look into how you had implemented this before I tried to use it.

So, thanks for a quick reply, you can choose to close this as out of scope or similar. Maybe add a comment in the documentation that you dont support raw sql DISTKEY and SORTKEY where it is stated that you do support UNLOAD.

DanCardin commented 3 years ago

To be clear, I'm not necessarily against limited parsing the SQL statement, especially because it could likely be scoped to i.e. CREATE TABLE statements. And fyi, this is how we support COPY/UNLOAD (through a combination of specific sqlalchemy-redshift handling and manual parsing of COPY/UNLOAD statements and altering the query at runtime).

So ultimately, any support we add for raw sql statements for this is going to be doing the "hacky regex solution" (to essentially just remove them from the query string), which is honestly probably fine for most cases.

But as you've already noticed we don't support it today. If you were so inclined, I'd be happy to direct you to the relevant locations for adding support yourself, else we could certainly add support, but it wouldn't be as quick as my issue responses 😛.