seanharr11 / etlalchemy

Extract, Transform, Load: Any SQL Database in 4 lines of Code.
MIT License

Testing Framework #45

Open seanharr11 opened 5 years ago

seanharr11 commented 5 years ago

Objective: To build a framework to functionally test the following migration cases (a rough pytest-style sketch follows the list):

  1. For each source in ['sqlite', 'mysql', 'postgres', 'sql-server', 'oracle']:
    1. For each target in ['sqlite', 'mysql', 'postgres', 'sql-server', 'oracle']:
      • For each column_type in the SQLAlchemy.types supported by source
        • Create a single table on source with one column of type column_type
        • Populate the table with the "n" rows of testable data values for that column_type
          • (i.e. Integer can be [None, 1, 2,...], Bool can be [True, None, False])
          • "n" is determined by the column_type - some columns_types have more possible testable values that others
        • Define the expected SQLAlchemy.type to be created on the target
        • Migrate source -> target
        • Assert that the column in the target is of the expected SQLAlchemy.type
        • Assert that the data in the column has been transformed/migrated correctly.
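
A rough sketch of that loop as a parametrized pytest test might look like the following. The helpers (`create_and_populate`, `run_migration`, `connection_url_for`, `expected_type`, `fetch_values`, `expected_values`) are placeholders for fixtures we would still need to write, not existing etlalchemy functions:

```python
import itertools

import pytest
import sqlalchemy as sa

DIALECTS = ["sqlite", "mysql", "postgres", "sql-server", "oracle"]

# "n" testable values per column type; None exercises NULL handling.
COLUMN_CASES = {
    sa.Integer: [None, 1, 2, -5],
    sa.Boolean: [True, None, False],
}

@pytest.mark.parametrize("source,target", itertools.product(DIALECTS, DIALECTS))
@pytest.mark.parametrize("column_type,values", COLUMN_CASES.items())
def test_single_column_migration(source, target, column_type, values):
    # Create a one-column table of column_type on the source and load the test values.
    create_and_populate(source, column_type, values)
    # Migrate source -> target (a thin wrapper around ETLAlchemySource/ETLAlchemyTarget).
    run_migration(connection_url_for(source), connection_url_for(target))
    # Assert the target column type and the migrated data match expectations.
    inspector = sa.inspect(sa.create_engine(connection_url_for(target)))
    (col,) = inspector.get_columns("single_column_table")
    assert isinstance(col["type"], expected_type(source, target, column_type))
    assert fetch_values(target, "single_column_table") == expected_values(column_type, target)
```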

Now ... this will likely yield many test cases, so we need to approach this incrementally. Once these are written, we will definitely want to re-factor the God-awful if-statement massacre that I wrote 4 years ago, and replace it with something more akin to dispatching between different column types.

We will want to carry this out using Docker, and will need some os.system(...) calls to facilitate creating various Docker instances (see the helper sketch after the list). The following contains my notes on how to bootstrap various RDBMS Docker instances on a Mac:

  1. Start Docker

  2. Docker -> Preferences -> Advanced -> Memory -> Set to "> 4GB"

  3. Download & Start Oracle 12c via Docker
    • https://database.guide/how-to-install-sql-server-on-a-mac/
    • docker pull absolutapps/oracle-12c-ee
    • docker run -d --name oracle-12cR1-ee --privileged absolutapps/oracle-12c-ee

  4. Download & Start SQL Server (for Linux)
    • https://oraclespin.com/2018/03/30/docker-installation-of-oracle-database-12c-on-mac/
    • docker pull microsoft/mssql-server-linux
    • docker run -d --name sql_server_demo -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=reallyStrongPwd123' -p 1433:1433 microsoft/mssql-server-linux

  5. Download and install Postgresql via Docker
    • docker pull postgres
    • docker run --name some-postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres

  6. Download and install MySQL 5.7.24 via Docker
    • docker pull mysql:5.7.24
    • docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:5.7.24

  7. Download and install MySQL 8 via Docker
    • docker pull mysql:8
    • docker run --name some-mysql-8 -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:8
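
A minimal sketch of the os.system(...) bootstrapping mentioned above, using the images and credentials from these notes (the keys and the 30-second wait are illustrative only):

```python
import os
import time

# docker run commands copied from the notes above, keyed by container name.
RUN_COMMANDS = {
    "some-postgres": "docker run --name some-postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres",
    "some-mysql": "docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:5.7.24",
    "sql_server_demo": ("docker run -d --name sql_server_demo -e 'ACCEPT_EULA=Y' "
                        "-e 'SA_PASSWORD=reallyStrongPwd123' -p 1433:1433 microsoft/mssql-server-linux"),
    # ... oracle-12cR1-ee and MySQL 8 follow the same pattern
}

def bootstrap(container_name):
    """Start an existing container, or create it from its docker run command."""
    if os.system("docker start {0}".format(container_name)) != 0:
        os.system(RUN_COMMANDS[container_name])
        time.sleep(30)  # crude wait for the database to start accepting connections
```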

@tpow ... thoughts?

tpow commented 5 years ago

@seanharr11 Obviously testing this in functional tests as you've described is the true proof that ETLAlchemy is working as expected. I like the plan you've laid out and it seems a reasonable approach.

It appears that the links in 3 and 4 are swapped. Not hard to tell, but thought I'd note it.

In the migrate source -> target step, I think there are two techniques. To improve performance, ETLAlchemy can generate import files for the target database and load them with database-native tools, which speeds up the process. If I followed correctly, this is optional, and some database types fall back to SQLAlchemy inserts. It might be helpful to add functional test steps that cover both paths (SQLAlchemy inserts directly, and the native database tools).
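
One way to cover both paths could be a parametrized fixture; `use_native_import` below is a hypothetical switch standing in for however ETLAlchemy is told to choose between import files and plain inserts, and `run_migration`/`assert_target_matches_source` are placeholder helpers:

```python
import pytest

@pytest.fixture(params=[True, False], ids=["native-import-files", "sqlalchemy-inserts"])
def use_native_import(request):
    # Hypothetical flag: True -> database-native bulk import (e.g. mysqlimport),
    # False -> row-by-row SQLAlchemy inserts.
    return request.param

def test_migration_round_trip(use_native_import):
    run_migration(use_native_import=use_native_import)  # placeholder helper
    assert_target_matches_source()                       # placeholder helper
```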

I believe that it would also be helpful to test indexed columns, foreign keys, and other constraints. I believe ETLAlchemy intends to migrate these as well. (It would make sense to postpone this until after the basic column tests are in place, but keep it in mind.)

Thoughts on Unit Tests

Since I've been working on unit tests, I'm hopeful that most of what's proposed for functional tests can be done as unit tests before we get to the point of needing any databases. It seems to me that testing the actual database is SQLAlchemy's (SA's) job, and we're relying on it to have already done this. Practically speaking, it might be quicker and easier to set up the functional tests than to build unit tests, but functional testing alone could make it difficult to track down where a failure is: in this code, in SQLAlchemy, in the connection, in the import file format, in the import processing, etc. Hopefully the functional tests would at least pin failures down to a particular column type or specific data examples. We are also working on more unit tests, so both are helpful.

The steps you've laid out in 1 are essentially the same steps that unit tests should do. Because ETLAlchemy is so tightly tied to SA, it may be that we need databases. I'm not sure if it's possible to simulate reading from and writing to various SA dialects. If SQLAlchemy's test suite has a way to do this, it would be helpful if we could leverage a similar technique in our tests. This would be a more lightweight (and probably quicker) way to test the code because database servers wouldn't need to be set up.
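
On the "simulate dialects" idea: SQLAlchemy can at least compile a generic type against any dialect's compiler without a live server, which could back a lot of lightweight mapping checks. A small illustration:

```python
from sqlalchemy import types
from sqlalchemy.dialects import mysql, postgresql

# No database connection needed: compile generic types against each dialect.
assert types.Text().compile(dialect=mysql.dialect()) == "TEXT"
assert types.String(255).compile(dialect=postgresql.dialect()) == "VARCHAR(255)"
```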

If SA is solid, it seems like all ETLAlchemy really needs to do and test is to make sure it maps each of the given source database's column types to the proper target database column types based on the various rules and overrides. We shouldn't need to test all that SA is doing (hopefully). Where ETLAlchemy side-steps SA, such as with the data files, we'll need to test that.

To give a more specific (made-up) example of a test, say we have a SQL Server source database with a varchar(max) column and we want ETLAlchemy to send it to a target MySQL database. We need to make sure that it maps to and will generate the proper column type (perhaps text?) based on our rules and that it correctly formats the data for import (or does it through SA). Does this make sense?
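
As a unit test, that made-up example might look something like this; `column_type_for` is a hypothetical mapping helper used for illustration, not an existing etlalchemy function:

```python
import sqlalchemy as sa
from sqlalchemy.dialects import mssql, mysql

def test_sql_server_varchar_max_becomes_mysql_text():
    # VARCHAR with no length renders as VARCHAR(max) on SQL Server.
    source_type = mssql.VARCHAR()
    # column_type_for() is a hypothetical per-target-dialect mapping helper.
    target_type = column_type_for(source_type, target_dialect="mysql")
    assert isinstance(target_type, (sa.Text, mysql.TEXT))
```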

Aside: I saw that you listed specific databases to be tested. I'm wondering if there's any particular reason ETLAlchemy can't handle all SQLAlchemy (SA) dialects, so you can go from anything to anything as long as there's an SA dialect for it. For simple SQL types I suspect it already works. (Note to self: try with more esoteric dialects.) Of course, that's different from being supported and known to work. Did you think it needs to be restricted?

seanharr11 commented 5 years ago

On Unit Tests

I think you're right here - let's leverage unit tests to assert that we are mapping column_types correctly as follows:

  1. Determine if we are OK with the following method's interface: def standardize_column_type(self, column, raw_rows)
  2. Since it appears that standardize_column_type() currently has a sneaky side-effect of mutating the raw_rows passed to it, the method should probably return a tuple of (standardized_rows, column_type), as opposed to JUST column_type.
  3. Foreach (source dialect, target_dialect, source_column_type):
    1. Write a unit test to assert that the expected sqlalchemy.type column is created and returned by standardize_column_type()
    2. If custom data transformations are necessary, test that the data is transformed properly.
    3. Re-factor the implementation of standardize_column_type() with something a bit more modular than a ridiculous if-statement. It'd be extremely explicit if we had 1 file/class per target dialect (sketched after this list), where each class implemented:
      • column_type_for(self, source_column_type: sqlalchemy.type) -> sqlalchemy.type
      • transformed_value_for(self, source_column_type: sqlalchemy.type, source_value: any) -> any
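
A rough sketch of one such per-target-dialect class; the class name, registry, and the two example rules are illustrative only:

```python
import sqlalchemy as sa
from sqlalchemy.dialects import mysql

class MySQLTarget(object):
    """One class per target dialect, holding its type-mapping and value-transform rules."""

    def column_type_for(self, source_column_type):
        # Example rule: unbounded strings become TEXT on MySQL.
        if isinstance(source_column_type, sa.String) and source_column_type.length is None:
            return mysql.TEXT()
        return source_column_type

    def transformed_value_for(self, source_column_type, source_value):
        # Example rule: MySQL stores BOOLEAN as TINYINT(1), so coerce Python bools to 0/1.
        if isinstance(source_column_type, sa.Boolean) and source_value is not None:
            return int(source_value)
        return source_value

# Dispatch on the target dialect name instead of one big if-statement.
TARGET_DIALECTS = {"mysql": MySQLTarget()}
```

standardize_column_type() could then look up the class for the target dialect and return the (standardized_rows, column_type) tuple proposed in (2).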

On Functional Tests

While we can likely lean on the very mature/stable SA for the schema translation above, as you and I have both mentioned, I think we want to start with functional tests for the following (a rough sketch follows the list):

  1. Foreign Key migrations
  2. Index migrations
  3. Constraint migrations
  4. Import-file migrations (i.e. optimized bulk imports: mysqlimport, pgload, etc.)
  5. Insert-statement migrations (for dialects that don't support (4))
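
A sketch of what (1)/(2) might look like, assuming source/target engine fixtures are already provisioned (e.g. via the Docker helpers above) and `run_migration` is a placeholder wrapper around ETLAlchemySource/ETLAlchemyTarget:

```python
import sqlalchemy as sa

def test_foreign_keys_and_indexes_survive_migration(source_engine, target_engine):
    run_migration(source_engine.url, target_engine.url)  # placeholder helper
    src, tgt = sa.inspect(source_engine), sa.inspect(target_engine)
    for table in src.get_table_names():
        # Compare constrained/referred columns rather than dialect-specific constraint names.
        src_fks = {(tuple(fk["constrained_columns"]), fk["referred_table"])
                   for fk in src.get_foreign_keys(table)}
        tgt_fks = {(tuple(fk["constrained_columns"]), fk["referred_table"])
                   for fk in tgt.get_foreign_keys(table)}
        assert src_fks == tgt_fks
        src_idx = {tuple(ix["column_names"]) for ix in src.get_indexes(table)}
        tgt_idx = {tuple(ix["column_names"]) for ix in tgt.get_indexes(table)}
        assert src_idx == tgt_idx
```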

Also, you're right - ETLAlchemy should support any dialect that SA supports. My initial round of manual testing (when I quickly open-sourced this 2-3 years ago) was only done on the aforementioned "big-5" dialects, which is why I mention them above.

Note: Changing this issue to "Testing Framework", as we can break it down into separate unit/functional issues if more specific specs are required.

How does this sound for a rough spec?