seanharr11 opened this issue 5 years ago
@seanharr11 Obviously, functional tests like the ones you've described are the true proof that ETLAlchemy is working as expected. I like the plan you've laid out, and it seems a reasonable approach.
It appears that the links in 3 and 4 are swapped. Not hard to tell, but thought I'd note it.
In the migrate `source` -> `target` step, I think there are two techniques. To improve performance, ETLAlchemy can generate import files for the target databases that speed up the process (using database-native tools). If I followed correctly, this is optional, and some database types use SQLAlchemy inserts instead. It might be helpful to add functional test steps that exercise both paths (using SQLAlchemy directly and using the native database tools).
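For context, the call such functional tests would wrap is roughly the README-style migration below (connection strings are placeholders; I'm not sure which option, if any, forces the plain SQLAlchemy-insert path versus the native import-file path, so that toggle isn't shown):

```python
from etlalchemy import ETLAlchemySource, ETLAlchemyTarget

# Placeholder connection strings - substitute real source/target URLs.
source = ETLAlchemySource("mssql+pyodbc://user:password@DSN_NAME")
target = ETLAlchemyTarget("mysql://user:password@hostname/db_name",
                          drop_database=True)

target.addSource(source)
target.migrate()  # a functional test would then reflect the target DB and assert on it
```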
I believe that it would also be helpful to test indexed columns, foreign keys, and other constraints. I believe ETLAlchemy intends to migrate these as well. (It would make sense to postpone this until after the basic column tests are in place, but keep it in mind.)
Because I've been working on unit tests, I'm hopeful that most of what's proposed for functional tests can be done as unit tests before we get to the point of needing any databases. It seems to me that testing the actual database is the job of SQLAlchemy (SA), and we're relying on it to have already done this. Practically speaking, it might be quicker and easier to set up the functional tests than it is to build unit tests. But functional testing alone could make it difficult to track down where a failure is: whether in this code, in SQLAlchemy, in the connection, in the import file format, in the import processing, etc. Hopefully, the functional tests would at least be able to pin down failures to a particular column type or specific data examples. We are also working on more unit tests, so both are helpful.
The steps you've laid out in 1 are essentially the same steps that unit tests should do. Because ETLAlchemy is so tightly tied to SA, it may be that we need databases. I'm not sure if it's possible to simulate reading from and writing to various SA dialects. If SQLAlchemy's test suite has a way to do this, it would be helpful if we could leverage a similar technique in our tests. This would be a more lightweight (and probably quicker) way to test the code because database servers wouldn't need to be set up.
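One lightweight possibility along those lines (just a sketch; I haven't checked whether SA's own test suite works this way): SQLAlchemy can compile a type against any installed dialect without a live connection, which may be enough for the mapping-only checks:

```python
import sqlalchemy
from sqlalchemy.dialects import mssql, mysql, postgresql, sqlite

# No database server needed: render the DDL a given type would produce
# on each dialect and compare it against what we expect.
col_type = sqlalchemy.Text()

for name, dialect in [("sqlite", sqlite.dialect()),
                      ("mysql", mysql.dialect()),
                      ("postgresql", postgresql.dialect()),
                      ("mssql", mssql.dialect())]:
    print(name, "->", col_type.compile(dialect=dialect))
```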
If SA is solid, it seems like all ETLAlchemy really needs to do and test is to make sure it maps each of the given source database's column types to the proper target database column types based on the various rules and overrides. We shouldn't need to test all that SA is doing (hopefully). Where ETLAlchemy side-steps SA, such as with the data files, we'll need to test that.
To give a more specific (made-up) example of a test, say we have a SQL Server source database with a varchar(max) column and we want ETLAlchemy to send it to a target MySQL database. We need to make sure that it maps to and will generate the proper column type (perhaps text?) based on our rules, and that it correctly formats the data for import (or does so through SA). Does this make sense?
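As a sketch of what that kind of test might look like (the mapping function here is a toy stand-in for illustration only; the real rules live inside ETLAlchemy):

```python
import sqlalchemy
from sqlalchemy.dialects import mysql

def map_mssql_type_to_mysql(source_type):
    """Toy stand-in for ETLAlchemy's mapping rules, for illustration only."""
    # varchar(max) shows up as a String with no length; map it to TEXT.
    if isinstance(source_type, sqlalchemy.String) and source_type.length is None:
        return mysql.TEXT()
    return source_type

def test_varchar_max_becomes_mysql_text():
    source_type = sqlalchemy.VARCHAR()  # no length, i.e. SQL Server varchar(max)
    target_type = map_mssql_type_to_mysql(source_type)
    assert isinstance(target_type, mysql.TEXT)
```

The data-formatting half of the test would get a similar assertion on the rendered values.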
Aside: I saw that you listed specific databases to be tested. I'm wondering if there's any particular reason ETLAlchemy can't handle all SQLAlchemy (SA) dialects so you can go anything to anything as long as there's a SA dialect for it. For simple SQL types I suspect it already works. (Note to self: try with more esoteric dialects.) Of course that's different than being supported and known to work. Did you think it needs to be restricted?
On Unit Tests
I think you're right here - let's leverage unit tests to assert that we are mapping column_types correctly, as follows:

- Unit-test `def standardize_column_type(self, column, raw_rows)`. Note: `standardize_column_type()` currently has a sneaky side-effect of mutating the `raw_rows` passed to it; the method should probably return a tuple of `(standardized_rows, column_type)`, as opposed to JUST `column_type`.
- For each `(source_dialect, target_dialect, source_column_type)` combination: assert that the correct `sqlalchemy.type` column is created and returned by `standardize_column_type()`.
- Replace `standardize_column_type()` with something a bit more modular than a ridiculous if-statement. It'd be extremely explicit if we had 1 file/class per target dialect, where each class implemented (rough sketch below):
  - `column_type_for(self, source_column_type: sqlalchemy.type) -> sqlalchemy.type`
  - `transformed_value_for(self, source_column_type: sqlalchemy.type, source_value: any) -> any`
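Something like this, perhaps - a rough sketch only, with made-up class names and a couple of invented example rules:

```python
import datetime
import sqlalchemy
from sqlalchemy.dialects import mysql

class TargetDialectMapper:
    """Base class: one subclass (and one file) per target dialect."""

    def column_type_for(self, source_column_type):
        # Default: pass the source type through unchanged.
        return source_column_type

    def transformed_value_for(self, source_column_type, source_value):
        # Default: no per-value transformation.
        return source_value

class MySQLMapper(TargetDialectMapper):
    """Example target-dialect class with two illustrative (made-up) rules."""

    def column_type_for(self, source_column_type):
        # e.g. unbounded strings (varchar(max)) become TEXT on MySQL.
        if isinstance(source_column_type, sqlalchemy.String) and source_column_type.length is None:
            return mysql.TEXT()
        return source_column_type

    def transformed_value_for(self, source_column_type, source_value):
        # e.g. MySQL DATETIME is naive, so convert tz-aware values to UTC and strip tzinfo.
        if isinstance(source_value, datetime.datetime) and source_value.tzinfo is not None:
            return source_value.astimezone(datetime.timezone.utc).replace(tzinfo=None)
        return source_value
```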
On Functional Tests
While we can likely lean on the very mature/stable SA for the schema translation above, as you and I have both mentioned, I think we want to start with functional tests, per the rough spec below.
Also, you're right - ETLAlchemy should support any dialect that SA supports... My initial round of manual testing (when I quickly open-sourced this 2-3 years ago) was only done on the aforementioned "big-5" dialects; that's why I make mention of them above.
Note: Changing this issue to "Testing Framework", as we can break it down into unit/functional issues if more specific specs are required.
How does this sound for a rough spec?
Objective: To build a framework to functionally test the following migration cases:
- For each `source` in ['sqlite', 'mysql', 'postgres', 'sql-server', 'oracle']:
  - For each `target` in ['sqlite', 'mysql', 'postgres', 'sql-server', 'oracle']:
    - For each `column_type: SQLAlchemy.types` supported by `source` (i.e. each column type):
      1. Create a table on `source` with one column of type `column_type`.
      2. Load the table with test values covering `column_type` - some `column_types` have more possible testable values than others.
      3. Determine the expected `SQLAlchemy.type` to be created on the `target`.
      4. Run the migration `source` -> `target`.
      5. Assert that the column created on `target` is of the expected `SQLAlchemy.type`.
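To make that concrete, here is one way the matrix could be expressed with pytest - a skeleton only, with the databases, the full column-type list, and the assertions still to be wired up:

```python
import itertools
import pytest
import sqlalchemy

DIALECTS = ['sqlite', 'mysql', 'postgres', 'sql-server', 'oracle']

# A tiny slice of the column_type dimension; the real matrix would cover
# every type supported by each source dialect.
COLUMN_TYPES = [sqlalchemy.Integer, sqlalchemy.String, sqlalchemy.DateTime]

@pytest.mark.parametrize(
    "source,target,column_type",
    list(itertools.product(DIALECTS, DIALECTS, COLUMN_TYPES)),
)
def test_single_column_migration(source, target, column_type):
    # 1. Create a table on `source` with one column of type `column_type`.
    # 2. Load test values covering the range of `column_type`.
    # 3. Run the migration `source` -> `target`.
    # 4. Reflect the target table and assert the column is of the expected
    #    SQLAlchemy type.
    pytest.skip("skeleton only - Docker-backed databases not wired up yet")
```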
Now ... this will likely yield many test cases, so we need to approach this incrementally. Once these are written, we will definitely want to re-factor the God-awful if-statement massacre that I wrote 4 years ago, and replace it with something more akin to dispatching between different column types.

We will want to carry this out using Docker, and will need some `os.system(...)` calls to facilitate creating the various Docker instances (a minimal example of such a wrapper follows after these notes). The following contains my notes on how to bootstrap various RDBMS Docker instances on a Mac:

Start Docker
Docker -> Preferences -> Advanced -> Memory -> Set to "> 4GB"
Download & Start Oracle 12c via Docker https://database.guide/how-to-install-sql-server-on-a-mac/
docker pull absolutapps/oracle-12c-ee
docker run -d --name oracle-12cR1-ee --privileged absolutapps/oracle-12c-ee
Download & Start SQL Server (for Linux) https://oraclespin.com/2018/03/30/docker-installation-of-oracle-database-12c-on-mac/
docker pull microsoft/mssql-server-linux
docker run -d --name sql_server_demo -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=reallyStrongPwd123' -p 1433:1433 microsoft/mssql-server-linux
Download and install Postgresql via Docker
docker pull postgres
docker run --name some-postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres
Download and install MySQL 5.7.24 Via Docker
docker pull mysql:5.7.24
docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:5.7.24
Download and install MySQL 8 Via Docker
docker pull mysql:8
docker run --name some-mysql-8 -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:8
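And a minimal sketch of the `os.system(...)` wrapper mentioned above, reusing a couple of the commands from these notes verbatim (no readiness checks or teardown yet):

```python
import os

# Bootstrap a couple of the containers exactly as in the notes above.
COMMANDS = [
    "docker pull postgres",
    "docker run --name some-postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres",
    "docker pull mysql:5.7.24",
    "docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:5.7.24",
]

for cmd in COMMANDS:
    print("+ " + cmd)
    if os.system(cmd) != 0:
        raise RuntimeError("command failed: " + cmd)
```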
@tpow ... thoughts?