mrchristine / db-migration

Databricks Migration Tools
Other
43 stars 27 forks

patch-for-table-ddl in metastore export! #60

Closed gobiviswanath closed 3 years ago

gobiviswanath commented 3 years ago

Hi Miklos, thank you for the constructive feedback. I cleansed the code and updated commit. Please review and let me know if it still needs improvements.

Problem:

Table exports whose DDL definitions lack a LOCATION or path fail during import at the destination. I added a fix to take care of this.

Here is the workspace with a mixed set of DDLs covering a variety of tables: managed, external, CSV, ORC, Delta, Parquet, and Hive.

https://adb-5985677774820244.4.azuredatabricks.net/?o=5985677774820244#setting/clusters

If the tables in this workspace are exported with the current migrate script:

1) The import would fail at the destination, leaving it in an empty state. I add Charles Bernard (charles.bernard@databricks.com) as a witness; we had to make this change to migrate Prudencial in APAC, and we migrated with this script as a PoC.
2) With the patch, this is fixed.
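The fix described above boils down to making sure every exported DDL carries an explicit storage path before it is replayed at the destination. A minimal sketch of that idea, assuming the table's location has already been looked up from its storage metadata (the function name and inputs here are illustrative, not the PR's actual code):

```python
import re

def ensure_location(ddl: str, location: str) -> str:
    """Append a LOCATION clause to a CREATE TABLE DDL if it lacks one.

    `location` is assumed to come from the table's storage metadata
    (e.g. DESCRIBE EXTENDED output); this helper is a sketch, not the
    actual patch in this PR.
    """
    # If the DDL already pins its storage path, leave it untouched.
    if re.search(r"\bLOCATION\b", ddl, re.IGNORECASE):
        return ddl
    # Otherwise append an explicit LOCATION so the import does not
    # fail (or silently land under the default warehouse path).
    return f"{ddl.rstrip().rstrip(';')}\nLOCATION '{location}'"

ddl = "CREATE EXTERNAL TABLE sales (id INT)\nSTORED AS PARQUET"
print(ensure_location(ddl, "dbfs:/mnt/data/sales"))
```

The second call on an already-patched DDL is a no-op, so the helper is safe to run over a mixed export of managed and external tables.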

If you are testing this, please import spark config from migrating cluster (API_Metastore_Work_Leave_Me_Alone) from source to destination workspace.

Resolves issue: bug! delta and external tables in metastore import fails #54

Thanks again in advance!

gobiviswanath commented 3 years ago

Here is the workspace with a mixed set of DDLs covering a variety of tables: managed, external, CSV, ORC, Delta, Parquet, and Hive.

Source: https://adb-5985677774820244.4.azuredatabricks.net/?o=5985677774820244#setting/clusters
Destination: https://adb-8237512505927139.19.azuredatabricks.net/?o=8237512505927139#setting/clusters

One more thing to add: this was tested with DBR 6.4 in the source workspace (Python 3.7.4) and DBR 6.4 in the destination workspace.

gobiviswanath commented 3 years ago

I agree that managed tables are converted to external tables, but this is better than the tables not being migrated at all. We can add a note in the README documenting this limitation.

mrchristine commented 3 years ago

Thanks for the commit and tests. The main challenge with this PR is the additional API request needed to fetch the LOCATION data for each table. This is not needed since we already have all the information locally, and an additional API request per table is very costly for a migration.

I have a tested version working now that I will commit. It checks whether the database path is custom and the table DDL is missing the LOCATION attribute; if the LOCATION attribute is missing, it appends one to the DDL before importing.
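The approach described above can be sketched without any extra API calls, since the database path and table name are already known locally. This is a hedged illustration of the logic, not the committed code; the function name, parameters, and the default warehouse root are assumptions:

```python
def patch_ddl_locally(ddl: str, db_path: str, table_name: str,
                      default_db_root: str = "dbfs:/user/hive/warehouse") -> str:
    """Sketch of the described check: if the database lives at a custom
    path AND the DDL omits LOCATION, derive the table path locally and
    append it, avoiding a per-table API request."""
    has_custom_path = not db_path.startswith(default_db_root)
    missing_location = "LOCATION" not in ddl.upper()
    if has_custom_path and missing_location:
        # Derive the table location from the database path we already
        # exported, instead of fetching it from the workspace API.
        ddl = f"{ddl.rstrip().rstrip(';')}\nLOCATION '{db_path.rstrip('/')}/{table_name}'"
    return ddl
```

Tables under the default warehouse root are left alone, so only the DDLs that would otherwise fail at the destination get rewritten.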