mrchristine / db-migration

Databricks Migration Tools

fix for delta handle #57

Closed gobiviswanath closed 3 years ago

gobiviswanath commented 3 years ago

Added a fix for Delta handling.

mrchristine commented 3 years ago

We need more information on this commit and on whether it actually solves the underlying issue. The commit blindly looks for Delta paths, sets a location, and would turn every Delta table into an unmanaged table, which changes the behavior of a user's table. A behavior change like that should be made very carefully.

You can try to update your PR or submit a new one with more information.

gobiviswanath commented 3 years ago

Hi Miklos, I analysed the Delta behaviour. You can create a Delta table as a managed table only the first time. From the second time onward, every time you migrate, it will be an external table.

gobiviswanath commented 3 years ago
  1. When a custom path for a database is used:
     a. Is the table managed or unmanaged? A managed table fails migration; an external table succeeds.
     b. Does Delta allow the path to be pre-populated for a managed table? No. It fails with the error: Cannot create table ('deltaManaged'). The associated location ('adls path') is not empty.
     c. Does Delta allow the path to be pre-populated for an unmanaged table? For an external table it does not either, but the CREATE statements generated by the migration script have an explicit location definition in the DDL, so it does succeed.
  2. Does a default database path behave the same? I assume by default you mean the path dbfs:/user/hive/warehouse.
     a. Other external tables (ORC, Parquet) in the default database work fine, or at least the table state is restored after an MSCK or FSCK repair.
     b. In the default database, the Delta table is mostly external (the path is abfs, adl, or wasb), so it does succeed, because the DDL has an explicit path for the table to be constructed from.
     c. A managed Delta table in the default database, just like any other managed table (ORC or Parquet), will fail, because the root DBFS data needs to be shipped, which is mostly manual intervention that our tool does not handle today.
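
The cases listed above can be condensed into a small lookup. This is only an illustrative sketch of the behaviour as described in this comment, not code from the migration tool; the names `Outcome` and `expected_outcome` are hypothetical.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    succeeds: bool
    reason: str


def expected_outcome(is_managed: bool, custom_db_path: bool) -> Outcome:
    """Summarize the migration result reported in the thread (hypothetical helper)."""
    if not is_managed:
        # External tables: the generated DDL names the table location explicitly.
        return Outcome(True, "DDL carries an explicit location")
    if custom_db_path:
        # Managed table under a custom database path: target location is pre-populated.
        return Outcome(False, "associated location is not empty")
    # Managed table under the default database path (DBFS root).
    return Outcome(False, "root DBFS data must be shipped manually")
```

In this summary the database path only changes the reason for a failure; whether the table is managed or external decides the outcome.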
gobiviswanath commented 3 years ago

As for the evaluation logic:

a) I considered describing every table and checking the table type from the describe output, but that is an extra API call per table, which is overhead when there are many tables.
b) The current loop works on the DDL generated by the previous "show create" statement and processes only managed Delta tables. For those it does an extended describe and appends the location. The describe call is made only if it is necessary, which I thought was the more efficient approach.
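
A minimal sketch of the approach in (b), under stated assumptions: it is not the code from this PR, and every function name here is hypothetical. It assumes a managed Delta table can be recognized by a `USING delta` clause with no `LOCATION` clause in the DDL, and that the extended describe output arrives as `(col_name, data_type, comment)` rows with the path in the `data_type` column of the `Location` row. The describe step is passed in as a callable so that it runs only when needed.

```python
import re

# Assumption: managed Delta DDL has "USING delta" and no LOCATION clause.
_USING_DELTA = re.compile(r"\bUSING\s+delta\b", re.IGNORECASE)
_LOCATION = re.compile(r"\bLOCATION\s+'[^']*'", re.IGNORECASE)


def is_managed_delta_ddl(ddl):
    """True when the DDL declares a Delta table without an explicit location."""
    return bool(_USING_DELTA.search(ddl)) and not _LOCATION.search(ddl)


def location_from_describe(rows):
    """Return the value of the 'Location' row from extended describe rows, or None."""
    for row in rows:
        if (row[0] or "").strip().lower() == "location":
            return (row[1] or "").strip() or None
    return None


def rewrite_ddl(ddl, describe_fn):
    """Append a LOCATION clause to managed Delta DDL; leave other DDL untouched.

    describe_fn is called only for managed Delta tables, which avoids the
    extra describe call for every other table.
    """
    if not is_managed_delta_ddl(ddl):
        return ddl
    location = location_from_describe(describe_fn())
    if location is None:
        return ddl
    return f"{ddl.rstrip()}\nLOCATION '{location}'"
```

Note that appending a location this way means the recreated table is external, which is the behavior change raised earlier in this thread.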