mrchristine / db-migration

Databricks Migration Tools
43 stars 27 forks

delta and external tables in metastore import fails #54

Closed gobiviswanath closed 3 years ago

gobiviswanath commented 3 years ago

We used this tool to migrate Delta tables, and there is an issue with the Delta table import:

1) There is a bug in the migration tool: the location path is missing from the generated CREATE statement, so table creation fails during the metastore migration. The exact details are highlighted in the screenshots.

The migration tool generates the Delta import command in the format below, but it fails:

CREATE TABLE events
  USING DELTA

Error:

org.apache.spark.sql.AnalysisException: Cannot create table ('`test_db`.`test_table1`'). The associated location ('dbfs:/mnt/mountdbfs1/ods/test_table1') is not empty.;

Py4JJavaError                             Traceback (most recent call last)
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o210.sql.
: org.apache.spark.sql.AnalysisException: Cannot create table ('`test_db`.`test_table1`'). The associated location ('dbfs:/mnt/mountdbfs1/ods/test_table1') is not empty.;
	at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.com$databricks$sql$transaction$tahoe$commands$CreateDeltaTableCommand$$assertPathEmpty(CreateDeltaTableCommand.scala:186)
	at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand$$anonfun$run$2.apply(CreateDeltaTableCommand.scala:136)
	at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand$$anonfun$run$2.apply(CreateDeltaTableCommand.scala:93)
	at com.databricks.logging.UsageLogging$$anonfun$recordOperation$1.apply(UsageLogging.scala:428)
	at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233)
	at com.databricks.spark.util.PublicDBLogging.withAttributionContext(DatabricksSparkUsageLogger.scala:18)
	at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:275)
	at com.databricks.spark.util.PublicDBLogging.withAttributionTags(DatabricksSparkUsageLogger.scala:18)
	at com.databricks.logging.UsageLogging$class.recordOperation(UsageLogging.scala:409)
	at com.databricks.spark.util.PublicDBLogging.recordOperation(DatabricksSparkUsageLogger.scala:18)
	at com.databricks.spark.util.PublicDBLogging.recordOperation0(DatabricksSparkUsageLogger.scala:55)
	at 

The correct command should be

CREATE TABLE events
  USING DELTA
  LOCATION '/mnt/delta/events'

2) There are also issues after migration with some external tables whose underlying files have the .seq.gz extension; these need to be fixed as well, since they return an empty dataset.
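A quick way to spot the affected entries in the exported DDL is to flag Delta CREATE statements that carry no LOCATION clause. This is a minimal sketch; `missing_location` is a hypothetical helper, not part of the migration tool:

```python
import re

def missing_location(ddl: str) -> bool:
    """Flag Delta CREATE TABLE statements that have no LOCATION clause;
    importing such DDL fails when the target path already holds data."""
    is_delta = re.search(r"\bUSING\s+DELTA\b", ddl, re.IGNORECASE)
    has_location = re.search(r"\bLOCATION\s+'", ddl, re.IGNORECASE)
    return bool(is_delta) and not has_location

print(missing_location("CREATE TABLE events\n  USING DELTA"))  # True
print(missing_location(
    "CREATE TABLE events\n  USING DELTA\n  LOCATION '/mnt/delta/events'"
))  # False
```

Running a check like this over the exported `logs/metastore/{db_name}` files would show which tables will hit the "associated location is not empty" error on import.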

mrchristine commented 3 years ago

@gobiviswanath thanks for reporting this.

  1. It looks like the DDL you provided is not the table that's causing the issue. Please go into logs/metastore/{db_name}/table_name and provide the DDL for the table that's throwing the error.

  2. For the second issue, please review the exported metastore entries and we can look into a fix if needed. Have you tried the --metastore-unicode option for the metastore export to ensure that control characters are captured? I'll need to know the export command used here.

mrchristine commented 3 years ago

It looks like a bug on Azure with Delta CTAS statements when we export the DDL. I couldn't reproduce this on AWS. Could you open an eng ticket internally and have them take a look?

gobiviswanath commented 3 years ago

Well, this does not look like a bug in the DDL export itself: if we create a Delta table without specifying a location, the table inherits its location from the database path, and

%sql show create table image

will omit the location. This seems like expected behaviour.

When we use this

CREATE TABLE `test_delta_externalDB`.`customers_adlspath` (
  `c_custkey` BIGINT,
  `c_name` STRING,
  `c_address` STRING,
  `c_nationkey` BIGINT,
  `c_phone` STRING,
  `c_acctbal` DECIMAL(12,2),
  `c_comment` STRING,
  `c_mktsegment` STRING)
USING DELTA

to create the table at the destination, it fails. This is mainly because:

a) migrated databases fail to inherit the path/location from the source workspace (issue #56), and

b) we do not add the path to the CREATE statements we generate by describing the table at the source. Delta tables explicitly require a LOCATION for shallow creation, i.e. registering the table over existing data.
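The missing location can be recovered at the source by describing the table. A minimal sketch of pulling the Location field out of `DESCRIBE TABLE EXTENDED` output, which Spark returns as (col_name, data_type, comment) rows; the sample rows below are illustrative, not taken from the actual table:

```python
def location_from_describe(rows):
    """Extract the storage location from `DESCRIBE TABLE EXTENDED` output,
    where each row is a (col_name, data_type, comment) tuple."""
    for col_name, data_type, _ in rows:
        if col_name.strip() == "Location":
            return data_type.strip()
    return None

# Illustrative DESCRIBE TABLE EXTENDED output (abbreviated).
rows = [
    ("c_custkey", "bigint", None),
    ("# Detailed Table Information", "", ""),
    ("Location", "dbfs:/mnt/mountdbfs1/ods/test_table1", ""),
    ("Provider", "delta", ""),
]
print(location_from_describe(rows))  # dbfs:/mnt/mountdbfs1/ods/test_table1
```

With the location in hand, the exporter could emit a complete CREATE statement instead of one that only works when the target path is empty.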

mrchristine commented 3 years ago

Closing this as it's related to #56. Please open a new issue if you run into problems with the latest changes.

gobiviswanath commented 3 years ago

This is still an issue with Delta tables, and I have contributed a fix: I attached the file here. Could you please review and commit it?

HiveClient.txt

gobiviswanath commented 3 years ago

The fix describes the table to obtain its location and appends a LOCATION clause to DDL commands that are missing one for Delta tables. I specifically added a check for Delta because other table types still need testing.
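The logic described above can be sketched roughly like this. This is a simplified illustration of the approach, not the attached HiveClient.txt patch itself; `patch_delta_ddl` is a hypothetical name:

```python
def patch_delta_ddl(ddl, location):
    """Append a LOCATION clause to Delta CREATE statements that lack one.
    Restricted to Delta tables (as in the described fix) because other
    table types still need testing."""
    if location and "USING DELTA" in ddl.upper() and "LOCATION" not in ddl.upper():
        return f"{ddl.rstrip()}\n  LOCATION '{location}'"
    return ddl

ddl = "CREATE TABLE events\n  USING DELTA"
print(patch_delta_ddl(ddl, "/mnt/delta/events"))
```

Statements that already carry a LOCATION, or that use a non-Delta provider, pass through unchanged.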

gobiviswanath commented 3 years ago

It looks like I do not have access to push, so I have attached the file here.

mrchristine commented 3 years ago

It's unclear why we need your patch, since #56 was not addressed because of an issue with Azure serializing the function. This fix doesn't address that issue.

mrchristine commented 3 years ago

I'd also recommend looking into how to fork a GitHub repo and submit pull requests.

gobiviswanath commented 3 years ago

The fix is needed because the Delta import at the destination fails after export:

Get: https://adb-4425252124055355.15.azuredatabricks.net/api/1.2/commands/status

ERROR: org.apache.spark.sql.AnalysisException: Cannot create table ('deltadb3.orders_dbfspath'). The associated location ('abfss://gobidatagen@adlsgen2passthroughtest.dfs.core.windows.net/deltaDB3/orders_dbfspath') is not empty.;

Py4JJavaError Traceback (most recent call last)
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
 327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
 329 else:

Py4JJavaError: An error occurred while calling o211.sql.
: org.apache.spark.sql.AnalysisException: Cannot create table ('deltadb3.orders_dbfspath'). The associated location ('abfss://gobidatagen@adlsgen2passthroughtest.dfs.core.windows.net/deltaDB3/orders_dbfspath') is not empty