mrchristine / db-migration

Databricks Migration Tools

export metastore using existing cluster? #27

Closed · saldroubi closed 4 years ago

saldroubi commented 4 years ago

I noticed that when doing an export with --metastore, the tool creates a cluster with the name stored in data/azure_cluster.json. However, I would like it to use an existing cluster in my workspace. Is this possible?

I changed the cluster name and version to match a cluster I created, as shown below, but I am getting an error. Here is the azure_cluster.json file, followed by the error.

{ "num_workers": 1, "cluster_name": "export", "spark_version": "7.0.x-scala2.12", "spark_conf": {}, "node_type_id": "Standard_F4s_v2", "ssh_public_keys": [], "custom_tags": {}, "spark_env_vars": { "PYSPARK_PYTHON": "/databricks/python3/bin/python3" }, "autotermination_minutes": 30, "init_scripts": [] }

######################## ERROR #######################################

    python3 ./export_db.py --azure --metastore --debug --profile ws-databricks

    https://adb-5463815377663355.15.azuredatabricks.net
    dapi6fe373fdb9f18c9613840ebefdccde43
    Export the metastore configs at 2020-09-15 17:16:17.354008
    Get: https://adb-5463815377663355.15.azuredatabricks.net/api/2.0/clusters/list
    Get: https://adb-5463815377663355.15.azuredatabricks.net/api/2.0/clusters/list
    Starting export with id 0915-220257-ruins233
    post: https://adb-5463815377663355.15.azuredatabricks.net/api/2.0/clusters/start
    Error: Cluster 0915-220257-ruins233 is in unexpected state Running.
    Get: https://adb-5463815377663355.15.azuredatabricks.net/api/2.0/clusters/get
    Cluster creation time: 0:00:00.662221
    Creating remote Spark Session
    post: https://adb-5463815377663355.15.azuredatabricks.net/api/1.2/contexts/create
    post: https://adb-5463815377663355.15.azuredatabricks.net/api/1.2/commands/execute
    Get: https://adb-5463815377663355.15.azuredatabricks.net/api/1.2/commands/status
    Get: https://adb-5463815377663355.15.azuredatabricks.net/api/1.2/commands/status
    Get: https://adb-5463815377663355.15.azuredatabricks.net/api/1.2/commands/status
    ERROR: AttributeError: databaseName
    {"resultType": "error", "summary": "AttributeError: databaseName", "cause": "
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    /databricks/spark/python/pyspark/sql/types.py in __getattr__(self, item)
       1594             # but this will not be used in normal cases
    -> 1595             idx = self.__fields__.index(item)
       1596             return self[idx]

    ValueError: 'databaseName' is not in list

    During handling of the above exception, another exception occurred:

    AttributeError                            Traceback (most recent call last)
    in <module>
    ----> 1 all_dbs = [x.databaseName for x in spark.sql("show databases").collect()]; print(len(all_dbs))

    in <listcomp>(.0)
    ----> 1 all_dbs = [x.databaseName for x in spark.sql("show databases").collect()]; print(len(all_dbs))

    /databricks/spark/python/pyspark/sql/types.py in __getattr__(self, item)
       1598                 raise AttributeError(item)
       1599             except ValueError:
    -> 1600                 raise AttributeError(item)
       1601
       1602     def __setattr__(self, key, value):

    AttributeError: databaseName
    "}

    Traceback (most recent call last):
      File "./export_db.py", line 151, in <module>
        main()
      File "./export_db.py", line 137, in main
        hive_c.export_hive_metastore(cluster_name=args.cluster_name)
      File "/Users/saldroubi/Dropbox/git/db-migration/dbclient/HiveClient.py", line 200, in export_hive_metastore
        all_dbs = self.log_all_databases(cid, ec_id, metastore_dir)
      File "/Users/saldroubi/Dropbox/git/db-migration/dbclient/HiveClient.py", line 21, in log_all_databases
        raise ValueError("Cannot identify number of databases due to the above error")
    ValueError: Cannot identify number of databases due to the above error

mrchristine commented 4 years ago

Yes, it should be possible with this option:

  --cluster-name CLUSTER_NAME
                        Cluster name to export the metastore to a specific
                        cluster. Cluster will be started.

Have you tried this option?
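
For example, something like the following, reusing the flags from the command above (the cluster name "export" here is just the one from the posted azure_cluster.json; substitute any existing cluster in the workspace):

    python3 ./export_db.py --azure --metastore --cluster-name "export" --debug --profile ws-databricks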

saldroubi commented 4 years ago

No, this did not fix it, sorry. I see that the cluster started, but then I get an error.

Is this because it is Spark 3.0?

    ERROR: AttributeError: databaseName
    {"resultType": "error", "summary": "AttributeError: databaseName", "cause": "
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    /databricks/spark/python/pyspark/sql/types.py in __getattr__(self, item)
       1594             # but this will not be used in normal cases
    -> 1595             idx = self.__fields__.index(item)
       1596             return self[idx]

    ValueError: 'databaseName' is not in list

    During handling of the above exception, another exception occurred:

    AttributeError                            Traceback (most recent call last)
    in <module>
    ----> 1 all_dbs = [x.databaseName for x in spark.sql("show databases").collect()]; print(len(all_dbs))

    in <listcomp>(.0)
    ----> 1 all_dbs = [x.databaseName for x in spark.sql("show databases").collect()]; print(len(all_dbs))

    /databricks/spark/python/pyspark/sql/types.py in __getattr__(self, item)
       1598                 raise AttributeError(item)
       1599             except ValueError:
    -> 1600                 raise AttributeError(item)
       1601
       1602     def __setattr__(self, key, value):

    AttributeError: databaseName
    "}

    Traceback (most recent call last):
      File "./export_db.py", line 151, in <module>
        main()
      File "./export_db.py", line 137, in main
        hive_c.export_hive_metastore(cluster_name=args.cluster_name)
      File "/Users/saldroubi/Dropbox/git/db-migration/dbclient/HiveClient.py", line 200, in export_hive_metastore
        all_dbs = self.log_all_databases(cid, ec_id, metastore_dir)
      File "/Users/saldroubi/Dropbox/git/db-migration/dbclient/HiveClient.py", line 21, in log_all_databases
        raise ValueError("Cannot identify number of databases due to the above error")
    ValueError: Cannot identify number of databases due to the above error

saldroubi commented 4 years ago

I just confirmed that it does NOT work with runtime version 7.0 (includes Apache Spark 3.0.0, Scala 2.12), but it works with runtime 6.5 (includes Apache Spark 2.4.5, Scala 2.11).
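
For context, Spark 3.0 renamed the column returned by SHOW DATABASES from databaseName to namespace, which matches the AttributeError above. A minimal sketch of version-agnostic alternatives, assuming a live SparkSession named spark as in the remote execution context; this is not necessarily the fix the tool adopted:

    # Per the traceback, the tool runs this line, which relies on the Spark 2.x
    # column name `databaseName` and therefore fails on Spark 3.0 / DBR 7.0:
    #   all_dbs = [x.databaseName for x in spark.sql("show databases").collect()]

    # Version-agnostic alternatives (sketch):
    all_dbs = [row[0] for row in spark.sql("show databases").collect()]  # index by position
    all_dbs = [db.name for db in spark.catalog.listDatabases()]          # catalog API, same in 2.x and 3.x
    print(len(all_dbs))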

saldroubi commented 4 years ago

I am closing this issue and opening a new one that is clearer and more concise about the problem.