mrchristine / db-migration

Databricks Migration Tools

Deleted Clusters requiring cluster id to exist before being imported again #67

Open Anyaoha opened 3 years ago

Anyaoha commented 3 years ago

I think there's an issue with the mapping and import of clusters.


If I export my clusters from, say, workspace A and import them into workspace B, everything works perfectly.

However, if I delete one cluster from workspace B and then re-run the entire export from A and import into B, I get a mapping error: the import process expects the cluster id to already exist in workspace B before it will import the cluster. That seems backwards, because if the cluster id already existed, I wouldn't need to import it.

Here's the error:

Applying acl for <cluster 1***********> (the deleted cluster which I'm expecting to bring back)
Get: https://adb-*******************.*.azuredatabricks.net/api/2.0/clusters/list
Traceback (most recent call last):
  File "/project/import_db.py", line 197, in <module>
    main()
  File "/project/import_db.py", line 96, in main
    cl_c.import_cluster_configs()
  File "/project/dbclient/ClustersClient.py", line 272, in import_cluster_configs
    raise ValueError('Cluster id must exist in new env. Re-import cluster configs.')
ValueError: Cluster id must exist in new env. Re-import cluster configs.
mrchristine commented 3 years ago

I followed your steps and was unable to reproduce this issue. It looks like you might have conflicting data in your export / import when identifying the cluster id mapping since cluster ids are not persisted across workspaces.

Make sure you either set a new export directory, or reset the export directory when you export the source workspace clusters and try again.

  --reset-exports       Clear export directory
  --set-export-dir SET_EXPORT_DIR
                        Set the base directory to export artifacts
Anyaoha commented 3 years ago

I debugged the issue further and found that the Databricks export and import processes don't use the same instance pool id.

(base) C:\Users\<user name>\>databricks instance-pools list --profile DEMO

(base) C:\Users\<user name>\>databricks instance-pools list --profile NEW_DEMO

The above commands returned different IDs for the same instance pool name.

So I suspect that any cluster whose configuration references an instance pool by id will run into this problem on import.
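A minimal sketch of the remapping that has to happen, assuming hypothetical field shapes (the function and variable names below are mine, not the tool's): build a name-to-id table from each workspace's pool list, then rewrite the exported cluster's `instance_pool_id` before importing.

```python
# Sketch only: remap an exported cluster's instance_pool_id by pool *name*,
# since pool ids differ between workspaces. The pool lists below stand in
# for the output of `databricks instance-pools list` against each profile.

def build_name_to_id(pools):
    """Map pool name -> pool id for one workspace."""
    return {p["instance_pool_name"]: p["instance_pool_id"] for p in pools}

def remap_cluster_pool(cluster_config, source_pools, target_pools):
    """Rewrite instance_pool_id so it refers to the target workspace's pool."""
    source_names = {p["instance_pool_id"]: p["instance_pool_name"]
                    for p in source_pools}
    target_ids = build_name_to_id(target_pools)
    old_id = cluster_config.get("instance_pool_id")
    if old_id in source_names:
        cluster_config["instance_pool_id"] = target_ids[source_names[old_id]]
    return cluster_config

# Hypothetical data: same pool name, different ids per workspace.
src = [{"instance_pool_name": "demo-pool", "instance_pool_id": "pool-aaa"}]
dst = [{"instance_pool_name": "demo-pool", "instance_pool_id": "pool-bbb"}]
cfg = {"cluster_name": "cluster1", "instance_pool_id": "pool-aaa"}
print(remap_cluster_pool(cfg, src, dst)["instance_pool_id"])  # pool-bbb
```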

mrchristine commented 3 years ago

Correct, you cannot map the same id across multiple workspaces. Pool IDs, job IDs, cluster IDs: any ID that Databricks generates cannot be defined in another workspace. That is likely the cause of the error above.

This is why there's a map function: we create a lookup table keyed by name, and fetch the new ID generated in the new environment.
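The lookup-table idea can be sketched like this (the names are illustrative, not the tool's actual code): each exported object's *name* is the stable key, and at import time it is resolved to whatever id the new workspace generated.

```python
# Sketch of a name-based id lookup. Ids generated in the source workspace
# are never reused; each exported object's name is joined against the
# target workspace's listing to find the id the new workspace assigned.

def id_map_by_name(source_objects, target_objects, name_key, id_key):
    """Build an old-id -> new-id map, joined on the object's name."""
    target_by_name = {o[name_key]: o[id_key] for o in target_objects}
    return {o[id_key]: target_by_name[o[name_key]]
            for o in source_objects if o[name_key] in target_by_name}

# Hypothetical cluster lists from the old and new workspaces.
old_clusters = [{"cluster_name": "etl", "cluster_id": "0101-aaa"}]
new_clusters = [{"cluster_name": "etl", "cluster_id": "0202-bbb"}]
mapping = id_map_by_name(old_clusters, new_clusters,
                         "cluster_name", "cluster_id")
print(mapping)  # {'0101-aaa': '0202-bbb'}
```

The same join works for pools and jobs, as long as names are unique within a workspace.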