linkedin / openhouse

Open Control Plane for Tables in Data Lakehouse
https://www.openhousedb.org/
BSD 2-Clause "Simplified" License

Use fileIO to delete files in dropTable #94

Closed jainlavina closed 5 months ago

jainlavina commented 5 months ago

Summary

Use FileIO instead of the Hadoop FileSystem to delete data and metadata files in dropTable. The change calls deletePrefix() and therefore expects a FileIO instance that supports prefix operations. If table operations are instantiated with a FileIO instance that does not implement SupportsPrefixOperations, dropTable throws an exception. That is acceptable because the commonly used FileIO implementations, such as HadoopFileIO and those for the major cloud providers, support prefix operations.
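For illustration, the check-then-delete behavior described above can be sketched as follows. This is a minimal, self-contained sketch and not the code from this PR: the two interfaces are local stand-ins that mirror only the relevant methods of Iceberg's FileIO and SupportsPrefixOperations, and InMemoryFileIO, purgeTableFiles, and the file paths are hypothetical names used to exercise the idea.

```java
import java.util.TreeMap;

class DropTableSketch {
    /** Local stand-in for Iceberg's FileIO; only the method needed here. */
    interface FileIO {
        void deleteFile(String path);
    }

    /** Local stand-in for Iceberg's SupportsPrefixOperations. */
    interface SupportsPrefixOperations extends FileIO {
        void deletePrefix(String prefix);
    }

    /** Hypothetical in-memory FileIO, used only to exercise the sketch. */
    static class InMemoryFileIO implements SupportsPrefixOperations {
        final TreeMap<String, byte[]> files = new TreeMap<>();

        @Override
        public void deleteFile(String path) {
            files.remove(path);
        }

        @Override
        public void deletePrefix(String prefix) {
            // Remove every stored path that starts with the given prefix.
            files.keySet().removeIf(path -> path.startsWith(prefix));
        }
    }

    /**
     * Deletes everything under the table location through FileIO.
     * Fails fast when the FileIO cannot perform prefix operations.
     */
    static void purgeTableFiles(FileIO fileIO, String tableLocation) {
        if (!(fileIO instanceof SupportsPrefixOperations)) {
            throw new UnsupportedOperationException(
                "FileIO does not support prefix operations: " + fileIO.getClass().getName());
        }
        ((SupportsPrefixOperations) fileIO).deletePrefix(tableLocation);
    }

    public static void main(String[] args) {
        InMemoryFileIO io = new InMemoryFileIO();
        // Hypothetical paths: two files of the dropped table, one of another table.
        io.files.put("/data/openhouse/d3/t1-uuid/metadata/00000.metadata.json", new byte[0]);
        io.files.put("/data/openhouse/d3/t1-uuid/data/part-0.orc", new byte[0]);
        io.files.put("/data/openhouse/d3/t2-uuid/data/part-0.orc", new byte[0]);

        purgeTableFiles(io, "/data/openhouse/d3/t1-uuid/");
        System.out.println("Remaining files: " + io.files.keySet());
    }
}
```

Passing a prefix that ends with a path separator, as in the usage above, keeps the delete from matching sibling directories whose names merely start with the same characters.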

This change allows us to completely eliminate the catalog's dependency on the Hadoop FileSystem.


Testing Done

Tested using Docker and the existing e2e tests that cover dropTable. Also added logs and validated the behavior by manually inspecting them in Docker.

Tested by inspecting namenode in docker.

  1. Created table lj_test_tbl in db "ljdb":

    $ curl "${curlArgs[@]}" -XPOST http://localhost:8000/v1/databases/ljdb/tables/ --data-raw '{ "tableId": "lj_test_tbl", "databaseId": "ljdb", "baseTableVersion": "INITIAL_VERSION", "clusterId": "LocalHadoopCluster", .......

  2. Verified that the folder and files exist in hdfs:

    [Screenshot of the HDFS namenode, 2024-04-30 2:59:13 PM]

  3. Dropped the table:

    $ curl "${curlArgs[@]}" -XDELETE http://localhost:8000/v1/databases/ljdb/tables/lj_test_tbl

  4. Verified on namenode in docker that the directory got deleted in hdfs:

    [Screenshot of the HDFS namenode, 2024-04-30 3:03:49 PM]

Docker test output.

  1. Create table:

    $ curl "${curlArgs[@]}" -XPOST http://localhost:8000/v1/databases/d3/tables/ --data-raw '{ "tableId": "t1", "databaseId": "d3", "baseTableVersion": "INITIAL_VERSION", "clusterId": "LocalHadoopCluster", "schema": "{\"type\": \"struct\", \"fields\": [{\"id\": 1,\"required\": true,\"name\": \"id\",\"type\": \"string\"},{\"id\": 2,\"required\": true,\"name\": \"name\",\"type\": \"string\"},{\"id\": 3,\"required\": true,\"name\": \"ts\",\"type\": \"timestamp\"}]}", "timePartitioning": { "columnName": "ts", "granularity": "HOUR" }, "clustering": [ { "columnName": "name" } ], "tableProperties": { "key": "value" } }' | json_pp
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  2152    0  1578  100   574   1218    443  0:00:01  0:00:01 --:--:--  1663
    {
       "clusterId" : "LocalHadoopCluster",
       "clustering" : [
          {
             "columnName" : "name",
             "transform" : null
          }
       ],
       "creationTime" : 1714428252917,
       "databaseId" : "d3",
       "lastModifiedTime" : 1714428252917,
       "policies" : null,
       "schema" : "{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}",
       "tableCreator" : "openhouse",
       "tableId" : "t1",
       "tableLocation" : "hdfs://namenode:9000/data/openhouse/d3/t1-eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3/00000-9076ff1b-5823-449f-b31c-d0d653f3e18f.metadata.json",
       "tableProperties" : {
          "key" : "value",
          "openhouse.clusterId" : "LocalHadoopCluster",
          "openhouse.creationTime" : "1714428252917",
          "openhouse.databaseId" : "d3",
          "openhouse.lastModifiedTime" : "1714428252917",
          "openhouse.tableCreator" : "openhouse",
          "openhouse.tableId" : "t1",
          "openhouse.tableLocation" : "/data/openhouse/d3/t1-eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3/00000-9076ff1b-5823-449f-b31c-d0d653f3e18f.metadata.json",
          "openhouse.tableType" : "PRIMARY_TABLE",
          "openhouse.tableUUID" : "eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3",
          "openhouse.tableUri" : "LocalHadoopCluster.d3.t1",
          "openhouse.tableVersion" : "INITIAL_VERSION",
          "policies" : "",
          "write.format.default" : "orc",
          "write.metadata.delete-after-commit.enabled" : "true",
          "write.metadata.previous-versions-max" : "28"
       },
       "tableType" : "PRIMARY_TABLE",
       "tableUUID" : "eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3",
       "tableUri" : "LocalHadoopCluster.d3.t1",
       "tableVersion" : "INITIAL_VERSION",
       "timePartitioning" : {
          "columnName" : "ts",
          "granularity" : "HOUR"
       }
    }

  2. Get table:

    $ curl "${curlArgs[@]}" -XGET http://localhost:8000/v1/databases/d3/tables/t1 | json_pp
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  1578    0  1578    0     0   8131      0 --:--:-- --:--:-- --:--:--  8092
    {
       "clusterId" : "LocalHadoopCluster",
       "clustering" : [
          {
             "columnName" : "name",
             "transform" : null
          }
       ],
       "creationTime" : 1714428252917,
       "databaseId" : "d3",
       "lastModifiedTime" : 1714428252917,
       "policies" : null,
       "schema" : "{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}",
       "tableCreator" : "openhouse",
       "tableId" : "t1",
       "tableLocation" : "hdfs://namenode:9000/data/openhouse/d3/t1-eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3/00000-9076ff1b-5823-449f-b31c-d0d653f3e18f.metadata.json",
       "tableProperties" : {
          "key" : "value",
          "openhouse.clusterId" : "LocalHadoopCluster",
          "openhouse.creationTime" : "1714428252917",
          "openhouse.databaseId" : "d3",
          "openhouse.lastModifiedTime" : "1714428252917",
          "openhouse.tableCreator" : "openhouse",
          "openhouse.tableId" : "t1",
          "openhouse.tableLocation" : "/data/openhouse/d3/t1-eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3/00000-9076ff1b-5823-449f-b31c-d0d653f3e18f.metadata.json",
          "openhouse.tableType" : "PRIMARY_TABLE",
          "openhouse.tableUUID" : "eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3",
          "openhouse.tableUri" : "LocalHadoopCluster.d3.t1",
          "openhouse.tableVersion" : "INITIAL_VERSION",
          "policies" : "",
          "write.format.default" : "orc",
          "write.metadata.delete-after-commit.enabled" : "true",
          "write.metadata.previous-versions-max" : "28"
       },
       "tableType" : "PRIMARY_TABLE",
       "tableUUID" : "eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3",
       "tableUri" : "LocalHadoopCluster.d3.t1",
       "tableVersion" : "INITIAL_VERSION",
       "timePartitioning" : {
          "columnName" : "ts",
          "granularity" : "HOUR"
       }
    }

  3. List tables in a db:

    $ curl "${curlArgs[@]}" -XGET http://localhost:8000/v1/databases/d3/tables/ | json_pp
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  1592    0  1592    0     0  25026      0 --:--:-- --:--:-- --:--:-- 25269
    {
       "results" : [
          {
             "clusterId" : "LocalHadoopCluster",
             "clustering" : [
                {
                   "columnName" : "name",
                   "transform" : null
                }
             ],
             "creationTime" : 1714428252917,
             "databaseId" : "d3",
             "lastModifiedTime" : 1714428252917,
             "policies" : null,
             "schema" : "{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}",
             "tableCreator" : "openhouse",
             "tableId" : "t1",
             "tableLocation" : "hdfs://namenode:9000/data/openhouse/d3/t1-eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3/00000-9076ff1b-5823-449f-b31c-d0d653f3e18f.metadata.json",
             "tableProperties" : {
                "key" : "value",
                "openhouse.clusterId" : "LocalHadoopCluster",
                "openhouse.creationTime" : "1714428252917",
                "openhouse.databaseId" : "d3",
                "openhouse.lastModifiedTime" : "1714428252917",
                "openhouse.tableCreator" : "openhouse",
                "openhouse.tableId" : "t1",
                "openhouse.tableLocation" : "/data/openhouse/d3/t1-eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3/00000-9076ff1b-5823-449f-b31c-d0d653f3e18f.metadata.json",
                "openhouse.tableType" : "PRIMARY_TABLE",
                "openhouse.tableUUID" : "eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3",
                "openhouse.tableUri" : "LocalHadoopCluster.d3.t1",
                "openhouse.tableVersion" : "INITIAL_VERSION",
                "policies" : "",
                "write.format.default" : "orc",
                "write.metadata.delete-after-commit.enabled" : "true",
                "write.metadata.previous-versions-max" : "28"
             },
             "tableType" : "PRIMARY_TABLE",
             "tableUUID" : "eb5975fd-f68d-44d7-9fa4-9b5a4b98a7b3",
             "tableUri" : "LocalHadoopCluster.d3.t1",
             "tableVersion" : "INITIAL_VERSION",
             "timePartitioning" : {
                "columnName" : "ts",
                "granularity" : "HOUR"
             }
          }
       ]
    }

  4. Drop table:

    $ curl "${curlArgs[@]}" -XDELETE http://localhost:8000/v1/databases/d3/tables/t1

  5. List tables in a db again:

    $ curl "${curlArgs[@]}" -XGET http://localhost:8000/v1/databases/d3/tables/ | json_pp
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100    14    0    14    0     0    162      0 --:--:-- --:--:-- --:--:--   160
    {
       "results" : []
    }


jainlavina commented 5 months ago

in general having question about testing on this

Updated with screenshots of the HDFS namenode that show the table directory and files after table creation and confirm that they are gone after the table is deleted.

jainlavina commented 5 months ago

in general having question about testing on this

Updated the testing details in the PR description.