linkedin / openhouse

Open Control Plane for Tables in Data Lakehouse
https://www.openhousedb.org/
BSD 2-Clause "Simplified" License
294 stars 50 forks source link

Refactor: Remove FsStorageProvider from Table Service #104

Closed HotSushi closed 4 months ago

HotSushi commented 4 months ago

Summary

❗❗ The volume of this PR might be intimidating, please read through the description to make reviewing easier.

The primary goal of this PR is to remove FsStorageProvider from :services:tables (and additionally :tables-test-fixture)

To achieve this goal, we do: 1: Add additional methods to StorageClient interface

StorageClient {
   + String getRootPrefix()
   + String getEndpoint()
}

2: Add additional method to FileIOManager to reverse lookup a Storage for a FileIO

FileIOManager {
   + Storage getStorage(fileIO)
}

3: Map FsStorageProvider methods to StorageManager

# pattern 1
storageProvider.rootPath() ->  
    storageManager.getDefaultStorage().getClient().getRootPrefix()

# pattern 2
storageProvider.endPoint() ->  
    storageManager.getDefaultStorage().getClient().getEndPoint()

# pattern 3
storageProvider.storageClient() -> 
    if(ensureHDFS) {
          storageManager.getDefaultStorage().getClient().getNativeClient()
    } else {
          throw exception or log 
    }

4: Fix log for StorageType

Instead of printing objecthash, print the type properly, ie StorageType.Type(local)

👆 thats it

Is this the end state of refactoring? NO. The goal of the PR is limited to removing fsStorageProvider. In future PRs we'll address CTAS, fileSecurer, stripPrefix, defaultTableLocation etc, currently these have been tagged with TODO

Changes

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES (current_timestamp(), 'val1', 'val2')") res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("SELECT * FROM openhouse.db.tb").show() +--------------------+----+----+ | ts|col1|col2| +--------------------+----+----+ |2024-05-13 23:52:...|val1|val2| +--------------------+----+----+

scala> spark.sql("CREATE TABLE openhouse.db.tb0 AS SELECT 1").show() ++ || ++ ++

scala> spark.sql("SELECT * FROM openhouse.db.tb0").show() +---+ | 1| +---+ | 1| +---+

Audit events for these tests: {"eventTimestamp":"1715644331261","clusterName":"LocalHadoopCluster","databaseName":"db","tableName":"tb","user":"openhouse","operationType":"CREATE","operationStatus":"SUCCESS","currentTableRoot":"hdfs://namenode:9000/data/openhouse/db/tb-97df14b9-752e-4b7f-b540-5058c784994c/00000-d4aca2a1-c8bc-40eb-8449-5b70d2e24598.metadata.json","grantor":null,"grantee":null,"role":null}

{"eventTimestamp":"1715644370032","clusterName":"LocalHadoopCluster","databaseName":"db","tableName":"tb","user":"openhouse","operationType":"COMMIT","operationStatus":"SUCCESS","currentTableRoot":"hdfs://namenode:9000/data/openhouse/db/tb-97df14b9-752e-4b7f-b540-5058c784994c/00001-54ef9c91-4747-4852-ba5b-adfb0b4f6fb3.metadata.json","grantor":null,"grantee":null,"role":null}

{"eventTimestamp":"1715644425694","clusterName":"LocalHadoopCluster","databaseName":"db","tableName":"tb","user":"openhouse","operationType":"READ","operationStatus":"SUCCESS","currentTableRoot":"hdfs://namenode:9000/data/openhouse/db/tb-97df14b9-752e-4b7f-b540-5058c784994c/00001-54ef9c91-4747-4852-ba5b-adfb0b4f6fb3.metadata.json","grantor":null,"grantee":null,"role":null}


- [ ✅ ] Added new tests for the changes made.
- [ ✅ ] Updated existing tests to reflect the changes made.
- [ ✅ ] Some other form of testing like staging or soak time in production. Please explain.

# Additional Information
- [X] Deprecations
      - remove fsStorageProvider 
- [X] Large PR broken into smaller PRs, and PR plan linked in the description.
      - https://github.com/linkedin/openhouse/pull/100
HotSushi commented 4 months ago

cc: @ctrezzo