linkedin / openhouse

Open Control Plane for Tables in Data Lakehouse
https://www.openhousedb.org/
BSD 2-Clause "Simplified" License
294 stars 50 forks

Introduce FileIOManager and FileIO implementations for HDFS and Local Storage #96

Closed HotSushi closed 4 months ago

HotSushi commented 5 months ago

Summary

Laying the foundations for storage, part 4: FileIOManager and FileIO implementations for HDFS and local storage.

The FileIOManager interface looks like:

interface FileIOManager {
  FileIO getFileIO(Type type);
}

This interface is accompanied by ConfigureFileIO, which sets up FileIOs for all "configured" storages.

We do not replace the existing FileIO instances to ensure production systems do not break.
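A minimal sketch of what such a manager could look like. This is illustrative only, not the PR's actual implementation: `StorageType`, `register`, and the concrete classes are assumed names, and the `FileIO` interface here is a stand-in for Iceberg's `org.apache.iceberg.io.FileIO`.

```java
import java.util.EnumMap;
import java.util.Map;

// Stand-in for Iceberg's org.apache.iceberg.io.FileIO.
interface FileIO {}

// Hypothetical enumeration of the configured storage backends.
enum StorageType { HDFS, LOCAL }

class HdfsFileIO implements FileIO {}
class LocalFileIO implements FileIO {}

// Resolves the FileIO instance for a configured storage type.
class FileIOManager {
  private final Map<StorageType, FileIO> fileIOs = new EnumMap<>(StorageType.class);

  // Called during configuration (the ConfigureFileIO step) for each configured storage.
  void register(StorageType type, FileIO fileIO) {
    fileIOs.put(type, fileIO);
  }

  FileIO getFileIO(StorageType type) {
    FileIO io = fileIOs.get(type);
    if (io == null) {
      throw new IllegalArgumentException("No FileIO configured for storage type: " + type);
    }
    return io;
  }
}
```

The map-based lookup means adding a new storage backend only requires registering another FileIO, without touching callers.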

To learn the motivation behind these changes please see this doc

What's the next plan

1) Deploy new services with new + old cluster yaml (along with new fileIOs and old fileIOs)

- storages
    - newconfs
- storage
    - oldconfs  

2) Make refactors / remove old usage safely (remove old fileIOs and use new fileIOs)

3) Switch to the new cluster yaml completely:

- storages
    - newconfs
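For illustration, the transitional cluster yaml in step 1 might carry both sections side by side. The key names below are guesses based on the outline above, not the actual schema:

```yaml
# Transitional cluster yaml (step 1): old and new sections coexist.
storages:        # new multi-storage configuration
  hdfs:
    # newconfs ...
  local:
    # newconfs ...
storage:         # old single-storage configuration, removed in step 3
  # oldconfs ...
```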

Changes

Testing Done

scala> spark.sql("CREATE TABLE openhouse.db.tb (ts timestamp, col1 string, col2 string) PARTITIONED BY (days(ts))").show()
++
||
++

scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES (date_sub(CAST(current_timestamp() as DATE), 30), 'val1', 'val2')")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("SELECT * FROM openhouse.db.tb").show()
+-------------------+----+----+
|                 ts|col1|col2|
+-------------------+----+----+
|2024-04-02 00:00:00|val1|val2|
+-------------------+----+----+

We can observe logs like:

INFO 9 --- [ main] c.l.o.c.s.h.HdfsStorageClient : Initializing storage client for type:..

sumedhsakdeo commented 4 months ago

LGTM. Looking forward to more information in the PR description about the cutover from the old config to the new config. It's not blocking, though.

sumedhsakdeo commented 4 months ago

Also, please check why the build is failing.