HotSushi closed this pull request 4 months ago.
Summary

Laying foundations for storage, part 4: `FileIOManager` and `FileIO` implementations for HDFS and Local.
The `FileIOManager` interface looks like:

```java
interface FileIOManager {
  FileIO getFileIO(Type type);
}
```
This interface is accompanied by `ConfigureFileIO`, which sets up FileIOs for all "configured" storages. We do not replace the existing FileIO instances, to ensure production systems do not break.
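A minimal sketch of the pattern described above, assuming a registry keyed by storage type (the `register` method, the `StorageType` enum values, and the stub `FileIO` classes are illustrative, not the PR's actual code; `FileIO` here is a stand-in for what is likely Iceberg's `org.apache.iceberg.io.FileIO`, stubbed so the example is self-contained):

```java
import java.util.EnumMap;
import java.util.Map;

// Stand-in for the real FileIO interface, to keep the sketch self-contained.
interface FileIO {}

// Illustrative storage types matching the two implementations in this PR.
enum StorageType { HDFS, LOCAL }

class HdfsFileIO implements FileIO {}
class LocalFileIO implements FileIO {}

// Hypothetical FileIOManager: resolves the FileIO configured for a storage type.
class FileIOManager {
  private final Map<StorageType, FileIO> fileIOs = new EnumMap<>(StorageType.class);

  // A configurer (ConfigureFileIO in the PR) would call this once per configured storage.
  void register(StorageType type, FileIO fileIO) {
    fileIOs.put(type, fileIO);
  }

  FileIO getFileIO(StorageType type) {
    FileIO fileIO = fileIOs.get(type);
    if (fileIO == null) {
      throw new IllegalArgumentException("No FileIO configured for storage type: " + type);
    }
    return fileIO;
  }
}
```

Keeping the lookup separate from construction is what lets the new FileIOs coexist with the old instances during the migration: callers ask the manager by type instead of holding a single global FileIO.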
To learn the motivation behind these changes, please see this doc.
What's the next plan

1) Deploy new services with the new + old cluster yaml (along with the new FileIOs and the old FileIOs):

```
storages:
  newconfs
storage:
  oldconfs
```

2) Make refactors / remove old usage safely (remove the old FileIOs and use the new FileIOs).

3) Switch to the new cluster yaml completely:

```
storages:
  newconfs
```
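The phase-1 layout above might look roughly like the following sketch, assuming the old single-storage section and the new multi-storage section simply coexist at the top level of the cluster yaml (everything below the two top-level keys is a placeholder, not the real schema):

```yaml
# Phase 1: old and new sections coexist so existing FileIO instances keep working.
storage:     # old single-storage section (kept until phase 3)
  # ... old confs ...
storages:    # new multi-storage section consumed by FileIOManager / ConfigureFileIO
  # ... new confs, e.g. one entry per configured storage type ...
```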
Testing Done

```shell
/infra/recipes/docker-compose/oh-hadoop-spark> docker compose up -d
/infra/recipes/docker-compose/oh-hadoop-spark> docker exec -it local.spark-master /bin/bash
```

```
scala> spark.sql("CREATE TABLE openhouse.db.tb (ts timestamp, col1 string, col2 string) PARTITIONED BY (days(ts))").show()
++
||
++
++

scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES (date_sub(CAST(current_timestamp() as DATE), 30), 'val1', 'val2')")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("SELECT * FROM openhouse.db.tb").show()
+-------------------+----+----+
|                 ts|col1|col2|
+-------------------+----+----+
|2024-04-02 00:00:00|val1|val2|
+-------------------+----+----+
```
We can observe logs like:

```
INFO 9 --- [ main] c.l.o.c.s.h.HdfsStorageClient : Initializing storage client for type:..
```
LGTM. Looking forward to more information about the cutover from the old config to the new config in the PR description. It's not blocking, though.

Also, please check why the build is failing.