Closed idiom-bytes closed 5 months ago
I'm reviewing this and basically just getting a bit confused with our code...
TLDR; TableRegistry still plays an important part of the code base but recent additions have made the overall usage confusing.
[Considerations] (1) I find that Table, NamedTables, TableRegistry, responsibilities and usage is getting a bit burdensome
(2) Although there are benefits in making NamedTables
an object such as ETLTable
, so we can use pythonic syntax of isinstance(obj, ETLTable)
, the final result for how we use it elsewhere becomes burdensome... rather, please see the example below
(3) If you want an actual table dataclass, you need to go through the TableRegistry() which feels bloated
(4) really, I should be able to just call DuckDB + Table and start operating on the db/csv, which is what I was expecting to encounter
db = DuckDBDataStore(ppss)
predictions_table = Predictions(ppss)
# or
predictions_table = to_dataclass("predictions", ppss)
df1 = db.query_data("select * from %s limit 100;".format(table.name)
df2 = db.query_data("select * from %s limit 100;".format(table.name_temp)
df3 = db.query_data("select * from %s limit 100;".format(table.name_etl)
There were some abstractions that remained here since the days of polars, because we did not have a database and the ability to easily query records.... Now that everything is on DuckDB, we might want to clean up a few things here...
I agree with removing the TableRegistry
class, we are only using it inside GQL and the scope of it is to store the tables, but for that we could only have just a config constant that keeps track of the table names and fetch functions such as _GQLDF_REGISTERED_LAKE_TABLES
.
Regarding Table classes, there is a lot of logic around getting the name and other methods are working on top of duckdb, beside the _append_to_csv
functionality, which we could move somewhere else. So we could simply just remove the Table concept and use DuckDBDataStore
as I think @idiom-bytes was referring to above
Regarding Table classes, there is a lot of logic around getting the name and other methods are working on top of duckdb, beside the _append_to_csv functionality, which we could move somewhere else. So we could simply just remove the Table concept and use DuckDBDataStore as I think @idiom-bytes was referring to above
You're bang on @KatunaNorbert, but... baby steps. The Table concept is still useful for managing CSVs (folder/bucket) and Tables/LakeMappers are still important for us to manage schemas and to provide an interface to those objects. However, a lot of this could be simplified now that we have these workflows in place and actual DB w/ tables.
The requirements for this ticket have been addressed. Let's close the issue and revisit this later on.
Background / motivation
GQLDF, ETL, and NamedTables have been expanded to improve how tables are registered, referenced, and queried. As a result, the implementations made in TableRegistry look outdated and redundant.
I propose we deprecate them.
https://github.com/oceanprotocol/pdr-backend/commit/96233004a380223962c01c64b1d767d442069e90
TODOs / DoD
Deprecate TableRegistry
Tasks: