oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.
Apache License 2.0

[TableRegistry] Review/Deprecate TableRegistry after the latest updates to GQLDF, ETL, and NamedTables #1088

Closed: idiom-bytes closed this issue 5 months ago

idiom-bytes commented 5 months ago

Background / motivation

GQLDF, ETL, and NamedTables have been expanded to improve how tables are registered, referenced, and queried. As a result, the implementation in TableRegistry now looks outdated and redundant.

I propose we deprecate it.

https://github.com/oceanprotocol/pdr-backend/commit/96233004a380223962c01c64b1d767d442069e90

TODOs / DoD

Deprecate TableRegistry

Tasks:

idiom-bytes commented 5 months ago

I'm reviewing this and basically just getting a bit confused by our code...

TL;DR: TableRegistry still plays an important part in the codebase, but recent additions have made its overall usage confusing.

[Considerations] (1) I find that the responsibilities and usage of Table, NamedTables, and TableRegistry are getting a bit burdensome

(2) Although there are benefits to making NamedTables an object such as ETLTable, so we can use the pythonic isinstance(obj, ETLTable) syntax, the end result of how we use it elsewhere is burdensome; please see the example below

(3) If you want an actual table dataclass, you need to go through TableRegistry(), which feels bloated

(4) Really, I should be able to just call DuckDB + Table and start operating on the db/csv, which is what I was expecting to encounter

db = DuckDBDataStore(ppss)

predictions_table = Predictions(ppss)
# or
predictions_table = to_dataclass("predictions", ppss)

df1 = db.query_data("select * from {} limit 100;".format(predictions_table.name))
df2 = db.query_data("select * from {} limit 100;".format(predictions_table.name_temp))
df3 = db.query_data("select * from {} limit 100;".format(predictions_table.name_etl))
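
For what it's worth, a registry-free to_dataclass could be little more than a module-level dict lookup. This is only a sketch of the direction: it assumes table dataclasses take ppss in their constructor (as Predictions does in the example above), and Payouts is a hypothetical second entry for illustration.

# Hedged sketch: look up and instantiate a table dataclass without TableRegistry.
# Assumes classes like Predictions accept ppss; Payouts is hypothetical.
_TABLE_CLASSES = {
    "predictions": Predictions,
    "payouts": Payouts,  # hypothetical entry, for illustration only
}

def to_dataclass(table_key: str, ppss):
    """Return an instantiated table dataclass for the given key."""
    try:
        return _TABLE_CLASSES[table_key](ppss)
    except KeyError:
        raise ValueError(f"unknown table: {table_key}")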

Some abstractions have remained here since the polars-only days, when we did not have a database or an easy way to query records. Now that everything is on DuckDB, we might want to clean up a few things here.
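
As a point of reference, once the data sits in DuckDB the "just query it" path is very short. A minimal sketch using the duckdb Python client directly, where the file path and table name (lake_data/duckdb.db, pdr_predictions) are assumptions for illustration, not the repo's actual config:

import duckdb

# Hedged sketch: query the lake directly, bypassing any Table abstraction.
# The database path and table name below are assumed, not the real config values.
con = duckdb.connect("lake_data/duckdb.db")
df = con.execute("SELECT * FROM pdr_predictions LIMIT 100").df()  # .df() requires pandas
con.close()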

KatunaNorbert commented 5 months ago

Regarding the Table classes, there is a lot of logic around getting the name, and the other methods work on top of DuckDB, besides the _append_to_csv functionality, which we could move somewhere else. So we could simply remove the Table concept and use DuckDBDataStore, as I think @idiom-bytes was referring to above.

idiom-bytes commented 5 months ago

You're bang on @KatunaNorbert, but... baby steps. The Table concept is still useful for managing CSVs (folder/bucket), and Tables/LakeMappers remain important for managing schemas and providing an interface to those objects. However, a lot of this could be simplified now that we have these workflows in place and an actual DB with tables.
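
To make "still useful for CSVs and schemas" concrete, a slimmed-down Table could be little more than a dataclass that knows its names and schema and can append rows to CSV. This is a sketch of the direction, not the current implementation, and the field and method names below are assumptions:

from dataclasses import dataclass
import csv
import os

@dataclass
class SlimTable:
    # Hedged sketch of a minimal Table: names + schema + CSV append.
    # Field and method names are illustrative, not the repo's actual API.
    base_name: str
    schema: dict  # column name -> type string

    @property
    def name(self) -> str:
        return f"pdr_{self.base_name}"

    @property
    def name_temp(self) -> str:
        return f"_temp_{self.name}"

    @property
    def name_etl(self) -> str:
        return f"_etl_{self.name}"

    def append_to_csv(self, rows: list[dict], csv_dir: str) -> None:
        # Append rows to <csv_dir>/<table name>.csv, writing a header on first use.
        path = os.path.join(csv_dir, f"{self.name}.csv")
        write_header = not os.path.exists(path)
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(self.schema))
            if write_header:
                writer.writeheader()
            writer.writerows(rows)

Everything else (querying, temp/ETL table juggling) would then live next to DuckDBDataStore rather than behind a registry.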

The requirements for this ticket have been addressed. Let's close the issue and revisit this later on.