oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.

[Lake][ETL] Extend Lake CLI to provide interface, and strengthen SLA #703

Closed: idiom-bytes closed this issue 3 months ago

idiom-bytes commented 6 months ago

Motivation

Right now, users must delete files manually or print `table.df.head()` to get a sense of what they are working with in the lake. And because there is no checkpoint (ppss.yaml is the checkpoint; ref issue #694), the ETL has to rebuild everything rather than operate incrementally, regardless of how `ppss.lake.st_ts` and `ppss.lake.end_ts` are configured.

To help improve the SLA and interface with the Lake, the following CLI commands and interface may help, so that we can introduce concepts to our ETL tables like rolling the data back to an older state, rebuilding, and continuing incrementally.

This may also help enforce basic concepts like: Part 1: save fetched raw data locally, then load what's needed. Part 2: use ETL to extract what you need.

How will the ETL enforce its SLAs?

[Checkpoint] Everything before the checkpoint is considered valid. Everything after the checkpoint is considered null.
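A minimal sketch of that rollback semantic, assuming a DuckDB lake where each ETL table carries a `timestamp_ms` column; the table layout here is an assumption for illustration, not the repo's actual schema:

```python
import duckdb

def rollback_to_checkpoint(db_path: str, table: str, checkpoint_ms: int) -> None:
    """Drop every row after the checkpoint so the ETL can rebuild from there.

    Rows at or before the checkpoint are left untouched (considered valid).
    """
    con = duckdb.connect(db_path)
    # Hypothetical schema: each ETL table carries a timestamp_ms column.
    con.execute(f"DELETE FROM {table} WHERE timestamp_ms > ?", [checkpoint_ms])
    con.close()
```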

[Process 1 - Ingest & Load]

  1. All raw data is saved to disk (CSV), and we only fetch/append what's new.
  2. We store the raw data we need for ETL in DuckDB (types may change; example: UnixTimeMs).
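A minimal sketch of both steps, assuming a single OHLCV feed; the paths, column names, and `raw_ohlcv` table are illustrative, not pdr-backend's actual layout:

```python
import os

import duckdb
import polars as pl

RAW_CSV = "lake_data/raw_ohlcv.csv"  # hypothetical path
DB_PATH = "lake_data/lake.duckdb"    # hypothetical path

def append_new_raw_rows(fetched: pl.DataFrame) -> None:
    """Step 1: persist raw data to CSV, appending only rows newer than on disk."""
    if os.path.exists(RAW_CSV):
        on_disk = pl.read_csv(RAW_CSV)
        newest = on_disk["timestamp_ms"].max()
        fetched = fetched.filter(pl.col("timestamp_ms") > newest)
        on_disk = pl.concat([on_disk, fetched])
    else:
        on_disk = fetched
    on_disk.write_csv(RAW_CSV)

def load_raw_into_duckdb() -> None:
    """Step 2: (re)load the raw CSV into DuckDB, casting types where needed."""
    con = duckdb.connect(DB_PATH)
    con.execute("DROP TABLE IF EXISTS raw_ohlcv")
    con.execute(
        """
        CREATE TABLE raw_ohlcv AS
        SELECT CAST(timestamp_ms AS BIGINT) AS timestamp_ms,  -- UnixTimeMs
               open, high, low, close, volume
        FROM read_csv_auto('lake_data/raw_ohlcv.csv')
        """
    )
    con.close()
```

Dropping and recreating the DuckDB table is the simplest way to keep it consistent with the CSV; an incremental `INSERT` becomes possible once the checkpoint exists.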

[Process 2 - ETL]

  1. ETL is then used to calculate whatever info we need.
  2. Devs/Data Engineers can update the ETL logic, use the CLI to manage (2)(3), then rebuild using the original data (1).
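A sketch of what a rebuild could look like, assuming the `raw_ohlcv` table from Process 1; the hourly rollup here is just an illustrative derived table, not a table pdr-backend actually builds:

```python
import duckdb

def rebuild_etl_tables(db_path: str = "lake_data/lake.duckdb") -> None:
    """Recompute a derived table from the raw data loaded in Process 1.

    Because the raw data is preserved on disk, updated ETL logic can always
    be re-run against the original rows instead of re-fetching from source.
    """
    con = duckdb.connect(db_path)
    con.execute("DROP TABLE IF EXISTS ohlcv_hourly")  # hypothetical derived table
    con.execute(
        """
        CREATE TABLE ohlcv_hourly AS
        SELECT (timestamp_ms // 3600000) * 3600000  AS hour_ms,
               first(open ORDER BY timestamp_ms)    AS open,
               max(high)                            AS high,
               min(low)                             AS low,
               last(close ORDER BY timestamp_ms)    AS close,
               sum(volume)                          AS volume
        FROM raw_ohlcv
        GROUP BY hour_ms
        ORDER BY hour_ms
        """
    )
    con.close()
```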


ETL CLI

The ETL CLI helps you interact with the checkpoint and ppss.yaml to manage the ETL job.
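A sketch of what that CLI surface could look like, using argparse; the command names and arguments are assumptions for illustration, not the shipped pdr-backend interface:

```python
import argparse

def build_lake_cli() -> argparse.ArgumentParser:
    """Illustrative CLI surface for managing the lake/ETL job."""
    parser = argparse.ArgumentParser(prog="pdr-lake")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("describe", help="summarize tables, row counts, and the checkpoint")

    drop = sub.add_parser("drop", help="roll ETL tables back to an earlier state")
    drop.add_argument("st_ts", help="drop all rows at or after this UnixTimeMs")

    update = sub.add_parser("update", help="fetch new raw data, then run ETL incrementally")
    update.add_argument("ppss_file", help="path to ppss.yaml")

    return parser

if __name__ == "__main__":
    print(build_lake_cli().parse_args())
```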

DoD

idiom-bytes commented 3 months ago

This ticket is now resolved as an outcome of the work described above.