oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.

[Lake][ETL] Extend Lake CLI to provide interface, and strengthen SLA #703

Closed: idiom-bytes closed this issue 3 months ago

idiom-bytes commented 6 months ago

Motivation

Right now, users must delete files manually or print `table.df.head()` to get a sense of what they are working with in the lake. And because there is no checkpoint (ppss.yaml is the checkpoint; ref issue #694), the ETL has to rebuild everything rather than operate incrementally, regardless of how `ppss.lake.st_ts` and `ppss.lake.end_ts` are configured.

To help improve the SLA and interface with the Lake, the following CLI commands and interface may help, so that we can introduce concepts to our ETL tables like rolling the data back to an older state, rebuilding, and continuing incrementally.

This may also help enforce basic concepts like: Part 1: save fetched raw data locally, then load what's needed. Part 2: use ETL to extract what you need.

How will the ETL enforce its SLAs?

[Checkpoint] Everything before the checkpoint is considered valid. Everything after the checkpoint is considered null.
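A minimal sketch of that rollback semantic, assuming a DuckDB lake where each ETL table carries a `timestamp_ms` column; the table layout here is an assumption for illustration, not the repo's actual schema:

```python
import duckdb

def rollback_to_checkpoint(db_path: str, table: str, checkpoint_ms: int) -> None:
    """Drop every row after the checkpoint so the ETL can rebuild from there.

    Rows at or before the checkpoint are left untouched (considered valid).
    """
    con = duckdb.connect(db_path)
    # Hypothetical schema: each ETL table carries a timestamp_ms column.
    con.execute(f"DELETE FROM {table} WHERE timestamp_ms > ?", [checkpoint_ms])
    con.close()
```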

[Process 1 - Ingest & Load]

  1. All raw data is saved to disk (CSV), and we only fetch/append what's new.
  2. We store the raw data we need for ETL in DuckDB (types may change; example: UnixTimeMs).
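A minimal sketch of both steps, assuming a single OHLCV feed; the paths, column names, and `raw_ohlcv` table are illustrative, not pdr-backend's actual layout:

```python
import os

import duckdb
import polars as pl

RAW_CSV = "lake_data/raw_ohlcv.csv"  # hypothetical path
DB_PATH = "lake_data/lake.duckdb"    # hypothetical path

def append_new_raw_rows(fetched: pl.DataFrame) -> None:
    """Step 1: persist raw data to CSV, appending only rows newer than on disk."""
    if os.path.exists(RAW_CSV):
        on_disk = pl.read_csv(RAW_CSV)
        newest = on_disk["timestamp_ms"].max()
        fetched = fetched.filter(pl.col("timestamp_ms") > newest)
        on_disk = pl.concat([on_disk, fetched])
    else:
        on_disk = fetched
    on_disk.write_csv(RAW_CSV)

def load_raw_into_duckdb() -> None:
    """Step 2: (re)load the raw CSV into DuckDB, casting types where needed."""
    con = duckdb.connect(DB_PATH)
    con.execute("DROP TABLE IF EXISTS raw_ohlcv")
    con.execute(
        """
        CREATE TABLE raw_ohlcv AS
        SELECT CAST(timestamp_ms AS BIGINT) AS timestamp_ms,  -- UnixTimeMs
               open, high, low, close, volume
        FROM read_csv_auto('lake_data/raw_ohlcv.csv')
        """
    )
    con.close()
```

Dropping and recreating the DuckDB table is the simplest way to keep it consistent with the CSV; an incremental `INSERT` becomes possible once the checkpoint exists.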

[Process 2 - ETL]

  1. ETL is then used to calculate whatever info we need.
  2. Devs/Data Engineers can update the ETL logic, use the CLI to manage (2)(3), then rebuild using the original data (1).
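A sketch of what a rebuild could look like, assuming the `raw_ohlcv` table from Process 1; the hourly rollup here is just an illustrative derived table, not a table pdr-backend actually builds:

```python
import duckdb

def rebuild_etl_tables(db_path: str = "lake_data/lake.duckdb") -> None:
    """Recompute a derived table from the raw data loaded in Process 1.

    Because the raw data is preserved on disk, updated ETL logic can always
    be re-run against the original rows instead of re-fetching from source.
    """
    con = duckdb.connect(db_path)
    con.execute("DROP TABLE IF EXISTS ohlcv_hourly")  # hypothetical derived table
    con.execute(
        """
        CREATE TABLE ohlcv_hourly AS
        SELECT (timestamp_ms // 3600000) * 3600000  AS hour_ms,
               first(open ORDER BY timestamp_ms)    AS open,
               max(high)                            AS high,
               min(low)                             AS low,
               last(close ORDER BY timestamp_ms)    AS close,
               sum(volume)                          AS volume
        FROM raw_ohlcv
        GROUP BY hour_ms
        ORDER BY hour_ms
        """
    )
    con.close()
```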


ETL CLI

The ETL CLI helps you interact with the checkpoint and ppss.yaml to manage the ETL job.
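A sketch of what that CLI surface could look like, using argparse; the command names and arguments are assumptions for illustration, not the shipped pdr-backend interface:

```python
import argparse

def build_lake_cli() -> argparse.ArgumentParser:
    """Illustrative CLI surface for managing the lake/ETL job."""
    parser = argparse.ArgumentParser(prog="pdr-lake")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("describe", help="summarize tables, row counts, and the checkpoint")

    drop = sub.add_parser("drop", help="roll ETL tables back to an earlier state")
    drop.add_argument("st_ts", help="drop all rows at or after this UnixTimeMs")

    update = sub.add_parser("update", help="fetch new raw data, then run ETL incrementally")
    update.add_argument("ppss_file", help="path to ppss.yaml")

    return parser

if __name__ == "__main__":
    print(build_lake_cli().parse_args())
```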

DoD

idiom-bytes commented 3 months ago

This ticket is now resolved as an outcome of the work described above.