Motivation

Right now, users must delete files manually, or print table.df.head(), to get a sense of what they are working with in the lake. And because there is no checkpoint (ppss.yaml is the checkpoint - ref issue #694), ETL has to rebuild everything rather than operating incrementally, regardless of how ppss.lake.st_ts and ppss.lake.end_ts are configured.
To improve the SLA/interface with the lake, the following CLI commands and interface may help, by introducing concepts to our ETL tables like rolling the data back to an older state, rebuilding, and continuing incrementally.
This may also help enforce basic concepts like:
Part 1: Save fetched raw data locally, then load what's needed.
Part 2: Use ETL to extract what you need.
How will the ETL enforce its SLAs?
[Checkpoint]
Everything before the checkpoint is considered valid.
Everything after the checkpoint is considered null.
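As a sketch of that rule: rolling back just means dropping everything after the checkpoint so the next update can rebuild it. The helper below is hypothetical (not pdr-backend's actual API) and assumes the checkpoint is a UnixTimeMs value and every table has a timestamp column in ms.

```python
import duckdb

def rollback_to_checkpoint(db_path: str, table: str, checkpoint_ms: int) -> None:
    # Everything at/before the checkpoint stays valid; everything after
    # it is considered null, so drop it and let the next update rebuild it.
    con = duckdb.connect(db_path)
    con.execute(f"DELETE FROM {table} WHERE timestamp > ?", [checkpoint_ms])
    con.close()
```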
[Process 1 - Ingest & Load]
All raw data is saved to disk (CSV), and we only fetch/append what's new.
We store the raw data we need for ETL into DuckDB (types may change, e.g. UnixTimeMs). See the sketch below.
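A minimal sketch of that load step, assuming one CSV per raw table with a timestamp column in UnixTimeMs; the function name and schema handling are illustrative, not the real pdr-backend code.

```python
import duckdb

def load_new_raw_rows(con: duckdb.DuckDBPyConnection, table: str, csv_path: str) -> None:
    # Create the DuckDB table from the CSV's schema if it doesn't exist yet.
    con.execute(
        f"CREATE TABLE IF NOT EXISTS {table} AS "
        f"SELECT * FROM read_csv_auto('{csv_path}') WHERE 1=0"
    )
    # Only append what's new: find the newest row already loaded...
    have_max = con.execute(
        f"SELECT COALESCE(MAX(timestamp), -1) FROM {table}"
    ).fetchone()[0]
    # ...then insert only the CSV rows that come after it.
    con.execute(
        f"INSERT INTO {table} "
        f"SELECT * FROM read_csv_auto('{csv_path}') WHERE timestamp > ?",
        [have_max],
    )
```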
[Process 2 - ETL]
ETL is then used to calculate whatever info we need.
Devs/Data Engineers can update the ETL logic, use the CLI to manage the checkpoint and ETL tables (Process 2), then rebuild using the original raw data (Process 1). A sketch of an incremental ETL step follows below.
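A minimal sketch of an incremental ETL step, assuming a one-row etl_checkpoint table plus raw_ohlcv and etl_ohlcv_hourly tables; all names and the aggregation are illustrative only.

```python
import duckdb

def etl_update(con: duckdb.DuckDBPyConnection, end_ts_ms: int) -> None:
    # Everything up to the checkpoint is already valid, so recompute
    # only the (checkpoint, end_ts] window.
    ckpt = con.execute("SELECT ts_ms FROM etl_checkpoint").fetchone()[0]
    con.execute(
        """
        INSERT INTO etl_ohlcv_hourly
        SELECT date_trunc('hour', to_timestamp(timestamp / 1000)) AS hour,
               avg(close) AS avg_close
        FROM raw_ohlcv
        WHERE timestamp > ? AND timestamp <= ?
        GROUP BY 1
        """,
        [ckpt, end_ts_ms],
    )
    # Advance the checkpoint so the next run continues incrementally.
    con.execute("UPDATE etl_checkpoint SET ts_ms = ?", [end_ts_ms])
```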
ETL CLI
The ETL CLI helps you interact with the checkpoint and ppss.yaml to manage the ETL job. (A typical rollback-and-rebuild session is sketched after the command list.)
lake run - runs the lake update and ETL process indefinitely
lake raw drop - drops rows from DuckDB across all OHLCV tables from st_ts onwards; does not drop data from CSVs
lake raw update - if needed, ingests raw OHLCV and GQL data into CSVs using ppss.lake.st_ts and ppss.lake.end_ts, then updates the database
lake etl drop - drops rows from DuckDB across all ETL tables from st_ts onwards, and updates etl.checkpoint to st_ts
lake etl update - uses etl.checkpoint, ppss.lake.st_ts, and ppss.lake.end_ts to complete ETL end-to-end and update the database
lake describe - prints all tables [table summary (n_records, min_timestamp, max_timestamp), head/tail, etc.], validates schemas, and provides an overview of the lake
lake query - pipe a SQL query directly to your lake / PersistentStore
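For example, after changing ETL logic, a rollback-and-rebuild session might look like this (the ordering is illustrative; st_ts/end_ts come from ppss.yaml):

lake raw update - bring raw CSVs and raw DuckDB tables up to date
lake etl drop - roll ETL tables and etl.checkpoint back to st_ts
lake etl update - rebuild the ETL tables from the untouched raw data
lake describe - confirm n_records and min/max timestamps look right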
DoD
[x] Lake successfully updates end-to-end, incrementally, using checkpoint, st_ts, and end_ts = now (#883)
[x] From CLI, run the lake/etl.py update loop that goes end-to-end indefinitely
[x] From CLI, manipulate the checkpoint and rebuild the lake using an intuitive interface