oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.
Apache License 2.0

[Lake][ETL] Define ETL SLA for how it handles: rollbacks, resuming, & incremental. #667

Closed idiom-bytes closed 6 months ago

idiom-bytes commented 6 months ago

Motivation

  1. Right now, after the ETL starts and completes, the user needs to manually adjust ppss.yaml in order to run the pipeline incrementally.
  2. When the pipeline ends and the user restarts it, there is no clear guideline for whether we drop all the records and recreate them, or simply update what's there.

Preferably, once the pipeline completes, ppss.yaml has been updated with a checkpoint marking where the pipeline finished, and the next run resumes from there. In other words:

  1. Only update what hasn't been processed.
  2. Drop all records, and re-create them.

What's our SLA?

Service-Level-Agreement

  1. Keep ppss.yaml fixed. This means that all records and data that need to be overridden are dropped, and then recreated.
  2. Update ppss.yaml. This means that the pipeline is dumb, and will just resume from wherever ppss.yaml now points.
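
To make the two options concrete, here is a minimal sketch of what each one would imply for a single ETL run. Everything in it is hypothetical and for illustration only: `SlaMode`, `RunPlan`, `plan_run`, and the integer timestamps are not part of pdr-backend, and the "config start" value stands in for whatever start setting ppss.yaml actually holds.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SlaMode(Enum):
    # Hypothetical names for the two SLA options listed above.
    FIXED_CONFIG = "fixed_config"    # option 1: ppss.yaml stays fixed
    UPDATE_CONFIG = "update_config"  # option 2: ppss.yaml is moved forward


@dataclass
class RunPlan:
    drop_from: Optional[int]  # drop stored records at/after this timestamp; None = drop nothing
    fetch_from: int           # timestamp from which records are (re)built
    new_config_start: int     # start value the config should hold after the run


def plan_run(mode: SlaMode, config_start: int, now: int) -> RunPlan:
    """Describe what one ETL run does under the chosen (assumed) SLA option."""
    if mode is SlaMode.FIXED_CONFIG:
        # Config is untouched, so everything from config_start is dropped and rebuilt.
        return RunPlan(drop_from=config_start,
                       fetch_from=config_start,
                       new_config_start=config_start)
    # Config acts as the checkpoint: nothing is dropped, the run resumes
    # from config_start, and the config is advanced to `now` afterwards.
    return RunPlan(drop_from=None,
                   fetch_from=config_start,
                   new_config_start=now)
```

Under option 1 a re-run is idempotent but repeats work; under option 2 a re-run is cheap but trusts that everything before the checkpoint is already correct.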
idiom-bytes commented 6 months ago

I have created the following two tickets as a way to implement and enforce SLAs throughout the codebase. This will help to better define the ETL SLA and how it's enforced. The two basic concepts are:

  1. ETL CLI - https://github.com/oceanprotocol/pdr-backend/issues/703
  2. ETL Checkpoint - https://github.com/oceanprotocol/pdr-backend/issues/694

I'm therefore closing this issue as we continue to define how it works and how it's used.