oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.
Apache License 2.0

[Lake][DuckDB] Verify lake functionality and behavior is working as expected #1000

Closed: idiom-bytes closed this issue 5 months ago

idiom-bytes commented 5 months ago

Motivation

To verify that the lake works as intended, we need to improve its basic reliability and stability. The basic DuckDB behavior needs to work as expected.

We should verify things are working by keeping it simple and focusing on the bronze_pdr_predictions table. I recommend that we ignore pdr-subscriptions, pdr-slots, and possibly other tables so we can validate that the lake is behaving as expected.


Verification - Inserting data into the lake and manipulating it

When you first start interacting with the lake, there will be a large fetch/update step that tries to build everything into the lake. As these records are processed, we begin inserting them into our DB.

  1. The user should edit their ppss so that st_ts and end_ts define the lake's start and end time.
  2. The user should run the lake update command to start fetching data and fill the whole lake.
  3. At any point, the user should be able to pause/cancel and resume without any errors.
  4. Show the user how to set end_ts to "Now" so the lake continues to update forever.

Once the lake is built, it's very likely that many records will have null entries as they are initially inserted into the database. We are not worried about this for the moment.
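
For a quick look at how many of those initially-null entries remain, one could query the DuckDB file directly. Below is a minimal sketch, assuming a local database path of `lake_data/duckdb.db` and example column names (`truevalue`, `payout`) that may not match the actual schema:

```python
import duckdb

# Hedged sketch: count bronze_pdr_predictions rows that still have null fields
# after the initial lake build. The database path and column names are
# assumptions for illustration, not taken from the repo.
con = duckdb.connect("lake_data/duckdb.db")
total, null_truevals, null_payouts = con.execute(
    """
    SELECT
      COUNT(*),
      COUNT(*) FILTER (WHERE truevalue IS NULL),
      COUNT(*) FILTER (WHERE payout IS NULL)
    FROM bronze_pdr_predictions
    """
).fetchone()
print(f"total={total} null_truevals={null_truevals} null_payouts={null_payouts}")
con.close()
```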

Test - Cutting off the lake (dropping)

Let's first consider how our lake works. A certain amount of data and events arrive that need to be processed. Each time we do a run, we update a certain number of records.

|      | Run 1 | Run 2 | Run 3 |
|------|-------|-------|-------|
| Time | 1:00  | 2:00  | 3:00  |

Let's say we wanted to drop everything since Run 1. We would call our cli drop command, and get rid of that data.

```
pdr lake drop 10000001 my_ppss.yaml sapphire-mainnet
```

This might be the equivalent of dropping all records from Run 1 -> End, which would include the data from [Run 2, Run 3].
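
For illustration only, here is a minimal DuckDB sketch of what that cut-off could look like if done by hand. The database path, the table list, and treating the CLI argument as a timestamp cut-off on a `timestamp` column are assumptions, not the actual `pdr lake drop` implementation:

```python
import duckdb

# Hedged sketch only: delete every row at or after a cut-off timestamp, which is
# roughly what "drop since Run 1" means. Path, table names, and interpreting the
# CLI argument as a timestamp cut-off are illustrative assumptions.
CUTOFF = 10000001  # same value passed to `pdr lake drop` above

con = duckdb.connect("lake_data/duckdb.db")
for table in ["pdr_predictions", "pdr_payouts", "bronze_pdr_predictions"]:
    con.execute(f"DELETE FROM {table} WHERE timestamp >= ?", [CUTOFF])
con.close()
```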

The user would then continue updating the lake by calling `pdr lake update ...`, which would refetch and rebuild [Run 2, Run 3], getting the system up to date, and then continue on from there.

Verifying

After dropping/cutting off part of the lake, we expect every table to have the same cut-off (rows dropped past the same point), so that the data pipeline can resume from there and all tables can be updated/resumed from the same "height". A query like the one below can check this:

```sql
WITH
max_pdr_predictions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_predictions
),
max_pdr_payouts AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_payouts
),
max_pdr_subscriptions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_subscriptions
),
max_bronze_predictions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM bronze_pdr_predictions
),
max_bronze_slots AS (
  SELECT MAX(timestamp) AS max_timestamp FROM bronze_pdr_slots
)
SELECT
  'pdr_predictions' AS table_name, max_pdr_predictions.max_timestamp
FROM max_pdr_predictions
UNION ALL
SELECT
  'pdr_payouts' AS table_name, max_pdr_payouts.max_timestamp
FROM max_pdr_payouts
UNION ALL
SELECT
  'pdr_subscriptions' AS table_name, max_pdr_subscriptions.max_timestamp
FROM max_pdr_subscriptions
UNION ALL
SELECT
  'bronze_pdr_predictions' AS table_name, max_bronze_predictions.max_timestamp
FROM max_bronze_predictions
UNION ALL
SELECT
  'bronze_pdr_slots' AS table_name, max_bronze_slots.max_timestamp
FROM max_bronze_slots;
```

DoD (Definition of Done)

Testing Data Pipeline Behavior

We need to verify that the basic workflows for inserting data are working. You should be able to do this step-by-step and have the lake and tables working as expected.

Core Components - Raw Table

Core Components - ETL Table

trentmc commented 5 months ago

Issue 1000! :)

KatunaNorbert commented 5 months ago

> pdr-slots, pdr-subscriptions, and other tables should be removed from the main etl-flow for now

Can we keep these tables so we have all the raw tables working? I don't see how they could slow us down.

idiom-bytes commented 5 months ago

@KatunaNorbert they have been slowing us down in testing, iteration, and many other things.

Objective Before: We implemented them because we wanted to move many things in parallel.

Objective Now: We want to pause them for now so we can verify things in order.

idiom-bytes commented 5 months ago

> checkpoint is identifying the right places to st_ts and end_ts

Yes, I have reviewed the code end-to-end.

[Fetching GQL data from the right place]

[Preloading from CSV for SQL]

[Fetch all the way to the end]

> - [x] you can stop/cancel/resume/pause, and things resume correctly and reliably

Yes, I have reviewed the code end-to-end.

> - [x] the tables and records are being filled/appended correctly

All the data from GQL is written to temp tables first, and the whole job needs to complete successfully before rows are added to DuckDB.
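
A hedged sketch of that staging pattern with DuckDB: rows are collected into a temp table first and only moved into the final table once the whole job has succeeded. The table names, database path, and exact staging mechanics are assumptions, not the code in this repo:

```python
import duckdb

con = duckdb.connect("lake_data/duckdb.db")

# Stage freshly fetched rows in a temp table with the same schema.
con.execute(
    "CREATE TEMP TABLE temp_pdr_predictions AS "
    "SELECT * FROM pdr_predictions LIMIT 0"
)
# ... the fetch job inserts GQL rows into temp_pdr_predictions here ...

# Only after the whole job succeeds, move the staged rows in atomically.
con.execute("BEGIN TRANSACTION")
con.execute("INSERT INTO pdr_predictions SELECT * FROM temp_pdr_predictions")
con.execute("DROP TABLE temp_pdr_predictions")
con.execute("COMMIT")
con.close()
```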

I believe this is working correctly. [Screenshot from 2024-05-13 14-36-17]

I believe both of these to be correct

KatunaNorbert commented 5 months ago

Issues:

KatunaNorbert commented 5 months ago

Fetching the data on Sapphire testnet is not working due to a subgraph issue with the payout data query, which is described in this issue: #768

idiom-bytes commented 5 months ago

Updates in the latest PR are working well: https://github.com/oceanprotocol/pdr-backend/pull/1077

Basically, tables are starting and ending at the same time, reliably across all 4 initial tables (predictions, truevals, payouts, and bronze_predictions). The number of rows/records looks correct too. [Screenshot from 2024-05-21 09-10-28]
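
A small hedged check along the same lines, printing start/end timestamps and row counts for the four tables (the database path and exact table names are assumptions):

```python
import duckdb

TABLES = ["pdr_predictions", "pdr_truevals", "pdr_payouts", "bronze_pdr_predictions"]

con = duckdb.connect("lake_data/duckdb.db")
for t in TABLES:
    st, end, n = con.execute(
        f"SELECT MIN(timestamp), MAX(timestamp), COUNT(*) FROM {t}"
    ).fetchone()
    print(f"{t}: start={st} end={end} rows={n}")
con.close()
```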

idiom-bytes commented 5 months ago

I created tickets where we discovered functionality is missing. I'm closing this ticket, as we have been able to harden the lake end-to-end and the core objectives of this ticket have been achieved.