oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.
Apache License 2.0

[Lake][DuckDB] Verify lake functionality and behavior is working as expected #1000

Closed: idiom-bytes closed this issue 5 months ago

idiom-bytes commented 5 months ago

Motivation

To verify that the lake works as intended, we need to improve its basic reliability and stability. The basic DuckDB behavior needs to work as expected.

We should verify things are working by keeping it simple and focusing on the bronze_pdr_predictions table. I recommend that we ignore pdr-subscriptions, pdr-slots, and possibly other tables so we can validate that the lake is behaving as expected.


Verification - Inserting data into the lake and manipulating it

When you first start interacting with the lake, there will be a large fetch/update step that tries to build everything into the lake. As these records are processed, we begin inserting them into our DB.

  1. The user should edit their ppss so that st_ts and end_ts define the lake's start and end time.
  2. The user should run the lake update command to start fetching data and fill the whole lake.
  3. At any point, the user should be able to pause/cancel and resume without any errors.
  4. Show the user how to set end_ts to "Now" so the lake continues to update forever.

Once the lake is built, it's very likely that many records will have null entries as they are initially inserted into the database. We are not worried about this for the moment.
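
For a quick look at how many of those initially-null entries remain, one could query the DuckDB file directly. Below is a minimal sketch, assuming a local database path of `lake_data/duckdb.db` and example column names (`truevalue`, `payout`) that may not match the actual schema:

```python
import duckdb

# Hedged sketch: count bronze_pdr_predictions rows that still have null fields
# after the initial lake build. The database path and column names are
# assumptions for illustration, not taken from the repo.
con = duckdb.connect("lake_data/duckdb.db")
total, null_truevals, null_payouts = con.execute(
    """
    SELECT
      COUNT(*),
      COUNT(*) FILTER (WHERE truevalue IS NULL),
      COUNT(*) FILTER (WHERE payout IS NULL)
    FROM bronze_pdr_predictions
    """
).fetchone()
print(f"total={total} null_truevals={null_truevals} null_payouts={null_payouts}")
con.close()
```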

Test - Cutting off the lake (dropping)

Let's first consider how our lake works. A certain amount of data and events arrive that need to be processed. Each time we do a run, we update a certain number of records.

|      | Run 1 | Run 2 | Run 3 |
|------|-------|-------|-------|
| Time | 1:00  | 2:00  | 3:00  |

Let's say we wanted to drop everything since Run 1. We would call our cli drop command, and get rid of that data.

```
pdr lake drop 10000001 my_ppss.yaml sapphire-mainnet
```

This might be the equivalent of dropping all records from Run 1 -> End, which would include the data from [Run 2, Run 3].
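
For illustration only, here is a minimal DuckDB sketch of what that cut-off could look like if done by hand. The database path, the table list, and treating the CLI argument as a timestamp cut-off on a `timestamp` column are assumptions, not the actual `pdr lake drop` implementation:

```python
import duckdb

# Hedged sketch only: delete every row at or after a cut-off timestamp, which is
# roughly what "drop since Run 1" means. Path, table names, and interpreting the
# CLI argument as a timestamp cut-off are illustrative assumptions.
CUTOFF = 10000001  # same value passed to `pdr lake drop` above

con = duckdb.connect("lake_data/duckdb.db")
for table in ["pdr_predictions", "pdr_payouts", "bronze_pdr_predictions"]:
    con.execute(f"DELETE FROM {table} WHERE timestamp >= ?", [CUTOFF])
con.close()
```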

The user would then continue updating the lake by calling `pdr lake update ...`, which would refetch and rebuild [Run 2, Run 3], getting the system up to date, and then continue on from there.

Verifying

After dropping/cutting off part of the lake, we expect every table to have the same cut-off (rows dropped past the same point), so that the data pipeline can resume from there and all tables can be updated/resumed from the same "height". A query like the one below can check this:

```sql
WITH
max_pdr_predictions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_predictions
),
max_pdr_payouts AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_payouts
),
max_pdr_subscriptions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM pdr_subscriptions
),
max_bronze_predictions AS (
  SELECT MAX(timestamp) AS max_timestamp FROM bronze_pdr_predictions
),
max_bronze_slots AS (
  SELECT MAX(timestamp) AS max_timestamp FROM bronze_pdr_slots
)
SELECT
  'pdr_predictions' AS table_name, max_pdr_predictions.max_timestamp
FROM max_pdr_predictions
UNION ALL
SELECT
  'pdr_payouts' AS table_name, max_pdr_payouts.max_timestamp
FROM max_pdr_payouts
UNION ALL
SELECT
  'pdr_subscriptions' AS table_name, max_pdr_subscriptions.max_timestamp
FROM max_pdr_subscriptions
UNION ALL
SELECT
  'bronze_pdr_predictions' AS table_name, max_bronze_predictions.max_timestamp
FROM max_bronze_predictions
UNION ALL
SELECT
  'bronze_pdr_slots' AS table_name, max_bronze_slots.max_timestamp
FROM max_bronze_slots;
```

DoD (Definition of Done)

Testing Data Pipeline Behavior

We need to verify that the basic workflows for inserting data are working. You should be able to do this step-by-step and have the lake and tables working as expected.

Core Components - Raw Table

Core Components - ETL Table

trentmc commented 5 months ago

Issue 1000! :)

KatunaNorbert commented 5 months ago

> pdr-slots, pdr-subscriptions, and other tables should be removed from the main etl-flow for now

Can we keep these tables so we have all the raw tables working? I don't see how they could slow us down.

idiom-bytes commented 5 months ago

@KatunaNorbert they have been slowing us down in testing, iteration, and many other things.

Objective Before: We implemented them because we wanted to move many things in parallel.

Objective Now: We want to pause them for now so we can verify things in order.

idiom-bytes commented 5 months ago

> checkpoint is identifying the right places to st_ts and end_ts

Yes, I have reviewed the code end-to-end.

[Fetching GQL data from the right place]

[Preloading from CSV for SQL]

[Fetch all the way to the end]

> - [x] you can stop/cancel/resume/pause, and things resume correctly and reliably

Yes, I have reviewed the code end-to-end.

> - [x] the tables and records are being filled/appended correctly

All the data from GQL is written to temp tables first, and the whole job needs to complete successfully before rows are added to DuckDB.
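
A hedged sketch of that staging pattern with DuckDB: rows are collected into a temp table first and only moved into the final table once the whole job has succeeded. The table names, database path, and exact staging mechanics are assumptions, not the code in this repo:

```python
import duckdb

con = duckdb.connect("lake_data/duckdb.db")

# Stage freshly fetched rows in a temp table with the same schema.
con.execute(
    "CREATE TEMP TABLE temp_pdr_predictions AS "
    "SELECT * FROM pdr_predictions LIMIT 0"
)
# ... the fetch job inserts GQL rows into temp_pdr_predictions here ...

# Only after the whole job succeeds, move the staged rows in atomically.
con.execute("BEGIN TRANSACTION")
con.execute("INSERT INTO pdr_predictions SELECT * FROM temp_pdr_predictions")
con.execute("DROP TABLE temp_pdr_predictions")
con.execute("COMMIT")
con.close()
```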

I believe this is working correctly. [Screenshot from 2024-05-13 14-36-17]

I believe both of these to be correct

KatunaNorbert commented 5 months ago

Issues:

KatunaNorbert commented 5 months ago

Fetching the data on Sapphire testnet is not working due to a subgraph issue with the payout data query, which is described in this issue: #768

idiom-bytes commented 5 months ago

Updates in the latest PR are working well: https://github.com/oceanprotocol/pdr-backend/pull/1077

Basically, tables are starting and ending at the same time, reliably across all 4 initial tables (predictions, truevals, payouts, and bronze_predictions). The number of rows/records looks correct too. [Screenshot from 2024-05-21 09-10-28]
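
A small hedged check along the same lines, printing start/end timestamps and row counts for the four tables (the database path and exact table names are assumptions):

```python
import duckdb

TABLES = ["pdr_predictions", "pdr_truevals", "pdr_payouts", "bronze_pdr_predictions"]

con = duckdb.connect("lake_data/duckdb.db")
for t in TABLES:
    st, end, n = con.execute(
        f"SELECT MIN(timestamp), MAX(timestamp), COUNT(*) FROM {t}"
    ).fetchone()
    print(f"{t}: start={st} end={end} rows={n}")
con.close()
```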

idiom-bytes commented 5 months ago

I created tickets where we discovered functionality is missing. I'm closing this ticket, as we have been able to harden the lake end-to-end and the core objectives of this ticket have been achieved.