oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.

[ETL] Incremental Pipeline #1001

Open · idiom-bytes opened this issue 1 month ago

idiom-bytes commented 1 month ago

Motivation

We have now verified that the basic lake functionality is working as expected.

We now want to verify the data quality and completeness.

This means running additional SQL queries, so that more tables are processed and richer data is generated.

Update Step - Incrementally updating the Lake

When you run the "lake update" command, a series of SQL queries is responsible for updating the lake with the most recent information.

  1. When the lake updates, new records have arrived that need to be processed.
  2. These new records (such as pdr_payout) should, where applicable: (a) be cleaned up into their raw/bronze table, and (b) update other tables to reflect the arriving event.
  3. After all records have been yielded to temp tables and the pipeline ends, the records should become available in the live/production tables.

(attached screenshot, 2024-05-06)

Data Workflows

All data workflows should operate in the same way.

  1. All data that needs to be written out is first written into a temp table.
  2. As temp tables are created with new data, views are made available so that downstream queries can access both old and new data from a single query.
  3. Once all the processes have completed and data is written out to temp tables, we can do a final merge/update of rows into the final/live/production tables (see the sketch below).
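
A minimal sketch of that temp-table / view / merge pattern, assuming made-up table names (the real lake schema will differ):

```sql
-- 1. New data is first written into a temp table with the same schema.
CREATE TABLE _temp_bronze_predictions AS
SELECT * FROM bronze_predictions WHERE 1 = 0;  -- same columns, no rows

-- 2. A view unions old + new rows, so downstream queries see one table.
CREATE VIEW _view_bronze_predictions AS
SELECT * FROM bronze_predictions
UNION ALL
SELECT * FROM _temp_bronze_predictions;

-- 3. After all processes complete, merge temp rows into the live table.
INSERT INTO bronze_predictions SELECT * FROM _temp_bronze_predictions;
DROP VIEW _view_bronze_predictions;
DROP TABLE _temp_bronze_predictions;
```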

DoD:

Task:

idiom-bytes commented 4 weeks ago

To implement this ticket, we should start by simply updating predictions when truevals and payouts show up.

[How this ticket grows]

In the future...

subscription event -> new subscription record

slot event -> new slot record

prediction event -> new bronze prediction record -> update bronze slot record

trueval event -> new trueval record -> update N bronze prediction records -> update 1 bronze slot record

payout event -> new payout record -> update N bronze prediction records -> update 1 bronze slot record
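
As a hedged sketch of the trueval cascade above, assuming illustrative table/column names (pdr_truevals, bronze_slots, slot_id, truevalue are guesses, not the actual schema):

```sql
-- trueval event: update the N bronze prediction records for that slot...
UPDATE bronze_predictions AS p
SET truevalue = t.truevalue
FROM pdr_truevals AS t
WHERE p.slot_id = t.slot_id
  AND p.truevalue IS NULL;

-- ...then update the 1 bronze slot record.
UPDATE bronze_slots AS s
SET truevalue = t.truevalue
FROM pdr_truevals AS t
WHERE s.slot_id = t.slot_id;
```

The payout cascade would look the same, joining payout rows back to their predictions and slot.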

idiom-bytes commented 4 weeks ago

Here is one way I'd approach it...

  1. We update the ETL so tables aren't attached to queries; it's just a set of queries.
  2. We then add multiple queries and another flow ("_update" tables) so we can reconcile everything.

1 - Process predictions

2 - Process payouts

3 - Reduce updates

Although this takes a couple of extra steps, the overall number of rows scanned/computed/joined is far lower, increasing the overall performance of the workflow.

Most of this work should look like SQL queries, plus a swap-logic update at the end of the ETL update logic (sketched below).
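
One plausible shape for that swap logic, with all table names illustrative: build the merged result next to the live table, then swap it in atomically.

```sql
BEGIN TRANSACTION;

-- Merged result: live rows that weren't touched + the reconciled updates.
CREATE TABLE _swap_bronze_predictions AS
SELECT * FROM bronze_predictions
WHERE ID NOT IN (SELECT ID FROM _update_bronze_predictions)
UNION ALL
SELECT * FROM _update_bronze_predictions;

-- Swap the new table in place of the old one.
DROP TABLE bronze_predictions;
ALTER TABLE _swap_bronze_predictions RENAME TO bronze_predictions;

COMMIT;
```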

1 - Extract prediction update events from trueval

(attached screenshot: 1_extract_truevals)
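
The attached screenshot holds the real query; a rough guess at its shape, with invented names:

```sql
-- New truevals -> one "prediction update" row per affected prediction.
CREATE TABLE _update_predictions_from_truevals AS
SELECT
  p.ID,
  t.truevalue,
  t.timestamp AS trueval_timestamp
FROM pdr_truevals AS t
JOIN bronze_predictions AS p ON p.slot_id = t.slot_id
WHERE t.timestamp >= 0;  -- in practice: only truevals since the last checkpoint
```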

2 - Extract prediction update events from payout

(attached screenshot: 2_extract_payouts)
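
Again, a guess at the shape of the screenshot's query, with invented names:

```sql
-- New payouts -> one "prediction update" row per affected prediction.
-- Assumes payout rows share an ID with their prediction; if not,
-- join on (contract, slot, user) instead.
CREATE TABLE _update_predictions_from_payouts AS
SELECT
  p.ID,
  y.payout,
  y.timestamp AS payout_timestamp
FROM pdr_payouts AS y
JOIN bronze_predictions AS p ON p.ID = y.ID
WHERE y.timestamp >= 0;  -- in practice: only payouts since the last checkpoint
```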

3 - Prepare prediction updates and merge to final table

(attached screenshot: 3_update_to_final)
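
A sketch of this last step under the same assumed names: fold both update streams into one row per prediction, then merge into the final table.

```sql
-- Reduce: one update row per prediction, across both sources.
CREATE TABLE _final_prediction_updates AS
SELECT
  coalesce(t.ID, y.ID) AS ID,
  t.truevalue,
  y.payout
FROM _update_predictions_from_truevals AS t
FULL OUTER JOIN _update_predictions_from_payouts AS y ON t.ID = y.ID;

-- Merge: apply the reduced updates to the final table.
UPDATE bronze_predictions AS p
SET truevalue = coalesce(u.truevalue, p.truevalue),
    payout    = coalesce(u.payout,    p.payout)
FROM _final_prediction_updates AS u
WHERE p.ID = u.ID;
```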

All prediction events per source

Note that in the end, we should expect a smaller number of payouts relative to predictions made, and a lot of bronze_predictions rows with null payouts. But 100% of all payouts should be registered in the bronze_predictions table.

(attached screenshot: 4_all_prediction_events)
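
That invariant is easy to check with a query along these lines (names illustrative):

```sql
SELECT
  (SELECT count(*) FROM pdr_payouts) AS n_payouts,
  (SELECT count(*) FROM bronze_predictions
   WHERE payout IS NOT NULL)         AS n_predictions_with_payout;
-- Expect the two counts to match; the many remaining bronze_predictions
-- rows will legitimately have a NULL payout.
```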

idiom-bytes commented 3 weeks ago

[Feedback for Mustafa] The notes below are with reference to the code/design provided, as I explained to Mustafa after reviewing his proposal.

[Effective Processing of Events] I have instead done a pseudo-implementation of the SQL queries + logic required to get this working.

[Simplify Requirements Even Further] I have also emphasized how much simpler all of this can be: we can deliver on the goal of a predictoor revenue dashboard without requiring the trueval table.

Trueval does not contain the user id anywhere, so it cannot update the prediction table directly. It would first need to update a slot, and we don't care about that at the moment.

Literally, all we need is to join payouts with predictions. The rest can come later.
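
As a sketch of that minimal join, enough to back a predictoor revenue view (column names like "user", stake, payout are assumptions, as is defining net revenue as payout minus stake):

```sql
SELECT
  p."user",
  sum(p.stake)  AS total_stake,
  sum(y.payout) AS total_payout,
  sum(y.payout) - sum(p.stake) AS net_revenue
FROM pdr_payouts AS y
JOIN pdr_predictions AS p ON p.ID = y.ID
GROUP BY p."user";
```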