oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.

[ETL] Incremental Pipeline #1001

Open · idiom-bytes opened this issue 1 month ago

idiom-bytes commented 1 month ago

Motivation

We have now verified that the basic lake functionality is working as expected.

We now want to verify the data quality and completeness.

This means running additional SQL queries, so that more tables are processed and richer data is generated.

Update Step - Incrementally updating the Lake

When you run the "lake update" command, a series of SQL queries is responsible for updating the lake with the most recent information.

  1. When the lake updates, new records have arrived that need to be processed.
  2. These new records (such as pdr_payout) should, where applicable: (a) be cleaned up into their raw/bronze table, and (b) update other tables to reflect the arriving event.
  3. After all records have been yielded to temp tables and the pipeline ends, the records should become available in the live/production tables.

(attached screenshot, 2024-05-06)

Data Workflows

All data workflows should operate in the same way.

  1. All data that needs to be written out is first written into a temp table.
  2. As temp tables are created with new data, views are made available so that downstream queries can access both old and new data from a single query.
  3. Once all the processes have completed and data is written out to temp tables, we can do a final merge/update of rows into the final/live/production tables (see the sketch below).
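
A minimal sketch of that temp-table / view / merge pattern, assuming made-up table names (the real lake schema will differ):

```sql
-- 1. New data is first written into a temp table with the same schema.
CREATE TABLE _temp_bronze_predictions AS
SELECT * FROM bronze_predictions WHERE 1 = 0;  -- same columns, no rows

-- 2. A view unions old + new rows, so downstream queries see one table.
CREATE VIEW _view_bronze_predictions AS
SELECT * FROM bronze_predictions
UNION ALL
SELECT * FROM _temp_bronze_predictions;

-- 3. After all processes complete, merge temp rows into the live table.
INSERT INTO bronze_predictions SELECT * FROM _temp_bronze_predictions;
DROP VIEW _view_bronze_predictions;
DROP TABLE _temp_bronze_predictions;
```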

DoD:

Task:

idiom-bytes commented 4 weeks ago

To implement this ticket, we should start by simply updating predictions when truevals and payouts show up.

[How this ticket grows]

In the future...

subscription event -> new subscription record

slot event -> new slot record

prediction event -> new bronze prediction record -> update bronze slot record

trueval event -> new trueval record -> update N bronze prediction records -> update 1 bronze slot record

payout event -> new payout record -> update N bronze prediction records -> update 1 bronze slot record
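
As a hedged sketch of the trueval cascade above, assuming illustrative table/column names (pdr_truevals, bronze_slots, slot_id, truevalue are guesses, not the actual schema):

```sql
-- trueval event: update the N bronze prediction records for that slot...
UPDATE bronze_predictions AS p
SET truevalue = t.truevalue
FROM pdr_truevals AS t
WHERE p.slot_id = t.slot_id
  AND p.truevalue IS NULL;

-- ...then update the 1 bronze slot record.
UPDATE bronze_slots AS s
SET truevalue = t.truevalue
FROM pdr_truevals AS t
WHERE s.slot_id = t.slot_id;
```

The payout cascade would look the same, joining payout rows back to their predictions and slot.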

idiom-bytes commented 4 weeks ago

Here is one way I'd approach it...

  1. We update the ETL so tables aren't attached to queries; it's just a set of queries.
  2. We then add multiple queries and another flow ("_update" tables) so we can reconcile everything.

1 - Process predictions

2 - Process payouts

3 - Reduce updates

Although this takes a couple of extra steps, the overall number of rows scanned/computed/joined is far lower, increasing the overall performance of the workflow.

Most of this work should look like SQL queries, plus a swap-logic update at the end of the ETL update logic (sketched below).
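
One plausible shape for that swap logic, with all table names illustrative: build the merged result next to the live table, then swap it in atomically.

```sql
BEGIN TRANSACTION;

-- Merged result: live rows that weren't touched + the reconciled updates.
CREATE TABLE _swap_bronze_predictions AS
SELECT * FROM bronze_predictions
WHERE ID NOT IN (SELECT ID FROM _update_bronze_predictions)
UNION ALL
SELECT * FROM _update_bronze_predictions;

-- Swap the new table in place of the old one.
DROP TABLE bronze_predictions;
ALTER TABLE _swap_bronze_predictions RENAME TO bronze_predictions;

COMMIT;
```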

1 - Extract prediction update events from trueval

(attached screenshot: 1_extract_truevals)
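
The attached screenshot holds the real query; a rough guess at its shape, with invented names:

```sql
-- New truevals -> one "prediction update" row per affected prediction.
CREATE TABLE _update_predictions_from_truevals AS
SELECT
  p.ID,
  t.truevalue,
  t.timestamp AS trueval_timestamp
FROM pdr_truevals AS t
JOIN bronze_predictions AS p ON p.slot_id = t.slot_id
WHERE t.timestamp >= 0;  -- in practice: only truevals since the last checkpoint
```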

2 - Extract prediction update events from payout

(attached screenshot: 2_extract_payouts)
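
Again, a guess at the shape of the screenshot's query, with invented names:

```sql
-- New payouts -> one "prediction update" row per affected prediction.
-- Assumes payout rows share an ID with their prediction; if not,
-- join on (contract, slot, user) instead.
CREATE TABLE _update_predictions_from_payouts AS
SELECT
  p.ID,
  y.payout,
  y.timestamp AS payout_timestamp
FROM pdr_payouts AS y
JOIN bronze_predictions AS p ON p.ID = y.ID
WHERE y.timestamp >= 0;  -- in practice: only payouts since the last checkpoint
```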

3 - Prepare prediction updates and merge to final table

(attached screenshot: 3_update_to_final)
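
A sketch of this last step under the same assumed names: fold both update streams into one row per prediction, then merge into the final table.

```sql
-- Reduce: one update row per prediction, across both sources.
CREATE TABLE _final_prediction_updates AS
SELECT
  coalesce(t.ID, y.ID) AS ID,
  t.truevalue,
  y.payout
FROM _update_predictions_from_truevals AS t
FULL OUTER JOIN _update_predictions_from_payouts AS y ON t.ID = y.ID;

-- Merge: apply the reduced updates to the final table.
UPDATE bronze_predictions AS p
SET truevalue = coalesce(u.truevalue, p.truevalue),
    payout    = coalesce(u.payout,    p.payout)
FROM _final_prediction_updates AS u
WHERE p.ID = u.ID;
```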

All prediction events per source

Note that in the end, we should expect a smaller number of payouts relative to predictions made, and a lot of bronze_predictions rows with null payouts. But 100% of all payouts should be registered in the bronze_predictions table.

(attached screenshot: 4_all_prediction_events)
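
That invariant is easy to check with a query along these lines (names illustrative):

```sql
SELECT
  (SELECT count(*) FROM pdr_payouts) AS n_payouts,
  (SELECT count(*) FROM bronze_predictions
   WHERE payout IS NOT NULL)         AS n_predictions_with_payout;
-- Expect the two counts to match; the many remaining bronze_predictions
-- rows will legitimately have a NULL payout.
```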

idiom-bytes commented 3 weeks ago

[Feedback for Mustafa] The notes below are with reference to the code/design provided, as I explained to Mustafa after reviewing his proposal.

[Effective Processing of Events] I have instead done a pseudo-implementation of the SQL queries + logic required to get this working.

[Simplify Requirements Even Further] I have also emphasized how much simpler all of this can be: we can deliver on the goal of a predictoor revenue dashboard without requiring the trueval table.

Trueval does not contain the user id anywhere, so it cannot update the prediction table directly. It would first need to update a slot, and we don't care about that at the moment.

Literally, all we need is to join payouts with predictions. The rest can come later.
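
As a sketch of that minimal join, enough to back a predictoor revenue view (column names like "user", stake, payout are assumptions, as is defining net revenue as payout minus stake):

```sql
SELECT
  p."user",
  sum(p.stake)  AS total_stake,
  sum(y.payout) AS total_payout,
  sum(y.payout) - sum(p.stake) AS net_revenue
FROM pdr_payouts AS y
JOIN pdr_predictions AS p ON p.ID = y.ID
GROUP BY p."user";
```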