Issue 1000! :)
pdr-slots, pdr-subscriptions, and other tables should be removed from the main etl-flow for now
Can we keep these tables so we have all the raw tables working? I don't see how these could slow us down.
@KatunaNorbert they have been slowing us down in testing, iteration, and many other things.
Objective Before: We implemented them because we wanted to move many things in parallel.
Objective Now: We want to pause them now so we can verify things in order.
- [x] checkpointing is identifying the right places for `st_ts` and `end_ts` (see the checkpoint sketch after this checklist)
Yes, I have reviewed the code end-to-end.
- [Fetching GQL data from the right place]
- [Preloading from CSV for SQL]: `_prepare_temp_table()` should fill the table w/ whatever records are needed before fetching more
- [Fetch all the way to the end]: up to `ppss.lake_ss.end_ts`
- [x] you can stop/cancel/resume/pause, and things resume correctly and reliably
Yes, I have reviewed the code end-to-end.
- [x] the tables and records are being filled/appended correctly
I have reviewed Table A -> Table B -> Table C (failure/resuming/cancel/pausing/etc.) many, many times, and it's all working reliably and accurately.
I have observed the log output to verify completeness and accuracy many times
- [x] there are no gaps in the csv files (see the gap-check sketch after this checklist)
I haven't quite stressed this, but I see that things are starting/resuming correctly and believe it to be working as expected.
- [x] inserting to duckdb starts after all of GQL + CSV has ended
It does... and it's also inserted before GQL + CSV resumes, such that there are no gaps in the data. I have shared this screenshot above, but `_prepare_temp_table()` does a great job of backfilling the data before GQL + CSV resumes fetching.
All the data from GQL is written to temp tables, and the whole job needs to complete successfully before rows are added to duckdb (see the temp-table sketch after this checklist).
I believe this is working correctly.
- [x] inserting to duckdb cannot start/end part-way
Yes, just like above.
- [x] raw tables are updated correctly
- [x] there are no gaps in the raw tables
I believe both of these to be correct.
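To make the checkpointing item concrete, here is a minimal sketch of the idea (not the actual pdr-backend code): resume from just after the newest record already stored, otherwise from the configured lake start, and always fetch up to `ppss.lake_ss.end_ts`. The function name and the `timestamp` column are assumptions.

```python
# Minimal sketch of the checkpointing idea; not the actual pdr-backend code.
# Assumes each raw table has a "timestamp" column (in ms).
import duckdb


def pick_fetch_range(db_path: str, table: str, lake_st_ts: int, lake_end_ts: int):
    """Return (st_ts, end_ts) for the next fetch of `table`.

    st_ts resumes just after the newest record already stored (falling back
    to the lake start); end_ts is always the configured ppss.lake_ss.end_ts.
    """
    con = duckdb.connect(db_path)
    try:
        last_ts = con.execute(f"SELECT max(timestamp) FROM {table}").fetchone()[0]
    except duckdb.CatalogException:  # table doesn't exist yet
        last_ts = None
    finally:
        con.close()

    st_ts = lake_st_ts if last_ts is None else last_ts + 1
    return st_ts, lake_end_ts
```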
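For the "no gaps in the csv files" item, here is a sketch of the kind of check meant, assuming one directory of CSV files per table with a `timestamp` column; the real lake CSV layout may differ.

```python
# Hypothetical gap check over one table's CSV files; the directory layout and
# "timestamp" column are assumptions, not the actual lake CSV format.
import csv
import glob


def find_gaps(csv_dir: str, max_step_ms: int) -> list:
    """Return (prev_ts, ts) pairs where consecutive rows are further apart
    than max_step_ms across all CSV files of one table."""
    timestamps = []
    for path in sorted(glob.glob(f"{csv_dir}/*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                timestamps.append(int(row["timestamp"]))

    timestamps.sort()
    return [
        (prev, cur)
        for prev, cur in zip(timestamps, timestamps[1:])
        if cur - prev > max_step_ms
    ]
```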
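And for the duckdb-insertion items, a sketch of the temp-table staging pattern being described, assuming the production table already exists and that each fetched batch is a pandas DataFrame. The helper below is hypothetical, not the real `_prepare_temp_table()` implementation.

```python
# Sketch of the temp-table staging pattern; not the real pdr-backend code.
# New records land in _temp_<table> first; the production table is only
# written once the whole GQL + CSV job has finished.
import duckdb


def run_fetch_job(db_path: str, table: str, fetch_batches) -> None:
    con = duckdb.connect(db_path)
    temp = f"_temp_{table}"

    # 1. Prepare the temp table (the _prepare_temp_table() idea): same schema
    #    as the production table, which is assumed to already exist.
    con.execute(
        f"CREATE TABLE IF NOT EXISTS {temp} AS SELECT * FROM {table} WHERE false"
    )

    # 2. Fetch from GQL / preload from CSV into the temp table only.
    for batch in fetch_batches():  # each batch: a pandas DataFrame
        con.execute(f"INSERT INTO {temp} SELECT * FROM batch")

    # 3. Only after the whole job succeeded, move everything to production in
    #    a single statement. If any step above raised, production stays
    #    untouched and the next run resumes from the same checkpoint.
    con.execute(f"INSERT INTO {table} SELECT * FROM {temp}")
    con.execute(f"DROP TABLE {temp}")
    con.close()
```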
Issues:
- [x] raw update flow crashes at the fetching-slots step. Due to this, the data is not moved from temp tables to production tables because the flow is not completed #1036
- [x] If all the data gets deleted from a production raw table and there is data in CSV, then the update process breaks with the following error #1038
- THIS ONE WILL BE HANDLED LATER - If some CSV files or rows from CSV files are getting deleted, then those values are going to be refetched and inserted into the corresponding raw production table regardless of whether the data already exists in the table, and this ends up with deprecated data #1042
Fetching the data on the sapphire testnet is not working due to a subgraph issue on the payout data query side, which is described in this issue: #768
Updates in the latest PR are working well https://github.com/oceanprotocol/pdr-backend/pull/1077
Basically, tables are starting + ending at the same time, reliably across all 4 initial tables (predictions, truevals, payouts, and bronze_predictions). The number of rows/records looks correct too.
I created tickets where we discovered functionality is missing, and I am closing this ticket as we have been able to harden the lake end-to-end and the core objectives of this ticket have been achieved.
Motivation
To verify that the lake is working as intended, we need to improve its basic reliability and stability. The basic duckdb behavior needs to be working as expected.
We should verify things are working by keeping it simple, and focusing on the `bronze_pdr_predictions` table. I am recommending that we ignore pdr-subscriptions, pdr-slots, and possibly other tables so we can validate that the lake is behaving as expected.
Verification - Inserting data into the lake and manipulating it
When you first start interacting with the lake, there will be a large fetch/update step that will try to build everything into the lake. As these records are processed, we begin inserting them into our DB. Use the `lake update` command to start fetching data and fill the whole lake.
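For example, something like the following (the exact arguments here are an assumption, mirroring the `pdr lake drop` command shown below):
pdr lake update my_ppss.yaml sapphire-mainnet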
Once the lake is built, it's very likely that many records will have `null` entries as they are initially inserted into the database. We are not worried about this for the moment.
Test - Cutting off the lake (dropping)
Let's first consider how our lake works. A certain amount of data and events arrive that need to be processed. Each time we do a run, we update a certain number of records.
Let's say we wanted to drop everything since Run 1. We would call our CLI drop command and get rid of that data.
pdr lake drop 10000001 my_ppss.yaml sapphire-mainnet
This might be the equivalent of dropping all records from Run 1 -> End, which would include the data from [Run 2, Run 3]. The user would then continue updating the lake by calling `pdr lake update`, which would refetch and rebuild [Run 2, Run 3], getting the system up-to-date and then continuing on from there.
Verifying
We could consider that, by dropping/cutting off part of the lake, all tables would have the same data cut-off/rows-dropped (like in the sketch below), such that the data pipeline can resume from here and all tables can be updated/resumed from the same "height".
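As a concrete (hypothetical) way to check that "same height" property: compare the newest timestamp in each table after a drop + update. The table and column names below are assumptions, not the actual pdr-backend schema or tooling.

```python
# Hypothetical "same height" check; table/column names are assumptions.
import duckdb

TABLES = [
    "pdr_predictions",
    "pdr_truevals",
    "pdr_payouts",
    "bronze_pdr_predictions",
]


def check_same_height(db_path: str) -> None:
    con = duckdb.connect(db_path, read_only=True)
    heights = {
        t: con.execute(f"SELECT max(timestamp) FROM {t}").fetchone()[0]
        for t in TABLES
    }
    con.close()

    print(heights)
    assert len(set(heights.values())) == 1, "tables are not at the same height"
```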
DoD
Testing Data Pipeline Behavior
We need to verify that the basic workflows for inserting data are working. You should be able to do this step-by-step and have the lake and tables working as expected. `lake update` should just work.
Core Components - Raw Table
Core Components - ETL Table