kadena-io / chainweb-data

Data ingestion for Chainweb.
BSD 3-Clause "New" or "Revised" License

Use the new block endpoint for history fill #162

Closed · edmundnoble closed this 10 months ago

edmundnoble commented 1 year ago

This PR uses chainweb-node's new block-headers-with-payloads endpoint during the history fill operations. Doing so simplifies the code and presumably reduces the CPU overhead on both CW-N and CW-D involved in the process.

The next step is to migrate the listen process from the header stream to the header-with-payloads stream, but that is not included in this PR.

enobayram commented 10 months ago

@edmundnoble Thank you very much for opening this PR! I've been waiting for the corresponding CW-N endpoints to go into a release, and now that they have, I'll look into updating, verifying and merging this PR.

enobayram commented 10 months ago

@edmundnoble, I've used a CW-D instance from this branch to backfill the database overnight. Today I noticed that it leaves a lot of gaps ~10-20 heights tall. This seems to be caused by the fact that we fetch chunks of 360 blocks by asking for maxheight - minheight = 360 ranges, while CW-N also limits the block endpoint response to at most 360 items. The end result is that, in the presence of orphans, CW-N sends multiple blocks for some heights and fills up the 360-item limit short of fulfilling the range we asked for. Is this the intended behavior for this endpoint?
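
A quick way to see the effect in the populated data is to compare the row count to the distinct heights over one chunk-sized window; a minimal sketch, assuming the standard chainweb-data blocks table (the chain and height window below are arbitrary):

-- Count the rows a single 360-height window holds on one chain. With
-- orphans present, rows_in_window can exceed 360, so a response capped
-- at 360 items ends short of the requested range.
SELECT count(*) AS rows_in_window
     , count(DISTINCT height) AS distinct_heights
FROM blocks
WHERE chainid = 0
  AND height BETWEEN 4000000 AND 4000359;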

I'm planning to fix the issue on the CW-D side by shrinking our chunks; chunks of 360 are unnecessarily large anyway, and I think they also inflate memory consumption unnecessarily.

edmundnoble commented 10 months ago

Unfortunately, there is no paginated endpoint which includes orphans, which would be the natural way to deal with this issue. You can sort of get around this by asking for maxheight - minheight = 120, because there are almost never more than 2 orphans at a given height.
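
That assumption is easy to spot-check against an already-populated database; a minimal sketch, again assuming the standard blocks table:

-- List any (chainid, height) pairs carrying more than 2 block rows;
-- each row beyond the first is an orphan at that height. An empty
-- result supports the 120-height safety margin.
SELECT chainid, height, count(*) AS blocks_at_height
FROM blocks
GROUP BY chainid, height
HAVING count(*) > 2
ORDER BY blocks_at_height DESC
LIMIT 10;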

enobayram commented 10 months ago

I also opted for a chunk size of 120, which solved the problem of gaps as expected. I'll leave this fill running for a little longer and look for any issues; otherwise, this PR seems very close to completion.

enobayram commented 10 months ago

I think this PR is ready to be merged. For the last 24 hours or so, I've been running a chainweb-data process from this PR in server mode as well as in fill mode, against an (initially) empty Postgres schema inside a database where a chainweb-data instance from the last release populates another schema, so that the resulting data can be compared.

During those 24 hours, chainweb-data managed to backfill roughly 50% of current mainnet. The machine I ran this on has other services running on it, so it's hard to measure fill performance differences, but the populated data doesn't seem to have any issues based on the following checks:

No gaps in the data

I've run the following query to check if there are any gaps in the populated data:

SELECT chainid, height, next_height
FROM (
    SELECT
        chainid,
        height,
        LEAD(height) OVER (PARTITION BY chainid ORDER BY height DESC) AS next_height
    FROM
        blocks
) AS subquery
WHERE
    (height - next_height) > 1
ORDER BY
    chainid ASC,
    height DESC
LIMIT 20;

 chainid | height | next_height
---------+--------+-------------
(0 rows)

(This is essentially the same query chainweb-data itself performs to discover gaps in the data.)

Populated data is identical to the data from the previous release

SELECT count(*)
     , count(*) FILTER (WHERE (SELECT to_jsonb(b) = to_jsonb(b2) FROM cwd.blocks b2 WHERE b2.hash = b.hash))
     , count(*) FILTER (WHERE NOT (SELECT to_jsonb(b) = to_jsonb(b2) FROM cwd.blocks b2 WHERE b2.hash = b.hash))
FROM blocks b
;

  count   |  count   | count
----------+----------+-------
 34508643 | 34498688 |     0
(1 row)

It seems we've ended up with some additional blocks (I think this new endpoint provides all orphans seen). Rows without a counterpart in the release schema are counted by neither filter, which is why the first two counts differ while the mismatch count is 0; all the blocks we've copied that are common with the release CW-D have all of their fields identical.
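
One way to confirm that reading is to count the rows with no counterpart in the release schema; a small sketch reusing the cwd schema name from the queries above:

-- Blocks present in this branch's schema but absent from the release
-- one; this count should equal the difference between the first two
-- counts above.
SELECT count(*) AS extra_blocks
FROM blocks b
WHERE NOT EXISTS (SELECT 1 FROM cwd.blocks b2 WHERE b2.hash = b.hash);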

SELECT count(*)
     , count(*) FILTER (WHERE (SELECT to_jsonb(t) = to_jsonb(t2) FROM cwd.transactions t2 WHERE t2.block = t.block AND t2.requestkey = t.requestkey))
     , count(*) FILTER (WHERE NOT (SELECT to_jsonb(t) = to_jsonb(t2) FROM cwd.transactions t2 WHERE t2.block = t.block AND t2.requestkey = t.requestkey))
FROM transactions t 
WHERE t.height > 4100000
;

  count   |  count   | count 
----------+----------+-------
 22884000 | 22825320 |     0
(1 row)

A similar query over a subset of the transactions; no discrepancy found.

The query was a little trickier for the events table, because that table contains the paramtext column, which is rendered from JSON to a string on the Haskell side, and the field order changed between versions. That's why I had to remove that column from the JSON representation used for the comparison, in order to avoid false negatives:

SELECT count(*)
     , count(*) FILTER (WHERE (SELECT to_jsonb(e) - 'paramtext' = to_jsonb(e2) - 'paramtext' FROM cwd.events e2 WHERE e2.block = e.block AND e2.requestkey = e.requestkey AND e2.idx = e.idx))
     , count(*) FILTER (WHERE NOT (SELECT to_jsonb(e) - 'paramtext' = to_jsonb(e2) - 'paramtext' FROM cwd.events e2 WHERE e2.block = e.block AND e2.requestkey = e.requestkey AND e2.idx = e.idx))
FROM events e 
WHERE e.height > 4300000
;

  count  |  count  | count
---------+---------+-------
 7695345 | 7674529 |     0
(1 row)
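
As a possible alternative to dropping paramtext from the comparison: if the column always holds valid JSON text, casting both sides to jsonb normalizes object key order, so it could be compared semantically. A sketch under that assumption, not something I've run:

-- Count rows whose paramtext still differs after jsonb normalization
-- (object key order is irrelevant in jsonb). Rows with no counterpart
-- in cwd.events make the scalar subquery NULL and are not counted,
-- matching the queries above.
SELECT count(*) FILTER (WHERE (
         SELECT e.paramtext::jsonb IS DISTINCT FROM e2.paramtext::jsonb
         FROM cwd.events e2
         WHERE e2.block = e.block AND e2.requestkey = e.requestkey AND e2.idx = e.idx))
FROM events e
WHERE e.height > 4300000;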