protocol / prodeng

Issues, discussions and documentation from the production engineering team
2 stars 1 forks source link

Deploy instrumented block system for evaluation #7

Closed iand closed 2 years ago

iand commented 2 years ago

This is an analysis of about 48 hours of activity on a single instrumented gateway instance covering 48291 blocks.

The blocks were ranked in descending order of the LNC-R-WS-U profit function: (citeseer or dweb.link )

pi = ri*di / si

where:

The goal of the profit function is to maximise delay savings metric (DSR) which represents the fraction of communication delays saved by satisfying requests from cache. Keeping slow to fetch, but frequently requested blocks in the cache can increase data locality which will reduce the time to first byte when serving blocks.

Mean Fetch Times (in seconds)

The last three timings for fetching each block from the network via bitswap were recorded. The vast majority of blocks (44550) were fetched only once, ~7% (3305) were fetched twice and <1% (436) were fetched three times. This is controlled by the number of garbage collections performed by the blockstore in the sample period.

The median/mean/min/max for all 48291 is shown in the first column. The second column shows the blocks with the lowest profit. Low profit blocks would be candidates for removal by garbage collection and the table demonstrates that these blocks can be rapidy re-fetched if necessary with a mean and median below 50ms.

The top 100 and top 1000 blocks by profit would be ones retained in the blockstore by garbage collection and the table shows that these are the slowest blocks to fetch by mean and median, both of which are greater than 10 seconds. Retaining these blocks means a response can be returned to the user in milliseconds rather than tens of seconds.

        All Blocks  Bottom 1000  Top 1000    Top 100
Median:     0.463        0.03     10.432     11.975
Mean:       6.845        0.04     28.484     65.833
Min:        0.005        0.01      0.119      1.056
Max:     2855.120        0.29   2855.120   2855.120

Reference Rate (per hour)

The last three times at which each block was accessed to service a request by a user were recorded. Less than a third of blocks (13494) were accessed only once, a third were accessed twice (14810) and 41% were accessed at least three times (19987). Note that blocks may be accessed from the blockstore multiple times as part of different dag traversals so this doesn't necessarily indicate that users are re-requesting the same blocks.

An estimate of the request rate per hour was derived from the three recorded values. Once again the median/mean/min/max for all 48291 is shown in the first column. Most blocks were accessed less than once per hour, the highest around once per minute.

The 1000 blocks with the lowest profit are shown in the second column. These are very rarely re-accessed, less than once per day. This suggests that these are good candidates for removal by garbage collection.

The top 100 and top 1000 blocks by profit are among the highest repeatedly accessed blocks, the highest being 10-13 times per hour by median and mean. Scoring these blocks highly means the garbage collector will tend to keep them in the block store, avoiding the cost of refetching.

        All Blocks  Bottom 1000  Top 1000  Top 100
Median:    0.089         0.029     2.018    9.409
Mean:      0.962         0.033     5.099   13.828
Min:       0.019         0.019     0.068    0.179
Max:      59.120         0.251    56.125   49.836

Block Sizes (in bytes)

The direct byte size of each block was recorded. Ideally we would record the size of the dag for which the block is the root but it was not implemented as part of this test.

The profit formula scores larger blocks lower on the basis that purging larger blocks during garbage collection can free up more space while retaining higher numbers of other useful blocks.

It was not clear from the outset whether this would affect the scoring of frequently used, slow to fetch blocks but the previous two sections indicate that these highly valuable blocks are still scored highly and would be retained by the garbage collector

        All Blocks  Bottom 1000  Top 1000  Top 100
Median:      1792       208900       158      152
Mean:       51647       259957       309      303
Min:            6        33695         6       32
Max:      1044411      1037807      4666     1810
Sum:   2494084943    285953230    309010    30373

Conclusion

The profit function appears to rank blocks well such that highly accessed but costly to fetch blocks are scored highly, while infrequently accessed and fast to replace blocks are scored low. These low scoring blocks would be first to be deleted by a garbage collector that ranked by profit. The results also indicate that deleting by ascending profit score will free the largest amount of space by deleting the fewest useful blocks, improving data locality and hence time to first block.