behzadnouri opened this issue 2 years ago (status: Open)
I'm observing on the order of 100 shred repair requests being issued per second, but the vast majority of these end up not getting inserted into the blockstore because they already exist (propagated through turbine or recovery eventually). Only ~1 repair per second actually gets inserted into blockstore.
As an experiment, I increased the MAX_TURBINE_PROPAGATION_IN_MS constant, which controls how long we'll wait after the "expected" shred arrival time before issuing a repair request. Increasing this time had a large impact on both the total number of repairs issued and the number of necessary repairs, as shown in the table below:
Note that @steviez collected a similar set of data in #28634 that shows repaired shreds colliding with turbine/recovery
I believe this answers a large part of the question posed by this issue. What is still not clear to me is whether the remaining successful repairs are absolutely necessary (or whether turbine/recovery would have gotten those shreds into the blockstore eventually) and whether the number makes sense given expected packet loss.
A few more interesting observations: a number of these repairs are issued as the HighestShred repair type through the generate_repairs_for_slot function. These repairs seem unnecessary, as we haven't waited very long at all for turbine shreds to show up.

Here's an interesting log snippet detailing the evolution of slot 168239177, FEC set 96, which ultimately had 1 successful repair for shred 109:
[2022-12-21T20:12:10.228677416Z WARN solana_ledger::blockstore] BWLOG: StillNeed 31 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.242342641Z WARN solana_ledger::blockstore] BWLOG: StillNeed 30 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.257640999Z WARN solana_ledger::blockstore] BWLOG: StillNeed 29 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.284582242Z WARN solana_ledger::blockstore] BWLOG: StillNeed 28 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.289387464Z WARN solana_ledger::blockstore] BWLOG: StillNeed 27 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.294383968Z WARN solana_ledger::blockstore] BWLOG: StillNeed 25 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.295578511Z WARN solana_ledger::blockstore] BWLOG: StillNeed 24 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.299210581Z WARN solana_ledger::blockstore] BWLOG: StillNeed 23 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.300530091Z WARN solana_ledger::blockstore] BWLOG: StillNeed 22 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.302735578Z WARN solana_ledger::blockstore] BWLOG: StillNeed 21 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.308923790Z WARN solana_ledger::blockstore] BWLOG: StillNeed 20 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.312854886Z WARN solana_ledger::blockstore] BWLOG: StillNeed 19 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.313418614Z WARN solana_ledger::blockstore] BWLOG: StillNeed 18 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.321557781Z WARN solana_ledger::blockstore] BWLOG: StillNeed 17 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.322756622Z WARN solana_ledger::blockstore] BWLOG: StillNeed 16 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.322915874Z WARN solana_ledger::blockstore] BWLOG: StillNeed 15 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.325492725Z WARN solana_ledger::blockstore] BWLOG: StillNeed 14 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.335898856Z WARN solana_ledger::blockstore] BWLOG: StillNeed 13 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.361718104Z WARN solana_ledger::blockstore] BWLOG: StillNeed 12 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.363473669Z WARN solana_ledger::blockstore] BWLOG: StillNeed 11 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.448763466Z WARN solana_ledger::blockstore] BWLOG: StillNeed 10 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.500319796Z WARN solana_core::repair_service] BWLOG: generate_repairs_for_slot requesting 21 repairs: [96, 97, 98, 100, 102, 103, 104, 105, 109, 112, 113, 114, 116, 117, 118, 121, 122, 123, 125, 126, 127] from slot 168239177. 328 ms since first shred, consumed = 96, received = 479, completed = {0, 1, 2, 7, 8, 9, 26, 31, 34, 37, 39, 41, 42, 44, 46, 50, 51, 53, 54, 57, 59, 61, 64, 66, 70, 86, 88, 89, 90, 94, 95, 106, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 142, 143, 145, 146, 148, 149, 169, 171, 172, 174, 175, 176, 178, 180, 181, 183, 184, 186, 187, 189, 190, 191, 192, 193, 194, 196, 198, 199, 200, 201, 202, 203, 222, 224, 225, 226, 228, 230, 231, 232, 234, 235, 236, 238, 241, 244, 264, 265, 267, 269, 270, 271, 272, 273, 274, 276, 277, 278, 299, 300, 301, 302, 304, 305, 307, 310, 311, 312, 314, 315, 316, 317, 318, 319, 320, 321, 323, 325, 326, 327, 347, 349, 350, 352, 353, 355, 356, 358, 359, 360, 362, 363, 365, 367, 368, 370, 371, 372, 373, 393, 394, 395, 396, 398, 399, 400, 401, 402, 404, 405, 407, 436, 464, 478}
[2022-12-21T20:12:10.553677206Z WARN solana_ledger::blockstore] BWLOG: successful repair insert for slot 168239177, shred 109
[2022-12-21T20:12:10.553682957Z WARN solana_ledger::blockstore] BWLOG: StillNeed 9 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.586413656Z WARN solana_ledger::blockstore] BWLOG: StillNeed 8 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.595120939Z WARN solana_ledger::blockstore] BWLOG: StillNeed 7 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.596805671Z WARN solana_ledger::blockstore] BWLOG: StillNeed 6 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.604588493Z WARN solana_core::repair_service] BWLOG: generate_repairs_for_slot requesting 20 repairs: [96, 97, 98, 100, 102, 103, 104, 105, 112, 113, 114, 116, 117, 118, 121, 122, 123, 125, 126, 127] from slot 168239177. 432 ms since first shred, consumed = 96, received = 514, completed = {0, 1, 2, 7, 8, 9, 26, 31, 34, 37, 39, 41, 42, 44, 46, 50, 51, 53, 54, 57, 59, 61, 64, 66, 70, 86, 88, 89, 90, 94, 95, 106, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 142, 143, 145, 146, 148, 149, 169, 171, 172, 174, 175, 176, 178, 180, 181, 183, 184, 186, 187, 189, 190, 191, 192, 193, 194, 196, 198, 199, 200, 201, 202, 203, 222, 224, 225, 226, 228, 230, 231, 232, 234, 235, 236, 238, 241, 244, 264, 265, 267, 269, 270, 271, 272, 273, 274, 276, 277, 278, 299, 300, 301, 302, 304, 305, 307, 310, 311, 312, 314, 315, 316, 317, 318, 319, 320, 321, 323, 325, 326, 327, 347, 349, 350, 352, 353, 355, 356, 358, 359, 360, 362, 363, 365, 367, 368, 370, 371, 372, 373, 393, 394, 395, 396, 398, 399, 400, 401, 402, 404, 405, 407, 428, 430, 432, 433, 434, 436, 437, 438, 439, 441, 442, 443, 444, 445, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 480, 481, 483, 484, 486, 507, 508, 509, 512, 513}
[2022-12-21T20:12:10.610407315Z WARN solana_ledger::blockstore] BWLOG: StillNeed 5 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.622011796Z WARN solana_ledger::blockstore] BWLOG: StillNeed 4 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.645742447Z WARN solana_ledger::blockstore] BWLOG: StillNeed 3 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.650798444Z WARN solana_ledger::blockstore] BWLOG: StillNeed 2 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.651050301Z WARN solana_ledger::blockstore] BWLOG: StillNeed 1 ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.652664910Z WARN solana_ledger::blockstore] BWLOG: CanRecover ErasureSetId(168239177, 96) Index { slot: 168239177, data: ShredIndex { index: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 99, 101, 106, 107, 108, 109, 110, 111, 115, 116, 119, 120, 121, 124, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513} }, coding: ShredIndex { index: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 67, 68, 69, 71, 72, 73, 74, 75, 82, 84, 87, 88, 91, 95, 96, 97, 98, 99, 100, 101, 103, 104, 106, 110, 113, 114, 115, 116, 119, 122, 124, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 
251, 252, 253, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 276, 277, 278, 279, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 387, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 403, 404, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 447, 448, 449, 450, 451, 452, 453, 454, 456, 457, 458, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 482, 487, 496, 509, 513, 514, 515, 516, 517, 519, 520, 521, 522, 524, 525, 527, 528, 529, 530, 531, 534, 535, 536, 537, 538, 540, 541, 542, 543, 545, 546, 547, 548, 551, 552, 553, 554, 555, 556, 557, 558, 559, 562, 564, 565, 566, 567, 569, 571, 572, 573} } }
[2022-12-21T20:12:10.654743738Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.656298934Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.656995814Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.662231711Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.664292144Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.665374124Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.667918614Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.672956958Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.682359378Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.700957196Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.739988039Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.740975049Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.842044797Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
[2022-12-21T20:12:10.895988728Z WARN solana_ledger::blockstore] BWLOG: DataFull ErasureSetId(168239177, 96)
2 actions that can be taken to cut down on repair related network traffic:
1. Increase MAX_TURBINE_PROPAGATION_IN_MS so that we wait longer for shreds to show up through turbine before requesting repair. I was seeing a large reduction in requests without any visible impact to slot full times when increasing from 100 to 200.
2. Only issue HighestShred repairs through generate_repairs_for_slot once the shred has surpassed the turbine timeout (see the sketch below).
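A rough sketch of what that second gate could look like (illustrative only; the types and function here are made up and are not the actual repair_service code):

```rust
// Illustrative sketch of action 2, not the actual generate_repairs_for_slot:
// only fall back to a HighestShred request once the slot has gone quiet for
// longer than the turbine propagation timeout, instead of asking immediately.
const MAX_TURBINE_PROPAGATION_IN_MS: u64 = 100;

#[derive(Debug)]
struct HighestShredRepair {
    slot: u64,
    shred_index: u64, // first index past the highest shred we have received
}

fn maybe_request_highest_shred(
    slot: u64,
    highest_received_index: u64,
    ms_since_last_shred: u64,
) -> Option<HighestShredRepair> {
    // Give turbine a chance to deliver the trailing shreds before asking peers.
    if ms_since_last_shred <= MAX_TURBINE_PROPAGATION_IN_MS {
        return None;
    }
    Some(HighestShredRepair {
        slot,
        shred_index: highest_received_index + 1,
    })
}

fn main() {
    // 50ms of silence: still within the turbine window, so no request yet.
    assert!(maybe_request_highest_shred(168239177, 479, 50).is_none());
    // 250ms of silence: issue the HighestShred repair.
    println!("{:?}", maybe_request_highest_shred(168239177, 479, 250));
}
```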
Are the stats gathered in https://github.com/solana-labs/solana/issues/28638#issuecomment-1362099338 with the current 100ms MAX_TURBINE_PROPAGATION_IN_MS or some other setup?
packet success within a shred has some amount of correlation for whatever reason.
within a "batch"? not "shred", right?
Increase MAX_TURBINE_PROPAGATION_IN_MS so that we wait for longer for shreds
any idea how far we can push this before it starts negatively impacting node? also, wondering if we can design a more robust mechanism than just an arbitrary wait time.
Are the stats gathered in #28638 (comment) with the current 100ms MAX_TURBINE_PROPAGATION_IN_MS or some other setup?

They are gathered with MAX_TURBINE_PROPAGATION_IN_MS set to 200. However, I also have stats gathered with the default 100, and the output looks almost the same. The only real visible change I can pick up on is fewer repair requests being made, leading to fewer blockstore insertion failures for "already exists."
within a "batch"? not "shred", right?
Good catch. I think I meant to say "erasure set" here.
any idea how far we can push this before it starts negatively impacting node? also, wondering if we can design a more robust mechanism than just an arbitrary wait time.
Not completely sure. 200 seems to cut down the noise by an order of magnitude without impacting slot times. I previously saw further repair request reduction moving to 400, but I didn't capture block insertion times for that run.
I actually don't hate the current wait mechanism. It uses a reference time for each shred as an offset from the first shred received time + some arbitrary turbine delay time.
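For reference, a simplified model of that mechanism (the constant matches the default discussed here, but the tick math and function are stand-ins, not the actual blockstore API):

```rust
// Simplified model of the existing wait mechanism: each shred's "expected"
// arrival is the slot's first-shred receive time plus the shred's offset into
// the slot (its reference tick), and repair is only requested once we are more
// than MAX_TURBINE_PROPAGATION_IN_MS past that expected arrival.
const MAX_TURBINE_PROPAGATION_IN_MS: u64 = 100;
const MS_PER_TICK: u64 = 400 / 64; // ~6ms: 400ms slot / 64 ticks (rough approximation)

fn should_request_repair(now_ms: u64, first_shred_recv_ms: u64, reference_tick: u64) -> bool {
    let expected_arrival_ms = first_shred_recv_ms + reference_tick * MS_PER_TICK;
    now_ms > expected_arrival_ms + MAX_TURBINE_PROPAGATION_IN_MS
}

fn main() {
    let first_shred_recv_ms = 1_000;
    // A shred expected ~30 ticks into the slot is not repair-eligible at +150ms...
    assert!(!should_request_repair(1_150, first_shred_recv_ms, 30));
    // ...but becomes eligible once we are well past its expected arrival + timeout.
    assert!(should_request_repair(1_400, first_shred_recv_ms, 30));
    println!("ok");
}
```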
This is another idea we can try:
1. On a given node (the repairer), monitor how many repaired shreds are already in the blockstore by the time the repaired shred is received. We are already tracking this.
2. If the ratio of already present shred count / repaired shred count goes above a certain threshold, slow down on repair requests. If the ratio goes below a certain threshold, make repair requests more aggressive.
3. This can be done by replacing MAX_TURBINE_PROPAGATION_IN_MS with a dynamically calculated value in the repair algorithm, or probably by a cleaner refactor of the code.

The idea is that each node has somewhat unique network connectivity (latency/throughput/jitter), and this dynamic adjustment will allow it to monitor and tune its repairs (a rough sketch follows below).

Thoughts? @behzadnouri, @bw-solana
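To make the proposal concrete, a hypothetical sketch (all names, thresholds, and step sizes are placeholders, not an agreed design), including the upper/lower bounds discussed further down:

```rust
// Hypothetical sketch of the dynamic adjustment proposed above: the repair
// timeout starts at today's constant, is nudged up when most repaired shreds
// turn out to be redundant, nudged down when almost none are, and clamped so
// the behavior stays within predictable bounds.
struct AdaptiveRepairTimeout {
    timeout_ms: u64,
}

impl AdaptiveRepairTimeout {
    const MIN_MS: u64 = 100; // never more aggressive than the current default
    const MAX_MS: u64 = 400; // cap how long we are willing to defer repairs
    const STEP_MS: u64 = 25;

    fn new() -> Self {
        Self { timeout_ms: Self::MIN_MS }
    }

    // Called once per metrics interval with counts from blockstore insertion,
    // e.g. num_repaired_data_shreds_exists vs. total repaired shreds received.
    fn update(&mut self, already_present: u64, total_repaired: u64) {
        if total_repaired == 0 {
            return; // nothing to learn from this interval
        }
        let redundant_ratio = already_present as f64 / total_repaired as f64;
        if redundant_ratio > 0.5 {
            // Mostly wasted repairs: wait longer before requesting.
            self.timeout_ms = (self.timeout_ms + Self::STEP_MS).min(Self::MAX_MS);
        } else if redundant_ratio < 0.1 {
            // Repairs are mostly useful: it is safe to be more aggressive.
            self.timeout_ms = self.timeout_ms.saturating_sub(Self::STEP_MS).max(Self::MIN_MS);
        }
    }
}

fn main() {
    let mut timeout = AdaptiveRepairTimeout::new();
    timeout.update(90, 100); // 90% redundant -> back off
    assert_eq!(timeout.timeout_ms, 125);
    timeout.update(2, 100); // 2% redundant -> more aggressive again
    assert_eq!(timeout.timeout_ms, 100);
    println!("timeout = {}ms", timeout.timeout_ms);
}
```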
Makes sense to me. We should try it out on a few nodes.
What's a good target redundant repair shred rate? 50%?
I can hack up something and try on monogon cluster first.
I am not against trying this out to see if it results in an improvement or any interesting insights.
However, please keep in mind that these kinds of dynamic adjustments tend to become very difficult to predict or debug when things break, especially when there are a lot of other dynamic components or moving parts that they interact with.
Noted. Would keeping dynamism in check with upper/lower bounds be a way forward?
sure, let's run some experiments and evaluate. thank you
I added some counters to see how many duplicate shreds are because of repair. Running on monogon, with 100 nodes, 1 client, and --bench-tps-args 'tx-count=15000 threads=4 thread-batch-sleep-ms=0', produces this result.

The counter is in core/src/window_service.rs, in the PossibleDuplicateShred::Exists(shred) arm, with some tweaks to pass along the repaired status (a simplified sketch is included below).

I noticed some deduping in sigverify_stage as well, but in my experiments none of those are due to repairs. That deduping is happening within a smaller time window, though, so it may not catch anything relevant wrt repairs.
Any thoughts about the analysis?
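For context, a toy version of the kind of counter being described (the enum and metrics struct here are simplified stand-ins, not the real window_service types):

```rust
// Toy model of the counter: when blockstore reports that an inserted shred
// already exists, only bump the metric if the shred arrived through repair,
// so that it isolates redundant repair traffic.
enum PossibleDuplicateShred {
    Exists(Vec<u8>), // the already-existing shred's payload (stand-in type)
    Other,
}

#[derive(Default)]
struct RepairMetrics {
    num_repaired_shreds_already_exist: u64,
}

fn record_duplicate(dup: PossibleDuplicateShred, repaired: bool, metrics: &mut RepairMetrics) {
    if let PossibleDuplicateShred::Exists(_shred) = dup {
        if repaired {
            // The shred came back through repair, but turbine/recovery beat it here.
            metrics.num_repaired_shreds_already_exist += 1;
        }
    }
}

fn main() {
    let mut metrics = RepairMetrics::default();
    record_duplicate(PossibleDuplicateShred::Exists(vec![0u8; 8]), true, &mut metrics);
    record_duplicate(PossibleDuplicateShred::Exists(vec![0u8; 8]), false, &mut metrics);
    record_duplicate(PossibleDuplicateShred::Other, true, &mut metrics);
    println!("redundant repaired shreds: {}", metrics.num_repaired_shreds_already_exist); // 1
}
```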
Is this essentially the same as blockstore-insert-shreds.num_repaired_data_shreds_exists?
Looks like spikes around 250 shreds. Is this a summation, average? What's the grouping interval?
I think it would be helpful if the plot shows the total number of repair shreds as well. Like if the plot shows 200, out of how many incoming repair shreds 200 of them were redundant.
I noticed some deduping in sigverify_stage as well. But in my experiments, none of those are due to repairs. But, that deduping is happening within a smaller time window. So it may not catch anything relevant wrt repairs.
Because repair shreds have a trailing unique nonce, I believe the deduper in sigverify-stage does not filter them out (aside from the false positive rate of the deduper).
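A toy illustration of why the nonce defeats packet-level dedup (this is not the actual Deduper, which is a probabilistic filter, and the sizes/nonce layout here are assumptions):

```rust
// If the dedup key is derived from the full packet bytes, two repair responses
// carrying the same shred but different trailing nonces produce different keys
// and are never treated as duplicates of each other.
use std::collections::{hash_map::DefaultHasher, HashSet};
use std::hash::{Hash, Hasher};

fn dedup_key(packet: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    packet.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let shred = vec![42u8; 1024]; // identical shred payload
    let repair_a = [shred.clone(), 1u32.to_le_bytes().to_vec()].concat(); // nonce 1 appended
    let repair_b = [shred.clone(), 2u32.to_le_bytes().to_vec()].concat(); // nonce 2 appended

    let mut seen = HashSet::new();
    assert!(seen.insert(dedup_key(&repair_a))); // first copy accepted
    assert!(seen.insert(dedup_key(&repair_b))); // same shred, different nonce: not deduped
    assert!(!seen.insert(dedup_key(&repair_a))); // an exact byte-for-byte duplicate would be
    println!("ok");
}
```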
Thanks for pointing out. Yes, it's essentially the same as blockstore-insert-shreds.num_repaired_data_shreds_exists. Great that we already have a counter for it.
The graph is just printing values as is, with no formula. The counter's metrics rate is 1.
I think it would be helpful if the plot shows the total number of repair shreds as well. Like if the plot shows 200, out of how many incoming repair shreds 200 of them were redundant.
I'll take a look. Thanks @behzadnouri
Plots: the total repaired shreds received by the fetch stage, and the shreds which already existed in the blockstore. And, applying mean as the formula: total repaired shreds vs. duplicates.

So some good percentage of shreds are actually duplicates. The percentage varies over time.
This seems in line with results we've seen in the past.
One thing I wonder for the duplicate repaired shreds is how many fall into each of these cases: (1) the shred eventually arrived through turbine anyway, or (2) the shred was filled in by erasure recovery.
(2) is the one that really has me thinking. I'm wondering if we should limit outstanding repair requests to what we need to kick things over to recovery.
(2) is a fair point, as erasure recovery can fill in the missing shreds (for which the repairs were already requested). Not sure if there's any clever way to detect it though. I don't think we store any recovery related information in blockstore today.
I am guessing this is still way underestimating the number of redundant repair requests; or something else is going on that we don't yet understand.
According to this binomial distribution calculator (https://stattrek.com/online-calculator/binomial), with 32:32 batches, even if 25% of shreds are not delivered (which is way too large for a monogon cluster), the probability of recovering the batch is still effectively 100%. So I would expect non-redundant repairs to be almost zero.
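As a sanity check on that back-of-the-envelope number, a small stand-alone calculation (assuming independent per-shred loss, which is optimistic if losses are correlated within an erasure set):

```rust
// With 32:32 data:coding shreds, an erasure set is recoverable as long as at
// least 32 of the 64 shreds arrive. Assuming each shred is delivered
// independently with probability p, the chance that recovery fails is the
// lower binomial tail P(X < 32) for X ~ Binomial(64, p).
fn binomial_pmf(n: u64, k: u64, p: f64) -> f64 {
    // ln C(n, k) computed via log-factorials to avoid overflow at n = 64.
    fn ln_factorial(n: u64) -> f64 {
        (2..=n).map(|i| (i as f64).ln()).sum()
    }
    let ln_choose = ln_factorial(n) - ln_factorial(k) - ln_factorial(n - k);
    (ln_choose + (k as f64) * p.ln() + ((n - k) as f64) * (1.0 - p).ln()).exp()
}

fn main() {
    let (num_shreds, num_data) = (64u64, 32u64); // 32:32 erasure batch
    for &loss in &[0.05, 0.10, 0.25] {
        let p = 1.0 - loss; // per-shred delivery probability (assumed independent)
        let fail: f64 = (0..num_data).map(|k| binomial_pmf(num_shreds, k, p)).sum();
        println!("loss = {loss:.2}: P(cannot recover the batch) = {fail:.3e}");
    }
}
```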
I'll do some more digging.
A few points to note: how is the topology of the monogon cluster? Are the nodes geographically distributed?
Problem
Shreds are propagated using 32:32 data:code erasure batches. Despite that, even at times when the delinquent stake is pretty low and the cluster is working fine, we still observe a small but non-zero number of repaired shreds on the cluster. Similarly, on a small GCE cluster, repair metrics are still non-zero.
One possibility is that repair is too aggressive and some repaired shreds are ultimately received from Turbine anyway or recovered through erasure codes.
Proposed Solution
Debug why nodes still repair shreds even when delinquent stake is low and cluster is not overloaded.