Posting an update here where it's more visible rather than buried in a comment thread.
First, the discrepancy in performance between the old and new clusters was a mistake on my end: I was running a `buildtests` image on the new cluster, which has asan enabled, and a `perftests` image on the old cluster, which has asan disabled. On a `perftests` image I'm seeing a max TPS on these tests of around 250. Note that this is a build with some extra debug info added in, and I haven't yet updated the distributions, so that TPS number will change. However, it should be in the right ballpark.
Second, I think I'm at the point where limits aren't the problem anymore. Ultimately these runs fail because a node (or nodes) start experiencing many `txBAD_SEQ` errors, triggering a loadgen failure. These are caused by the transaction queue dropping transactions, but I don't believe this is due to insufficient limits. Even when raising the limits comically high (40x what we'd expect in a single ledger), this failure mode remains and the max TPS is unchanged. Instead, I think the transaction queue dropping transactions is a symptom of another problem. Shortly before a node starts dropping transactions, its ledger age climbs above the rest of the network's. At a ledger age of ~35s the queue starts dropping transactions. At a ledger age of ~50s the rate of dropped transactions is so high that loadgen cannot keep up with the expected sequence numbers and the run fails. What's interesting is that not all nodes experience these high ledger age values and the subsequent dropping of transactions.
Take this example run at 300 TPS with limits set at 40x what you'd expect to see in a single ledger. At around 15:02:45, node `bd-0` starts to see its ledger age pull away from the rest of the nodes:

Around the same time that `bd-0`'s ledger age starts climbing, it also starts seeing demand timeouts:

About a minute later, `bd-0` starts banning transactions:

Where that graph flatlines around 15:13:45, loadgen fails on `bd-0` and it stops generating new transactions.

You can also see `bd-0` make and abandon far more demands than the rest of the network:
This pattern is consistent across multiple runs. Sometimes it's more than one node that fails in this way.
All of this leads me to believe that it's not a limits issue, but rather that something else is causing some nodes to fail to keep up with ledgers as they are broadcast. I suspect it's just that the overlay is overwhelmed, but I'm not sure why this affects some nodes and not others. Maybe it just comes down to the overlay topology making some nodes more susceptible to this issue?
@SirTyson, I'm curious what you think about all of this. Does this seem like a "legitimate" loadgen failure? Or the result of a misconfiguration (such as using the wrong limits)? Also, does ~250 TPS sound like a plausible outcome here? Keep in mind that ~125 of those transactions per second are Soroban transactions, meaning that a Soroban transaction is ~7x more resource intensive than a payment (assuming a TPS of 900 for pay transactions).
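(The ~7x number is just rough division; here's the back-of-the-envelope, with the 900 TPS pay-only figure being the assumption noted above:)

```python
# Rough arithmetic behind the ~7x estimate. Assumes ~900 TPS is achievable
# with payments only, and that ~half of the ~250 TPS mixed-mode result
# (~125 TPS) is Soroban traffic. This ignores the classic half of the mixed
# load, so it's only a ballpark figure.
pay_only_tps = 900
soroban_tps_in_mixed_run = 125
print(pay_only_tps / soroban_tps_in_mixed_run)  # => 7.2, i.e. ~7x
```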
I decided to see how the old max TPS test that uses only pay transactions fails and saw the same exact failure pattern in this run at 900 TPS. The pattern isn't as dramatic, but it's still there:
First, `bd-0`'s (in green) ledger age and demand timeouts start creeping up:

Then, about a minute later, `bd-0` starts banning transactions:

Loadgen fails on `bd-0` where that graph levels out. Like the mixed mode runs, the other nodes keep going just fine, probably due to the load decreasing network-wide when `bd-0` stops generating load.
This all leads me to believe that the reason for loadgen failures is the same in both mixed and classic mode. I still suspect it's the overlay getting overwhelmed, and the TPS is just lower for mixed mode because the messages are larger.
Related: when running the new test, is the ledger close time metric (`ledger.ledger.close`) consistent with the same classic TPS? I opened https://github.com/stellar/stellar-core/issues/4365 recently, so I'm wondering if there's any overhead with Soroban ops we're not accounting for in our simulations. E.g. if there's extra overhead for Soroban vs classic close time, this does impact the resulting TPS.
> when running the new test, is the ledger close time metric (`ledger.ledger.close`) consistent with the same classic TPS?
No, it looks like the average ledger close time is a bit higher on the mixed mode run, and the max is much higher. Here's the metric from a mixed mode run:
,"ledger.ledger.close":{
"type":"timer",
"count":16,
"event_type":"calls",
"rate_unit":"s",
"mean_rate":0.139674,
"1_min_rate":0.199989,
"5_min_rate":0.191766,
"15_min_rate":0.17716,
"duration_unit":"ms",
"min":0.532251,
"max":3392.97,
"mean":553.548,
"stddev":1213.05,
"sum":8856.77,
"median":1.70122,
"75%":2181.19,
"95%":3271.87,
"98%":3271.87,
"99%":3271.87,
"99.9%":3271.87,
"100%":3271.87
}
And here's one from a classic run:
,"ledger.ledger.close":{
"type":"timer",
"count":148,
"event_type":"calls",
"rate_unit":"s",
"mean_rate":0.150299,
"1_min_rate":0.183972,
"5_min_rate":0.186444,
"15_min_rate":0.190643,
"duration_unit":"ms",
"min":0.541953,
"max":601.459,
"mean":413.158,
"stddev":104.403,
"sum":61147.4,
"median":464.925,
"75%":477.388,
"95%":553.14,
"98%":553.14,
"99%":553.14,
"99.9%":553.14,
"100%":553.14
}
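If it's useful for comparing runs side by side, something like this pulls the timer out of a metrics dump (a rough sketch; it assumes the dumps have the shape returned by core's `metrics` HTTP command, and the file names are made up):

```python
import json

def close_timer(path):
    """Read the ledger.ledger.close timer from a stellar-core metrics dump.

    Assumes the file looks like {"metrics": {"ledger.ledger.close": {...}, ...}}.
    """
    with open(path) as f:
        return json.load(f)["metrics"]["ledger.ledger.close"]

for name, path in [("mixed", "mixed-metrics.json"),
                   ("classic", "classic-metrics.json")]:
    t = close_timer(path)
    # Durations are reported in the timer's duration_unit (ms here).
    print(f"{name}: count={t['count']} mean={t['mean']:.0f}ms max={t['max']:.0f}ms")
```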
Closes #142 and #106
This change adds a new mission, `MaxTPSMixed`, that performs a binary search to determine the max TPS under a mix of classic and Soroban traffic.

I tested these changes by scaling some things back (smaller network, lower timing intervals, etc.) and verifying that the binary search works as expected. The test is able to correctly detect that loadgen fails on TPS runs that are too high.
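The search itself is conceptually simple; roughly this, as a simplified sketch (Python pseudocode rather than the actual mission code, and `run_loadgen_at` is a stand-in for running the full mixed-mode loadgen at a given rate):

```python
def find_max_tps(run_loadgen_at, low=0, high=2000, tolerance=25):
    """Binary search for the highest TPS at which loadgen still succeeds.

    `run_loadgen_at(tps)` is assumed to run the mixed classic/Soroban load at
    the given rate and return True iff loadgen completes without errors.
    `tolerance` is the search resolution in TPS.
    """
    best = None
    while high - low > tolerance:
        mid = (low + high) // 2
        if run_loadgen_at(mid):
            best = mid   # this rate worked; search higher
            low = mid
        else:
            high = mid   # this rate failed; search lower
    return best
```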