stacks-network / stacks-core

The Stacks blockchain implementation
https://docs.stacks.co
GNU General Public License v3.0

Find Optimizations in the current Clarity VM via benchmarking current VM and Wasm #4316

Open saralab opened 5 months ago

saralab commented 5 months ago
### Tasks
1. [x] Generate new costs contract for interpreter and analyze what's changed
2. [ ] Profile the interpreter to see where the bottlenecks are
3. [ ] Open specific issues to further investigate results from 2 and optimize where possible
4. [ ] Generate costs contract based on clarity-wasm
5. [ ] Profile clarity-wasm to see where the bottlenecks are
6. [ ] Open specific issues to further investigate results from 5 and optimize where possible
jbencin commented 5 months ago

Summary of differences from costs-3.clar to costs-4.clar

Legend

| Symbol | Meaning |
|---|---|
| `+` | Minor cost increase |
| `++` | Cost increase |
| `+++` | Major cost increase |
| `-` | Minor cost decrease |
| `--` | Cost decrease |
| `---` | Major cost decrease |

Costs

Only the functions that have different cost values are reported here:

| Function | costs-3 | costs-4 | Interpretation |
|---|---|---|---|
| `cost_analysis_type_annotate` | `(runtime (linear n u1 u9))` | `(runtime (linear n u1 u10))` | `+` |
| `cost_analysis_type_check` | `(runtime (linear n u113 u1))` | `(runtime (linear n u115 u0))` | ? |
| `cost_analysis_list_items_check` | `(runtime (linear n u2 u4))` | `(runtime (linear n u2 u8))` | `+` |
| `cost_analysis_check_tuple_cons` | `(runtime (nlogn n u3 u5))` | `(runtime (nlogn n u3 u142))` | `+++` |
| `cost_analysis_tuple_items_check` | `(runtime (linear n u1 u28))` | `(runtime (linear n u2 u52))` | `++` |
| `cost_analysis_lookup_variable_const` | `(runtime u15)` | `(runtime u16)` | `+` |
| `cost_ast_cycle_detection` | `(runtime (linear n u141 u72))` | `(runtime (linear n u156 u23))` | ? |
| `cost_analysis_fetch_contract_entry` | `read_length: (linear n u1 u1)` | `read_count: (linear n u1 u1)` | Bug? read_count duplicated |
| `cost_lookup_variable_size` | `(runtime (linear n u2 u1))` | `(runtime (linear n u2 u18))` | `++` |
| `cost_user_function_application` | `(runtime (linear n u26 u5))` | `(runtime (linear n u27 u5))` | `+` |
| `cost_tuple_merge` | `(runtime (linear n u4 u408))` | `(runtime (linear n u5 u408))` | `+` |
| `cost_xor` | `(runtime (linear n u15 u129))` | `(runtime u139)` | `-` |
| `cost_eq` | `(runtime (linear n u7 u151))` | `(runtime (linear n u8 u151))` | `+` |
| `cost_sha256` | `(runtime (linear n u1 u100))` | `(runtime (linear n u1 u180))` | `++` |
| `cost_sha512` | `(runtime (linear n u1 u176))` | `(runtime (linear n u1 u216))` | `++` |
| `cost_as_max_len` | `(runtime u475)` | `(runtime u518)` | `+` |
| `cost_contract_of` | `(runtime u13400)` | `(runtime u19002)` | `++` |
| `cost_block_info` | `runtime: u6321` | `runtime: u7841` | `++` |
| `cost_stx_balance` | `runtime: u4294` | `runtime: u1712` | `--` |
| `cost_stx_transfer` | `runtime: u4640` | `runtime: u1672` | `--` |
| `cost_nft_owner` | `runtime: (linear n u9 u795)` | `runtime: (linear n u10 u741)` | ? |
| `cost_burn_block_info` | `runtime: u96479` | `runtime: u105877` | `+` |
| `cost_stx_account` | `runtime: u4654` | `runtime: u2021` | `--` |
| `cost_stx_transfer_memo` | `runtime: u4709` | `runtime: u1761` | `--` |
| `cost_bitwise_and` | `(runtime (linear n u15 u129))` | No output | |
| `cost_bitwise_or` | `(runtime (linear n u15 u129))` | No output | |
| `cost_bitwise_not` | `(runtime u147)` | No output | |
| `cost_bitwise_left_shift` | `(runtime u167)` | No output | |
| `cost_bitwise_right_shift` | `(runtime u167)` | No output | |

Analysis

Many (most?) costs have not changed between costs-3 and costs-4. For those that have changed, most show an increase in cost. A few, however, notably the stx_* functions dealing with STX accounts and balances, show a significant decrease.
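
For readers skimming the table above: each runtime entry is a formula over the input size `n`. The sketch below (illustrative Rust only, not the stacks-core cost code; constants taken from the table) shows how such entries are typically evaluated:

```rust
/// Illustrative only: how a Clarity cost expression such as
/// `(runtime (linear n a b))` is evaluated for an input of size `n`.
/// This is a sketch, not the stacks-core implementation.
enum CostExpr {
    Constant(u64),              // (runtime uK)
    Linear { a: u64, b: u64 },  // (runtime (linear n a b)) => a*n + b
    NLogN { a: u64, b: u64 },   // (runtime (nlogn n a b)) => a*n*log2(n) + b
}

impl CostExpr {
    fn runtime(&self, n: u64) -> u64 {
        match self {
            CostExpr::Constant(k) => *k,
            CostExpr::Linear { a, b } => a.saturating_mul(n).saturating_add(*b),
            CostExpr::NLogN { a, b } => {
                // log2 of 0 or 1 is treated as 0 here; the real code may round differently.
                let log_n = if n > 1 { 64 - (n - 1).leading_zeros() as u64 } else { 0 };
                a.saturating_mul(n).saturating_mul(log_n).saturating_add(*b)
            }
        }
    }
}

fn main() {
    // cost_analysis_check_tuple_cons went from (nlogn n u3 u5) to (nlogn n u3 u142):
    // only the constant term moved, so the relative impact shrinks as n grows.
    let tuple_cons_old = CostExpr::NLogN { a: 3, b: 5 };
    let tuple_cons_new = CostExpr::NLogN { a: 3, b: 142 };
    println!("tuple_cons n=16: old={} new={}", tuple_cons_old.runtime(16), tuple_cons_new.runtime(16));

    // cost_xor went from (linear n u15 u129) to a flat u139.
    let xor_old = CostExpr::Linear { a: 15, b: 129 };
    let xor_new = CostExpr::Constant(139);
    println!("xor n=2: old={} new={}", xor_old.runtime(2), xor_new.runtime(2));
}
```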

saralab commented 5 months ago

Tweaking the read limits to run another round of benchmarking

obycode commented 5 months ago

Here are my thoughts about changes to the mechanism for setting the cost limits in Nakamoto that I believe require some discussion.

In the previous cost-limit calculations, the goal is to determine what is a reasonable limit on the various cost components that would allow some reasonable hardware to process the block in 30 seconds. The 30 second time limit is chosen based on the probability that ~95% of bitcoin blocks will arrive in > 30 seconds. In practice, that means that a 20 minute block is limited to the cost budget of a 30 second block -- an overly cautious and pessimistic situation.

This difficulty in defining a limit to handle the majority of cases can be improved upon in Nakamoto due to the inclusion of the tenure extend functionality. Thus far, this tenure extend functionality has been discussed to refresh the block budget after 10 minutes, to allow for better handling of long block times (~37% of blocks are greater than 10 minutes). My proposal here is that we could get much better results if instead, the intention is to extend the tenure in smaller chunks of time, so that the budget can more closely match the actual time, allowing for the appropriate limitations such that slower nodes can keep up with the network, but without being overly cautious and penalizing the network for the ~95% of blocks that are > 30s.

The proposal then is to continue to choose the limits based on the 30s block time, but then issue tenure extensions more often. If tenure extensions were issued every 30s, matching the budget limits, we would end up in a case where slower nodes cannot keep up, but if they are issued every minute, this guarantees that the budget accomplishes its goal while also 10x-ing the budget for an average 10 minute block.

The other advantage of this strategy is that it further encourages miners to continue to mine blocks throughout their entire tenure. With the current strategy, a fast miner with a long bitcoin block could spend its entire budget very quickly, then be unable to mine any more blocks until 10 minutes has passed and it processes a tenure extension to refresh its budget. This is bad for transaction latency. With this proposed model, the fast miner gets a fresh budget every minute.

I don't believe this proposal should cause any new work or new problems, since we already planned to support tenure extensions, and the frequency of these extensions (1 minute) is still plenty high that it should not cause any problems with network congestion or other similar concerns.

The one downside to maintaining a block limit based on a 30 second block is that it limits the cost of individual transactions. For example, a budget based on 10 minute blocks would allow for a single transaction that used the entire budget to be 20x larger than a single transaction using the entire 30s budget. In practice, the 30 second budget I am proposing will already be larger than the current budget, and thus far, it has been feasible to structure all useful transactions such that they can fit into these current blocks.
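
To make the arithmetic behind this proposal concrete, here is an illustrative sketch (my own numbers and helper function, not anything in stacks-core) of how many budget windows a tenure would see at different extension intervals:

```rust
/// Illustrative sketch of the budget arithmetic in the proposal above.
/// Assumes each budget window is calibrated to a 30-second block and that a
/// tenure extension fully refreshes the budget. All numbers are placeholders.
fn budget_multiple(tenure_secs: u64, extension_interval_secs: u64) -> u64 {
    // Number of budget windows over the tenure, counting the initial budget.
    (tenure_secs + extension_interval_secs - 1) / extension_interval_secs
}

fn main() {
    let tenure_secs = 600; // an average ~10 minute bitcoin block time

    // Current approach: one 30s-calibrated budget for the whole tenure.
    println!("no extensions:     1x budget");
    // Proposed: extend every minute -> ~10x budget over a 10 minute tenure,
    // while each individual window still leaves slack for slower nodes.
    println!("extend every 60s: {:>2}x budget", budget_multiple(tenure_secs, 60));
    // Extending every 30s would match the calibration exactly, leaving no slack.
    println!("extend every 30s: {:>2}x budget", budget_multiple(tenure_secs, 30));
}
```

Under these assumptions, extensions every 60 seconds give roughly a 10x effective budget over an average 10 minute tenure while still leaving headroom for slower nodes, which is the trade-off described above.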

jcnelson commented 5 months ago

> In the previous cost-limit calculations, the goal is to determine what is a reasonable limit on the various cost components that would allow some reasonable hardware to process the block in 30 seconds. The 30 second time limit is chosen based on the probability that ~95% of bitcoin blocks will arrive in > 30 seconds. In practice, that means that a 20 minute block is limited to the cost budget of a 30 second block -- an overly cautious and pessimistic situation.

The other major concern is initial boot-up time. The time to process a block should be considerably faster (e.g. by orders of magnitude) than the time to mine it. Otherwise, it could take months or even years for a node to catch up with the chain tip.

> My proposal here is that we could get much better results if instead, the intention is to extend the tenure in smaller chunks of time, so that the budget can more closely match the actual time, allowing for the appropriate limitations such that slower nodes can keep up with the network, but without being overly cautious and penalizing the network for the ~95% of blocks that are > 30s.

> In practice, the 30 second budget I am proposing will already be larger than the current budget, and thus far, it has been feasible to structure all useful transactions such that they can fit into these current blocks.

If you make the tenure budget too small, then certain kinds of transactions that are valid today and have been mined today would cease to be mineable. There are some today that can take double-digit percentages of a block.

> The other advantage of this strategy is that it further encourages miners to continue to mine blocks throughout their entire tenure.

We can do this without issuing tenure-extensions and hitting the above problem: signers throttle miners based on the rate of resource consumption. Signers can track how much of each dimension the miners use, and if they exceed a rate of $dimension / 120 per 5 seconds, then signers delay signing the block in order to throttle down the consumption. This throttling logic can be done independent of tenure extensions, so we can continue to accommodate large transactions.
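
As a rough illustration of that rate-based throttling idea (hypothetical types and field names; the budget/120-per-5-seconds threshold is taken from the comment above, not from signer code):

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch of rate-based throttling as described above: signers
/// delay signing if a miner consumes more than (dimension_budget / 120) of a
/// cost dimension per 5-second window. Not actual signer code.
struct DimensionThrottle {
    budget: u64,        // full tenure budget for this dimension
    window: Duration,   // 5-second accounting window
    used_in_window: u64,
    window_start: Instant,
}

impl DimensionThrottle {
    fn new(budget: u64) -> Self {
        Self {
            budget,
            window: Duration::from_secs(5),
            used_in_window: 0,
            window_start: Instant::now(),
        }
    }

    /// Record consumption from a proposed block; returns true if the signer
    /// should delay its signature to throttle the miner.
    fn record(&mut self, consumed: u64) -> bool {
        if self.window_start.elapsed() >= self.window {
            self.used_in_window = 0;
            self.window_start = Instant::now();
        }
        self.used_in_window += consumed;
        let per_window_limit = self.budget / 120;
        self.used_in_window > per_window_limit
    }
}

fn main() {
    let mut runtime_throttle = DimensionThrottle::new(120_000);
    assert!(!runtime_throttle.record(600)); // under 120_000 / 120 = 1_000 per window
    assert!(runtime_throttle.record(600));  // cumulative 1_200 in the window -> throttle
    println!("throttle check ok");
}
```

A real implementation would track each cost dimension (runtime, read count/length, write count/length) separately and tie the window bookkeeping to block proposals rather than wall-clock polling.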

obycode commented 5 months ago

> The time to process a block should be considerably faster (e.g. by orders of magnitude) than the time to mine it.

There is overhead in selecting transactions to include in the block, but the actual execution of the transactions should be identical between mining and processing, no?

> If you make the tenure budget too small, then certain kinds of transactions that are valid today and have been mined today would cease to be mineable. There are some today that can take double-digit percentages of a block.

I addressed this in the last paragraph. The 30s budget that we choose would be greater than the current budget, so this should not be a problem. I am not suggesting shrinking it further, for example down to 5s limits, even though that might seem like a logical next step.

> We can do this without issuing tenure-extensions and hitting the above problem: signers throttle miners based on the rate of resource consumption. Signers can track how much of each dimension the miners use, and if they exceed a rate of $dimension / 120 per 5 seconds, then signers delay signing the block in order to throttle down the consumption. This throttling logic can be done independent of tenure extensions, so we can continue to accommodate large transactions.

I agree that we could do throttling in this way, but that requires separate logic in the signer, while my proposal doesn't require any additional logic beyond what is already needed for the 10 minute tenure extension. Is there another benefit to the approach you've described that makes it the better option?

cylewitruk commented 5 months ago

Regarding profiling, just as an FYI since this issue seems relatively generic - I'm currently working on some node db/storage/serialization benching as a result of performance issues I've come across in the a/b tester, particularly the need to replay chainstate multiple times during development, which takes far too long as it is now.

For anybody else doing profiling/benching: a lot of contention is created by the oracle-v1 contract's add-prices function (the heaviest repeatedly-called function I've run into so far), so it's a good candidate for any profiling/benchmarking that includes both heavy computation and I/O.

I've also got a branch with tracing implemented via custom layers, which I've been using to time and visualize the call chain a little more easily, if that's of interest to anybody. Example: (screenshot omitted)
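
If anyone wants a lighter-weight starting point than custom layers, the stock tracing-subscriber formatter can already emit per-span timings when spans close. A minimal sketch (my own example using the tracing and tracing-subscriber crates, not the branch mentioned above):

```rust
use tracing::instrument;
use tracing_subscriber::fmt::format::FmtSpan;

// Minimal sketch: emit timing information when instrumented spans close.
// `add_prices` here is just a stand-in for whatever workload is being profiled.
#[instrument]
fn add_prices(batch_size: usize) {
    // ... the contract-call workload being profiled ...
    std::thread::sleep(std::time::Duration::from_millis(batch_size as u64));
}

fn main() {
    tracing_subscriber::fmt()
        .with_span_events(FmtSpan::CLOSE) // logs time.busy / time.idle per span
        .init();

    add_prices(10);
}
```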

jbencin commented 4 months ago

ClarityDB read benchmarking results

These are the results of running the ClarityDB read benchmarks. We are trying to determine how long a database read actually takes, in order to determine the max read_count for a block. I will update this with additional results as they become available.

These tests were performed with the following version of the Stacks blockchain database from https://archive.hiro.so/:

2024-02-06T05:34:11.319Z        71.9 GB        mainnet-stacks-blockchain-2.4.0.0.4-20240206.tar.gz
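
For context on what a get_one-style read benchmark looks like structurally, here is a simplified Criterion sketch; it uses an in-memory SQLite table (via rusqlite) as a stand-in, whereas the real benchmarks read the chainstate database above and manage the filesystem cache, so absolute numbers are not comparable:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use rusqlite::Connection;

// Hypothetical sketch of a `get_one`-style benchmark against an in-memory
// SQLite table standing in for Clarity's key-value metadata.
fn bench_get_one(c: &mut Criterion) {
    let conn = Connection::open_in_memory().unwrap();
    conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value BLOB)", [])
        .unwrap();
    for i in 0..10_000 {
        conn.execute(
            "INSERT INTO kv (key, value) VALUES (?1, ?2)",
            rusqlite::params![format!("key-{i}"), vec![0u8; 64]],
        )
        .unwrap();
    }

    let mut i = 0u64;
    c.bench_function("get_one:sequential", |b| {
        b.iter(|| {
            i = (i + 1) % 10_000;
            let v: Vec<u8> = conn
                .query_row(
                    "SELECT value FROM kv WHERE key = ?1",
                    rusqlite::params![format!("key-{i}")],
                    |row| row.get(0),
                )
                .unwrap();
            v
        })
    });
}

criterion_group!(benches, bench_get_one);
criterion_main!(benches);
```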

Running on my laptop

I'm using a Lenovo ThinkPad X1 with Manjaro Linux. Key specs include:

Results with clearing filesystem cache once (after writes)

get_one:sequential      time:   [49.713 µs 49.844 µs 50.073 µs]
get_one:random          time:   [22.989 µs 23.821 µs 24.750 µs]

Results with clearing filesystem cache before each read

get_one:sequential      time:   [49.756 µs 50.253 µs 50.733 µs]
                        change: [+0.5648% +2.1573% +4.1136%] (p = 0.01 < 0.05)
                        Change within noise threshold.

get_one:random          time:   [22.745 µs 23.650 µs 24.640 µs]
                        change: [-8.3554% +4.4837% +17.821%] (p = 0.52 > 0.05)
                        No change in performance detected.

Running on GCP n2-standard-4

Results with clearing filesystem cache once (after writes)

get_one:sequential      time:   [142.64 µs 143.07 µs 143.51 µs]
get_one:random          time:   [76.410 µs 79.783 µs 84.516 µs]

Conclusions

jbencin commented 4 months ago

Some more ideas for optimization:

cylewitruk commented 4 months ago

I am going to start working on a PR tomorrow for updating the storage and retrieval of contracts + analysis. According to benchmarks, contract storage and read performance are greatly improved by storing contract data in a single table instead of as 5+ key-value entries. Also, using MessagePack serialization (instead of JSON) + LZ4 compression for the contract AST + analysis is almost twice as fast (and half the size) compared to the current solution.
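
For illustration, the encode/decode path being described could look roughly like this. This is a sketch only: `ContractData` is a made-up stand-in for the AST/analysis types, and the rmp-serde and lz4_flex crates are used here simply because they are pure-Rust options that can be sketched confidently; the PR may well use lzzzz or another binding.

```rust
use serde::{Deserialize, Serialize};

// Made-up stand-in for the contract AST/analysis payloads; the real types in
// clarity are far more complex. This only illustrates the encode/decode path.
#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct ContractData {
    ast_version: u32,
    expressions: Vec<String>,
}

// MessagePack-encode then LZ4-compress (the size is prepended for decompression).
fn store_blob(data: &ContractData) -> Vec<u8> {
    let packed = rmp_serde::to_vec(data).expect("serialization failed");
    lz4_flex::compress_prepend_size(&packed)
}

fn load_blob(blob: &[u8]) -> ContractData {
    let packed = lz4_flex::decompress_size_prepended(blob).expect("decompression failed");
    rmp_serde::from_slice(&packed).expect("deserialization failed")
}

fn main() {
    let original = ContractData {
        ast_version: 1,
        expressions: vec!["(define-read-only (f) u1)".into()],
    };
    let blob = store_blob(&original);
    assert_eq!(load_blob(&blob), original);
    println!("stored {} bytes", blob.len());
}
```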

jcnelson commented 4 months ago

Just FYI, messagepack is not 1.0 yet, so I'd shy away from using it. Things could break.

Also, I'd recommend lzma-rs over libzzz, since lzma-rs is a pure rust implementation of LZ4, instead of a thin wrapper around liblz4.

cylewitruk commented 4 months ago

> Just FYI, messagepack is not 1.0 yet, so I'd shy away from using it. Things could break.

I've used it in past projects with great success and the current spec has been stable for seven years. As long as we don't randomly upgrade the dependency between spec changes without requiring node resyncs, it will be fine :) The arguably "better" alternative would of course be protobuf, but I personally think that the added complexity with protobuf could be added in a later iteration if even more performance needed to be squeezed out. The really nice thing with rmp is that it's pure rust and has a nice serde integration, so the migration path for Clarity[De]Serializable's is really clean.

> Also, I'd recommend lzma-rs over libzzz, since lzma-rs is a pure rust implementation of LZ4, instead of a thin wrapper around liblz4.

I had actually missed that library when I looked at these benchmarks - but according to those, lzma-rs is considerably slower than lzzzz, which appears to be the fastest of them all (and has a really clean API as well). But I'll do a separate benchmark with the most popular crates and Clarity payloads.

I'll create two separate PR's (one for DB and one for serialization) so that the topics can be discussed separately :)

jcnelson commented 4 months ago

> The arguably "better" alternative would of course be protobuf

The arguably "better" alternative is to just bite the bullet and use StacksMessageCodec for all the things we currently use serde_json for. Not only is that pure, safe Rust, all of its dependencies are too (unlike rmp).

You may have had success in using rmp in your personal projects, and that's great. However, whatever codec we use, we'll have to support it for years on end. And, any bugs in it will likely be consensus bugs, and if blocks are accepted whose validation depends on those bugs' execution, we're stuck with having to maintain a bug-for-bug compatible version of rmp.

> I had actually missed that library when I looked at these benchmarks - but according to those, lzma-rs is considerably slower than lzzzz, which appears to be the fastest of them all (and has a really clean API as well). But I'll do a separate benchmark with the most popular crates and Clarity payloads.

When it comes to blockchains -- that is, systems which not only manage other peoples' money, but also offer exactly zero recourse in the event of a remotely-triggered memory corruption bug -- safety is infinitely more important than speed. CVEs like this are a very compelling reason to stay far, far away from packages that link lz4

kantai commented 4 months ago

> The arguably "better" alternative is to just bite the bullet and use StacksMessageCodec for all the things we currently use serde_json for. Not only is that pure, safe Rust, all of its dependencies are too (unlike rmp).

I disagree. I think StacksMessageCodec is already overused in the repository and codebase. I believe that consensus_serialize and consensus_deserialize should only be used and defined for consensus-critical serializations. If those consensus-critical serializations overlap with non-consensus-critical serializations, that's fine, but it should implement a second trait or whatever.

I'm not just trying to be pedantic here, either. I think it's actually important to distinguish between serializations that are consensus-critical and those that are not, because they have genuinely different requirements. Consensus-critical serializations cannot change between versions without forking logic (and of course, a fork), so their implementations must be made cognizant of that. Non-consensus-critical serializations do not require this property (or at least, do not have anywhere near the same level of consequences for this property). They have other important properties, like memory safety, but those are actually different properties, and I think this could lead to other implementations that prioritize speed, overhead, and maintainability. Concerns about protobuf (or whatever else) may be valid with respect to consensus-critical serializations, but totally invalid when applied to non-consensus-critical serializations. Because it's impossible to easily distinguish which StacksMessageCodec implementations are consensus-critical and which aren't, it makes improvements there very hard, and I think it also confuses the whole topic.
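
A minimal sketch of the kind of split being argued for (the trait names here are hypothetical, not existing stacks-core traits):

```rust
// Hypothetical trait split illustrating the distinction above; these are not
// existing stacks-core traits.

/// Byte representation that all nodes must agree on; changing it is a fork.
trait ConsensusSerialize {
    fn consensus_serialize(&self) -> Vec<u8>;
    fn consensus_deserialize(bytes: &[u8]) -> Result<Self, String>
    where
        Self: Sized;
}

/// Node-local storage representation; may change between versions as long as
/// the value can be re-hydrated to its consensus-critical form.
trait StorageSerialize {
    fn storage_serialize(&self) -> Vec<u8>;
    fn storage_deserialize(bytes: &[u8]) -> Result<Self, String>
    where
        Self: Sized;
}

// A type can implement both; the storage format is free to be, say,
// MessagePack + compression, while the consensus format stays fixed.
struct ContractAnalysisExample {
    type_signature: String,
}

impl StorageSerialize for ContractAnalysisExample {
    fn storage_serialize(&self) -> Vec<u8> {
        self.type_signature.as_bytes().to_vec()
    }
    fn storage_deserialize(bytes: &[u8]) -> Result<Self, String> {
        Ok(Self {
            type_signature: String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())?,
        })
    }
}

fn main() {
    let a = ContractAnalysisExample { type_signature: "(define-read-only (f) uint)".into() };
    let bytes = a.storage_serialize();
    let back = ContractAnalysisExample::storage_deserialize(&bytes).unwrap();
    assert_eq!(back.type_signature, a.type_signature);
}
```

The key point is that only the consensus trait's byte layout is frozen; the storage trait can evolve on a node-by-node basis.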

jcnelson commented 4 months ago

Sure, those are all fair points, and I agree on the top-level bit that we should distinguish between data structures that have consensus-critical byte representations (begetting consensus-critical codecs) and those that do not. I'd further differentiate between structures whose non-consensus-critical byte representations can be upgraded on a node-by-node basis (e.g. the DB representation) versus those which still require a network upgrade to roll out (e.g. p2p message packet representations). The point I was trying to make was less about forcing everything to use StacksMessageCodec and more about preferring it to codec libraries that are unsafe or not yet declared stable.

Is there a discussion somewhere on which structures can have non-consensus-critical DB representations in Clarity?

cylewitruk commented 4 months ago

> The arguably "better" alternative is to just bite the bullet and use StacksMessageCodec for all the things we currently use serde_json for.

I'm (obviously) in agreement with @kantai here. It would take a lot of (unnecessary) effort to write a correct and bug-free (de)serialization implementation using StacksMessageCodec for the contract AST and analysis structures given their complexity.

Regarding consensus-critical serialization, however, it frankly surprises me that the project isn't using a serialization framework like Protobuf, which uses well-defined, versioned schemas and generated message contracts (.rs) that you can review, compare, and include in the project. Not to mention clear strategies for handling backwards-/forwards-compatibility and language portability. That would have saved a good deal of complexity, effort, and potential for hand-written serialization bugs... :man_shrugging:

> You may have had success in using rmp in your personal projects, and that's great. However, whatever codec we use, we'll have to support it for years on end. And, any bugs in it will likely be consensus bugs, and if blocks are accepted whose validation depends on those bugs' execution, we're stuck with having to maintain a bug-for-bug compatible version of rmp.

I didn't say that I've used MessagePack in personal projects - we've used it in mission-critical, event-driven applications processing millions of transactions, worth ~$1bn, per annum without [serialization] issues, which is one of the reasons it's at the top of my list. If you research it you will also find that it's one of the most common binary serialization formats, alongside BSON, Avro, Protobuf, etc. As @kantai points out, I'm talking about a non-consensus-critical code path. As long as consensus-critical data can be re-hydrated to its consensus-critical representation, how it's persisted is a node implementation detail.

Why I advocate for MessagePack over Protobuf for this path is that I don't believe the benefits outweigh the added overhead of maintaining schemas and code-generation when it's not strictly needed - unless we would need to save on every byte and CPU cycle we can.

> When it comes to blockchains -- that is, systems which not only manage other peoples' money, but also offer exactly zero recourse in the event of a remotely-triggered memory corruption bug -- safety is infinitely more important than speed. CVEs like this are a very compelling reason to stay far, far away from packages that link lz4

I'm aware of the security concerns and constraints of a blockchain - I'm just more accustomed to such things being handled using process and objective risk assessments. I have difficulty accepting past CVEs as a blanket argument because, well, it feels slightly hypocritical given the project's current dependencies... (just a quick search):

... and not a reliable predictor of the future.

There are always going to be more bugs and more CVEs, even close to home. I've seen a number of critical home-grown bugs in the last year requiring node patching and/or forks; I can't say the same for issues with dependencies, but please correct me if I'm mistaken?

When it comes to lz4, the CVE linked would not have been a risk factor as it requires a crafted file to be passed to the decompression method, which isn't possible as the compressed "file" is created by the node itself. Unless you had access to the node's database of course - but then you've got bigger issues.

I'm not trying to say that new dependencies should just be allowed in "hey-ho" :cowboy_hat_face: - I'm arguing for objective risk management.

> Is there a discussion somewhere on which structures can have non-consensus-critical DB representations in Clarity?

Not that I've seen, but in general anything stored in any of the SQLite databases can be physically represented in any format, given the constraint that it can be re-hydrated to its consensus-critical format (right now I'm simply working from the assumption that whatever goes into the persistence layer must come out the same way, to keep changes minimal). The primary two contenders are the contract AST and contract analysis, which are JSON-serialized and stored directly in SQLite in Clarity's metadata_table today.
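
That round-trip constraint is straightforward to pin down in code; a sketch with a trivial stand-in codec and made-up struct (the real structures and whatever encoding the PR settles on would differ, but the invariant is the same):

```rust
// Sketch of the invariant described above: whatever encoding the persistence
// layer uses, decode(encode(x)) must equal x. serde_json is used here purely
// as a trivial stand-in codec to keep the sketch self-contained.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, PartialEq, Debug, Clone)]
struct StoredContractMeta {
    contract_id: String,
    analysis_blob: Vec<u8>,
}

fn encode(meta: &StoredContractMeta) -> Vec<u8> {
    serde_json::to_vec(meta).expect("encode failed")
}

fn decode(bytes: &[u8]) -> StoredContractMeta {
    serde_json::from_slice(bytes).expect("decode failed")
}

fn main() {
    let original = StoredContractMeta {
        contract_id: "SP000000000000000000002Q6VF78.pox".into(),
        analysis_blob: vec![1, 2, 3],
    };
    // What goes into the persistence layer must come out the same way.
    assert_eq!(decode(&encode(&original)), original);
    println!("round-trip ok");
}
```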

Anyway, I feel like we've kind-of hijacked this issue... should we move it to a discussion thread instead?

wileyj commented 4 months ago

> > The arguably "better" alternative is to just bite the bullet and use StacksMessageCodec for all the things we currently use serde_json for.
>
> I'm (obviously) in agreement with @kantai here. It would take a lot of (unnecessary) effort to write a correct and bug-free (de)serialization implementation using StacksMessageCodec for the contract AST and analysis structures given their complexity.
>
> Regarding consensus-critical serialization, however, it frankly surprises me that the project isn't using a serialization framework like Protobuf, which uses well-defined, versioned schemas and generated message contracts (.rs) that you can review, compare, and include in the project. Not to mention clear strategies for handling backwards-/forwards-compatibility and language portability. That would have saved a good deal of complexity, effort, and potential for hand-written serialization bugs... 🤷‍♂️
>
> > You may have had success in using rmp in your personal projects, and that's great. However, whatever codec we use, we'll have to support it for years on end. And, any bugs in it will likely be consensus bugs, and if blocks are accepted whose validation depends on those bugs' execution, we're stuck with having to maintain a bug-for-bug compatible version of rmp.
>
> I didn't say that I've used MessagePack in personal projects - we've used it in mission-critical, event-driven applications processing millions of transactions, worth ~$1bn, per annum without [serialization] issues, which is one of the reasons it's at the top of my list. If you research it you will also find that it's one of the most common binary serialization formats, alongside BSON, Avro, Protobuf, etc. As @kantai points out, I'm talking about a non-consensus-critical code path. As long as consensus-critical data can be re-hydrated to its consensus-critical representation, how it's persisted is a node implementation detail.

That's fair, but I think it's important to consider the blast radius of such changes - in a distributed system, a minor bug can (and does, as we've seen in the recent past) have an extremely large blast radius that affects nearly everyone using the chain. I think the unknown here is separating between consensus-critical and non-consensus-critical, and whether that can be done reliably and predictably (particularly when adding new functions or refactoring/maintaining existing code). It's possible, but my concern is missing something in that refactor that is only apparent once it's released and thus much harder to fix.

> Why I advocate for MessagePack over Protobuf for this path is that I don't believe the benefits outweigh the added overhead of maintaining schemas and code-generation when it's not strictly needed - unless we would need to save on every byte and CPU cycle we can.
>
> > When it comes to blockchains -- that is, systems which not only manage other peoples' money, but also offer exactly zero recourse in the event of a remotely-triggered memory corruption bug -- safety is infinitely more important than speed. CVEs like this are a very compelling reason to stay far, far away from packages that link lz4
>
> I'm aware of the security concerns and constraints of a blockchain - I'm just more accustomed to such things being handled using process and objective risk assessments. I have difficulty accepting past CVEs as a blanket argument because, well, it feels slightly hypocritical given the project's current dependencies... (just a quick search):
>
> ... and not a reliable predictor of the future.

I'll admit I'm not too familiar with the differences between the packages you're talking about here - but one thing that I have seen in the past is that using an external package definitely comes with increased risk, and it's not entirely clear what the benefit is (perhaps you can share more about the expected benefits in a new discussion?). Off the top of my head, the fewer dependencies we rely on, the better from a security/reliability standpoint (even a versioned dependency can be better than using upstream; re: the versioned bitcoin crate we use today had a breaking change not too long ago) - we're essentially trusting that upstream maintainers have a rigid process in place for ensuring a release doesn't break/change existing and expected functionality. Additionally, there is the issue of supply-chain attacks (as we've seen recently with npm packages). I don't think anyone is arguing that we shouldn't use a dependency because it has had a CVE - but if I understand your intent, I think it's more about managing the risk (please correct me if I'm incorrect here).

> There are always going to be more bugs and more CVEs, even close to home. I've seen a number of critical home-grown bugs in the last year requiring node patching and/or forks; I can't say the same for issues with dependencies, but please correct me if I'm mistaken?
>
> When it comes to lz4, the CVE linked would not have been a risk factor as it requires a crafted file to be passed to the decompression method, which isn't possible as the compressed "file" is created by the node itself. Unless you had access to the node's database of course - but then you've got bigger issues.
>
> I'm not trying to say that new dependencies should just be allowed in "hey-ho" 🤠 - I'm arguing for objective risk management.

This is, I think, the most important aspect of what you're proposing - how would objective risk management work for changes like the ones described here? Is that still unknown today, or maybe you have some ideas you're saving for a discussion post?

> > Is there a discussion somewhere on which structures can have non-consensus-critical DB representations in Clarity?
>
> Not that I've seen, but in general anything stored in any of the SQLite databases can be physically represented in any format, given the constraint that it can be re-hydrated to its consensus-critical format (right now I'm simply working from the assumption that whatever goes into the persistence layer must come out the same way, to keep changes minimal). The primary two contenders are the contract AST and contract analysis, which are JSON-serialized and stored directly in SQLite in Clarity's metadata_table today.
>
> Anyway, I feel like we've kind-of hijacked this issue... should we move it to a discussion thread instead?

Please! I think this is a good discussion to have, and there are a lot of viewpoints that we should consider, versus the limited visibility in this specific issue.

I'm particularly interested in the risk management you've mentioned, and whether there are any ideas on how that could be accomplished for a decentralized project. I'm glad you added the line "I'm not trying to say that new dependencies should just be allowed in 'hey-ho'" as well - it's important to recognize that there is still a lot of work that needs to be done around clarity/clarity-wasm/nakamoto/sbtc that should have our priority over the next few months, and I'm hesitant to start talking about potentially breaking changes while that work is ongoing (and is dependent on specific operations working as we expect them to today). There is a plan for future work to refactor the entire codebase into a more modular approach - would it be sufficient to start a discussion on these ideas, with the expectation that we shouldn't consider making any changes until that refactoring work has started? I can see you have strong opinions about what you're proposing; I just want to be careful that we're not distracting from the prioritized work that needs to be done over the next few months.

So, I would recommend that we move this to a discussion (so we no longer have this issue hijacked), with the expectation that there may be little action until after the 2nd proposed hard fork later this year.

cylewitruk commented 4 months ago

To avoid further contaminating this issue, I'll summarize all of the above in a separate GitHub discussion when I have some time and we can continue there 😊

jbencin commented 4 months ago

Performance Update

Here is the total effect on block processing time of all the optimization PRs I've done so far (including #4408, which is not merged yet; please review!):

❯ hyperfine -w 3 -r 20 "stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000" "stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000"
Benchmark 1: stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):     12.164 s ±  0.049 s    [User: 11.641 s, System: 0.472 s]
  Range (min … max):   12.102 s … 12.324 s    20 runs

Benchmark 2: stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):      8.789 s ±  0.035 s    [User: 8.309 s, System: 0.434 s]
  Range (min … max):    8.728 s …  8.901 s    20 runs

Summary
  stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000 ran
    1.38 ± 0.01 times faster than stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000

This benchmark shows a 38% improvement in processing speed over an arbitrary range of blocks. Since the build changes had by far the largest effect, we will probably see significant improvements in other areas, like API performance, as well.

cylewitruk commented 4 months ago

> This benchmark shows a 38% improvement in processing speed over an arbitrary range of blocks. Since the build changes had by far the largest effect, we will probably see significant improvements in other areas, like API performance, as well.

Could you try this over, say, 1000 blocks, and ideally closer to block 135k? Might take a couple hours, but it'll be a much better sample and thus indication of the performance gains. I think 135k should cover the introduction of ordinals...

jbencin commented 4 months ago

> Could you try this over, say, 1000 blocks

Eventually, but that would take a long time. I've been using a benchmark that is reasonably fast so that I can quickly test my changes

> I think 135k should cover the introduction of ordinals

I will try this on the most recent blocks now that consensus bugs in next are fixed

jbencin commented 4 months ago

@cylewitruk here's the performance on block range 145,000-145,050 with all current PRs (everything used in the previous comment, plus #4425):

❯ hyperfine -w 3 -r 10 "stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 145000 145050" "stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 145000 145050"
Benchmark 1: stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 145000 145050
  Time (mean ± σ):     253.175 s ±  1.004 s    [User: 245.166 s, System: 6.936 s]
  Range (min … max):   250.886 s … 254.416 s    10 runs

Benchmark 2: stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 145000 145050
  Time (mean ± σ):     178.373 s ±  0.311 s    [User: 170.661 s, System: 6.766 s]
  Range (min … max):   178.091 s … 178.977 s    10 runs

Summary
  stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 145000 145050 ran
    1.42 ± 0.01 times faster than stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 145000 145050

This is almost exactly what I see with these binaries on the previous block range (99,990-100,000):

❯ hyperfine -w 3 -r 10 "stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000" "stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000"
Benchmark 1: stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):     12.160 s ±  0.030 s    [User: 11.623 s, System: 0.490 s]
  Range (min … max):   12.106 s … 12.200 s    10 runs

Benchmark 2: stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):      8.594 s ±  0.009 s    [User: 8.110 s, System: 0.441 s]
  Range (min … max):    8.582 s …  8.606 s    10 runs

Summary
  stacks-core/target/release/stacks-inspect.all-optimizations replay-block /home/jbencin/data/next/ range 99990 100000 ran
    1.41 ± 0.00 times faster than stacks-core/target/release/stacks-inspect.7e5146511 replay-block /home/jbencin/data/next/ range 99990 100000

So we've got some nice performance gains here that are consistent across modern Stacks blocks (as opposed to the much smaller gains we saw on the 0-3,000 range).

cylewitruk commented 4 months ago

> So we've got some nice performance gains here that are consistent across modern Stacks blocks (as opposed to the much smaller gains we saw on the 0-3,000 range).

Very nice! Promising! :smile: I still think it would be wise to run a larger block range overnight, simply because 10/50 blocks (~1½ / ~8 hours) are statistically small sample windows - those block ranges could be (I'm not saying they are) coincidentally unrepresentative, so it would be good to get averages over 24-48 hours (150-300 blocks) or more, so that we know we're covering multiple time zones, days of the week, etc., i.e. different behavioral patterns.

jbencin commented 3 months ago

I generated a new costs-4.clar contract which includes the recent optimization work (as of commit 80adc2762). See the diff here. Lots of costs went down, but some went up. Can anyone (maybe @kantai or @obycode) help interpret this and figure out what these mean?