neo-project / neo

NEO Smart Economy
MIT License
3.47k stars 1.03k forks source link

Towards an Execution Time Deterministic Virtual Machine #3535

Open Jim8y opened 3 days ago

Jim8y commented 3 days ago

Introduction

The Neo Virtual Machine (NeoVM) is a strongly-typed, stack-based virtual machine that operates on typed StackItems. However, its current pricing model, based on fixed-cost opcodes, fails to accurately represent the actual processing costs of operations. This is because the execution time of an operation can vary significantly depending on the size and complexity of the StackItems involved. For instance, operations on compound types with numerous subitems or large StackItems can consume substantially more computational resources than operations on simple types or smaller StackItems. Despite these differences in resource consumption, all operations are currently priced equally under the fixed opcode model. This discrepancy between the fixed price and the actual computational cost can lead to inefficient resource allocation and potential exploitation of the system.

This discrepancy makes it impossible to accurately determine the execution cost of the virtual machine. Moreover, NeoVM includes a reference-counting garbage collection (GC) mechanism, whose computational cost can only be determined at runtime, further exacerbating the uncertainty of the virtual machine's execution cost.

Due to this uncertainty, we cannot precisely determine the exact execution time of transactions in a Neo block, and therefore cannot set a definitive maximum execution time for Neo blocks to ensure consensus stability.

Furthermore, because operations with vastly different costs have the same price, attackers can easily construct scripts that execute for a long time with very small fees, thereby launching Denial of Service (DoS) attacks on the Neo network with the aim of paralyzing it.

In this proposal, we present two protocols to address these issues:

  1. Dynamic Opcode Pricing Protocol (DOPP): Adjust fees based on the exact computational cost of complex opcode operations, which could depend on factors such as the number of subitems and the size of the StackItem being processed.
  2. Simple Object Management Protocol (SOMP): Simplify object management by removing the GC and explicitly tracking the lifecycle of all objects in the virtual machine, including cyclically referenced objects.

Problem Definition

Unfair Fixed-Price Opcode Model

  1. Inefficient Resource Allocation:

    • Processing compound types with varying numbers of subitems consumes different amounts of computational resources (up to thousands of times difference) yet incurs the same fee.
    • Large StackItems require more processing time and resources but are charged the same as smaller ones.
  2. Potential for Exploitation:

    • Malicious actors could exploit this system by creating transactions with large compound types or buffers, consuming disproportionate resources without corresponding fee increases.
  3. Discouraged Use of Efficient Data Structures:

    • Developers might avoid using larger, more efficient data structures due to the fixed-price model, potentially leading to suboptimal smart contract designs.

Unpredictable and Unnecessary Garbage Collection (GC)

  1. Unpredictable GC:

    • GC is dynamically triggered during transaction execution, making it impossible to predict its behavior in advance. The execution cost is entirely determined by the stack object relationships within the execution engine at the time of GC triggering. This cost can vary greatly, and we cannot predict it beforehand. Even if we could, based on the existing architecture, we cannot restrict its execution as it might interfere with the normal execution of contracts.
    • Attackers could construct scripts that deliberately trigger complex GC processes frequently, leading to prolonged transaction execution times, blocking consensus, and effectively performing a Denial of Service (DoS) attack.
  2. Unnecessary GC:

    • The NeoVM is designed to be lightweight and is expected to have a short lifetime and execution time. It should not be expected to create and manage many objects, yet the GC system is expensive and has high execution costs, being designed for heavier tasks and programs.
    • Currently, the main purpose of GC in NeoVM is to release objects with circular references, allowing for accurate tracking of the overall reference count of the execution engine (used to track and control the current live VM objects to prevent memory overflow). However, we could ignore circular references and instead avoid them at the compiler level. All objects would be destroyed after the execution engine is terminated, preventing memory overflow.

Case Studies

Benchmark Samples

Compound Type Operations with GC

Item Count NEWARRAY UNPACK CLEAR
32 3.958 μs 5.960 μs 16.731 μs
128 5.721 μs 10.803 μs 48.562 μs
1,024 26.086 μs 190.762 μs 505.134 μs
2,040 32.465 μs 496.604 μs 939.693 μs

Compound Type Operations without GC

Item Count NEWARRAY UNPACK CLEAR
32 3.830 μs 4.735 μs 4.373 μs
128 5.643 μs 12.570 μs 9.457 μs
1,024 25.644 μs 29.246 μs 35.977 μs
2,040 31.522 μs 59.574 μs 81.830 μs

Analysis:

Overall, removing GC shows a clear performance benefit, especially for operations that involve creating, manipulating, or removing large amounts of StackItems from the stack. The impact is most pronounced for larger collections, where the performance improvement can be up to 93% faster for certain operations.

Protocols of the Proposal

We propose achieving an execution time deterministic VM by applying the following two protocols to the Neo Virtual Machine:

  1. Dynamic Opcode Pricing Protocol (DOPP):

    The execution cost of opcode operations varies greatly with the number of subitems and the size of the StackItem being processed. Fixed-price opcodes cannot reflect the actual computational cost. Therefore, we should adjust opcode fees based on the exact computing cost, considering all relevant factors.

    By adjusting fees based on the precise computational cost of complex opcode operations, we can:

    • Ensure Fair Pricing: Fees will more accurately reflect the actual computational resources consumed.
    • Improve Resource Allocation: Discourage unnecessarily large data structures, leading to overall improved network performance.
    • Enhance Security: Mitigate potential exploits related to processing large data structures, reducing the potential for resource exploitation.
    • Encourage Efficient Contract Design: Provide developers with incentives to optimize their data structures and operations.
    • Maintain Platform Competitiveness: Keep Neo at the forefront of blockchain technology through the implementation of more granular and fair pricing mechanisms.
  2. Simple Object Management Protocol (SOMP):

    We propose simplifying object management by removing the reference-counting GC and instead explicitly tracking the lifecycle of all objects in the virtual machine. This approach enhances performance and clarity by eliminating the overhead of reference counting and simplifying the memory management process.

    By removing the GC, we can simplify the VM architecture and potentially improve overall performance.

Implementation Strategy

We propose modifying the current fee structure for operations involving compound types and buffers. The new system will implement a base fee plus a variable component that scales with the number of subitems and/or the StackItem size. However, considering the complexity of calculating the actual fee for operations involving many subitems with varying sizes, we propose the following approach:

  1. Base Fee: Maintain a reduced base fee for affected complex opcodes.
  2. Variable Component: Implement a scaling factor based on: a) The number of subitems b) A default maximum StackItem size

This approach balances accuracy with computational efficiency:

This method provides a reasonable approximation of the actual computational cost while avoiding the heavy overhead associated with individually pricing each subitem. It maintains the benefits of dynamic pricing while keeping the fee calculation process efficient and deterministic.

Key Components:

  1. Base Fee Reduction: Lower the base fee for affected complex opcodes to accommodate the new variable component.
  2. Variable Fee Component: Implement a scaling factor based on subitem count or buffer size.
  3. Fee Calculation Function: Develop a function to compute the total fee based on the base fee and the variable component.

Tasks:

  1. OpCode Benchmark: Benchmark the execution cost of OpCode operations, which is not a system call or native contract call, for different StackItem sizes and subitem counts.
  2. Native Contract Benchmark: Benchmark the execution cost of native contracts with different parameters.
  3. System Call Benchmark: Benchmark the execution cost of system calls with different parameters.
  4. OpCode Price Ratio: Determine the price ratio of OpCode operations based on the benchmark results.
  5. New Object Management: Remove the GC and explicitly track the lifecycle of all objects in the virtual machine.

Backward Compatibility

Implementing these protocols will require a hard fork to upgrade the VM implementation. This means that all nodes in the network will need to update their software to support the new protocols. And smart contracts that depend on precise gas fees for execution will be affected and may need to be updated.

Drawbacks and Risks

  1. Partial Mitigation of DoS Threats: This proposal can only alleviate the DoS threat in the VM caused by complex opcode operations; it cannot solve all DoS threats in the Neo system.

  2. Increased Computational Overhead: Calculating the fees requires additional computational resources, which could impact the performance of the NeoVM.

  3. Potential New Attack Vectors: Without careful implementation, the changes might expose new DoS attack surfaces. And we removed the GC that handles cyclic references, so we must ensure that cyclic references do not cause problems.

  4. Impact on Existing Smart Contracts: Smart contracts that depend on precise gas fees for execution will be affected and may need to be updated.

  5. Transition Challenges: The transition period might be challenging as developers and users adapt to the new fee structure.

Expected Impact

  1. Predictable Execution Time: Execution time can be predicted, and fees can be adjusted in real-time. With the same amount of gas, the VM will consistently execute for approximately the same amount of time.

  2. Consensus Stability: We can set an upper bound for execution time by limiting the maximum gas fee in transactions or blocks, enhancing consensus stability.

  3. Reduced Transaction Fees for Users: Ordinary users will likely experience lower transaction fees because the base fee for opcode operations will be reduced, while attack costs will increase significantly, making malicious activities uneconomical.

  4. Simplified VM Architecture: Removing the GC will simplify the VM architecture and potentially improve overall performance.

  5. Incentivized Efficient Contract Development: Developers will be encouraged to write more efficient smart contracts, leading to better overall network performance.

Further Considerations

We can also use an increase-only counter for references or an increase-only counter for StackItem creation. With a higher counter value that is above a threshold, the opcode price will be higher, which increases the cost of attacks while normal contract execution remains unaffected.

Conclusion

The Neo Virtual Machine currently faces two critical issues: pricing discrepancies in OpCode operations and expensive garbage collection. The fixed-price OpCode model fails to accurately reflect the computational costs of complex operations, while the reference-counting garbage collection system introduces unpredictability and unnecessary overhead.

To address these challenges, we propose two protocols:

  1. Dynamic Opcode Pricing Protocol (DOPP): Adjusts fees based on the exact computational cost of complex OpCode operations.
  2. Simple Object Management Protocol (SOMP): Simplifies object management by removing garbage collection and explicitly tracking object lifecycles.

These protocols are expected to:

By implementing these changes, Neo aims to offer a more efficient, secure, and developer-friendly blockchain platform, maintaining its competitive edge in the evolving blockchain landscape.

roman-khimov commented 3 days ago

What's new here except duplicating #3510 and #3517? We need something real, not ChatGPT.

Jim8y commented 3 days ago

What's new here except duplicating #3510 and #3517? We need something real, not ChatGPT.

chatgpt polished, yes. major content from 3510 and 3517, yes. something real? not sure. are chatgpt being banned to polish issue and proposal? never heard of. why am i creating a new issue that combine 3510 and 3517? its requested by TC to make them into one.

Jim8y commented 3 days ago

@roman-khimov from the very begining until the last core meeting, ive kept saying i admire your ability of writing fancy documents, i still do. i can never make my idea into 10 page documents, that is why i need chatgpt to polish my idea. you dont like chatgpt? its ok, we still can discuss the idea in core meeting, i can explain it brifly.

roman-khimov commented 3 days ago

chatgpt polished

I wouldn't call this polishing. Unfortunately, GPT-style texts are very low quality and the main problem is exactly that --- lots of words that don't bring any new ideas. Which is disrespectful to the reader when being posted as originally written. We already have #3510 and #3517, they're different (albeit related) problems, we can discuss details (real details, not GPT nonsense) of each in respective issues. This one just wastes our time. TC included. Because TC needs some real intro into #3510, not this wall of text. And TC may not care much about #3517 which is just an optimization. If it does, then it's a separate matter anyway.

i can explain it brifly

There is nothing to explain at this point. We know #3510/#3517. We need more of raw data and exact calculations for #3510. We have #3526 (sorry, I've not delved into it, I can just summarize my view of this --- copy/paste from NeoGo and we're good to go). We don't need #3535, it has no value.

vncoelho commented 3 days ago

I am in line with @roman-khimov, it is a little disrespect to the reader. That is why, recently, I am not reviewing papers generated by AI. And, personally, I think that this should strictly be explained in scientific papers, as well as here.

Jim8y commented 3 days ago

well, truth is i wrote everything, then use it to rephrase the grammer issue, low quality in others may cause of gpt itself, but low quality i wouls say its becaues of me here, a reason i admir you who can write excelent document. totally use gpt to write something is disrespectful, indeed, i agree with that, and that is also something i wont do.

Jim8y commented 3 days ago

but, on the other hand, if anypne would love to take my place and rewrite all the proposal in a respecting way, i appreciate that. cause i am not good at this and dont think i can do better than gpt, yet its annoying to "polish or however you think it is" with gpt. and i do have some other proposals to write, can someone write that for me?

shargon commented 3 days ago

3517 is coming

Jim8y commented 3 days ago

I am in line with @roman-khimov, it is a little disrespect to the reader. That is why, recently, I am not reviewing papers generated by AI. And, personally, I think that this should strictly be explained in scientific papers, as well as here.

@vncoelho am not trying to behave disrespect, i believe you also dont think it that way. i also reviewed a lot of papers, i know how annoying to review bad phased paper, but here we should not use that standard, cause english is not a requirement here, we will deal with people whose english are bad and need translators to communite, hehavor itself should not be considered disrespectful unless its intention it to disrespect.

Jim8y commented 3 days ago

BTW, we are a team, this proposal is required by TC from us, not just from me, if you truly think i am not doing it correct, please help me to have another version of yours.

vncoelho commented 3 days ago

but, on the other hand, if anypne would love to take my place and rewrite all the proposal in a respecting way, i appreciate that. cause i am not good at this and dont think i can do better than gpt, yet its annoying to "polish or however you think it is" with gpt. and i do have some other proposals to write, can someone write that for me?

I did not read yet, I just mentioned about @roman-khimov comment.

vncoelho commented 3 days ago

I am starting reading the text.

The introduction looks good to me. However, I would like to emphasize the following:

Moreover, NeoVM includes a reference-counting garbage collection (GC) mechanism, whose computational cost can only be determined at runtime, further exacerbating the uncertainty of the virtual machine's execution cost.

I am not in fully agreement with this. In my point of view, GC is not something to be considered in the price. This is a implementation (execution client) cost and it is not agnostic to implementation. It indeed differs from implementations. In principle, it should be refreshed and equipment should be dimensioned to handle this. This is more like a requirement to me than something related to the Virtual Machine operational costs.

I will described in other comments regarding what is my vision on costs, but in the past the costs mostly looks like to me as a combination between execution time and storage cost. If an operation requires huge temporary memory it is a another history to me and it will be reflected in consumed time.

So, in summary, the next paragraph of the introduction I am not in fully agreement.

Due to this uncertainty, we cannot precisely determine the exact execution time of transactions in a Neo block, and therefore cannot set a definitive maximum execution time for Neo blocks to ensure consensus stability.

vncoelho commented 3 days ago

Either DOPP or SOMP appears to be fine to me. SOMP looks like to be easier to be defined. While, as you are defending DOPP is more complex to define. However, I believe that these costs can be defined and published frequently by operators. For example, in DOPP we could have a smart contract where CNs send their processing time of opcodes from epoch to epoch. The average value becomes the opcode price.

vncoelho commented 3 days ago

1 - The topic Unfair Fixed-Price Opcode Model looks good to me and precise.

2 - Topic Unpredictable and Unnecessary Garbage Collection (GC) needs some polish to me. I am not in agreement with all its challenges and conclusions derived from it.

Let's go to Analysis.

vncoelho commented 3 days ago

Protocols of the Proposal

  1. DOPP: Fine to me and fair description, however:

    • Improve Resource Allocation: Discourage unnecessarily large data structures, leading to overall improved network performance.

    • I am not in fully agreement with this description. Why discourage use of large data if price is fair? Is it really exponential regarding computational time always?
  2. SOMP: Base fee plus Variable Component fee looks enough to be, and solves the core problem of the described main issue. However, I would not say that Base Fee Reduction needs to lower the base fee compared to nowadays. I would say that the current price is the base fee without the variable component. The rest is an adjustment to the lack of correct price charging on such items.

vncoelho commented 3 days ago

I am in agreement with the described Tasks

Regarding Backward Compatibility I am fine as well, this is expected.

In relation to Drawbacks and Risks, the only item that is relevant to me is Impact on Existing Smart Contracts and the transition is the same mentioned before regarding backward compatibility. The other items in this topic I thing they are speculative and maybe wrong regarding extra computational efforts.

The topic Further Considerations looks relevant to me.

igormcoelho commented 3 days ago

I agree that it's terrible to read text generated by AI, because it's too verbose... please just use your own words and thats fine @Jim8y . Anyway, I read the whole issue, and I believe that the community is not in favor of fully removing GC. Some time ago I did some studies on GC for C++, and shared some ideas with @vang1ong7ang , that largely contributed at the time for a better design of the library... it's almost ready for usage for around an year (https://github.com/igormcoelho/cycles), but truth is: it won't work well for C# or Go. It could help a lot on C++ and Rust, but I think that in our case here, sadly it won't help. It was meant to compete with GCPP, which was an alternative for Tarjan and Arena on C++, by Herb Sutter. So, we are back to the common alternatives: Tarjan (which is terrible) and Arena, that seems to be the last existing choice. From what I read, and I'll try to deepen specifically on the GC remove issue, we are dealing with an Arena pattern, right? We will put all objects in a list, and simply drop them when execution engine is done for that transaction. For me, it seems fine, specially for our use cases here. The idea of having some general GC or even some other library like my Cycles or GCPP from Mr Sutter is to aim for greater generalization of the applications of NeoVM... not limiting it to Blockchain, which was the main motivation behind moving from Neo 2 to Neo 3 (get away with blockchain specializations). But, at this point, since NeoVM is used on Neo for blockchain purposes, I agree with any memory simplification, such as an Arena.

Jim8y commented 2 days ago

This is a new version, purly written by me, with only very few from the original version (i wrote them as well).

Introduction

Neo VM is vulnerable to various DoS attacks, in the past few years, the neo security team have discoverd multiple DoS attacks that can stop the neo network at very low price. To mitigate those threat, we deployed many limitations to the VM, making sure that no attack-like complex operations can be performed. This, however, also limits the potential and ability of the Neo VM. After thourough investigation, we think that it is actual the design of the Neo VM that makes it easy to be DoSed, and simply apply various limitations can not really solve the DoS attack, as attacker can alwasys find a way to DoS the VM.

The minumum operating unit in Neo VM is StackItem, its stong typed. Yet the price for operations upon StackItems are type-agnostic, with the same operation, no matter how complex the actual cost is, the price will be the same. However, in reality, the actual cost for operations on different types can be a huge difference, the cost for operations upon complex type can be hundreds of times more expensive than that of simple type.

If we set the price for operations based on the worst case (running upon complex types), Neo VM will be very expensive for ordinary users to use. Thus, almost all prices in the NeoVM is set without considering the worst case. Making attackers able to construct malicious scirpt to DoS Neo VM at relatively low price.

Aside from the price issue, there is another problem that makes Neo VM vulnerable to DoS attacks, which is the VM GC. To pricely count the living StackItems that resides in the ExecutionEngine, NeoVM introduces a GC mechanism to deal with the cycling-reference issue that will casue the counting inacurract. The cost of GC is related to the count of StackItem and the refernce relationships among them, it can be very cheap, but can also be very expensive. However, the cost of GC can only be determined at runtime, there is no way for us the predict how expensive the GC process actually is.

As a result, even with the same transactoin fee, the execution cost for a transaction execution can be very different. Some may only take a few milliseconds, some can be a few minutes even hours.

To address this problem, we proposes two protocols, DOPP and SOMP:

  1. Dynamic Opcode Pricing Protocol (DOPP): Since fixed price is vulnerable to DoS attacks, and the actual cost of operations upon complex types are related to the size and subitems of StackItem, we can decide the price of the operation at runtime, and calculate it with all related factors considered.

  2. Simple Object Management Protocol (SOMP): The reason of having GC is to ensure an accurate living StackItem counting, but actually, it does not matter if the counting has circularly referenced objects or not, it will not cause any security issue to the neo VM, just decrease the VM ability to process that specific transaction. And based on our evaluation, there currently is no circular reference ever happend in the neo at all, both mainnet and testnet. Thus we propose SOMP, that manages StackItem without GC.

Case Studies

Benchmark Samples

Compound Type Operations with GC

Item Count NEWARRAY UNPACK CLEAR
32 3.958 μs 5.960 μs 16.731 μs
128 5.721 μs 10.803 μs 48.562 μs
1,024 26.086 μs 190.762 μs 505.134 μs
2,040 32.465 μs 496.604 μs 939.693 μs

Compound Type Operations without GC

Item Count NEWARRAY UNPACK CLEAR
32 3.830 μs 4.735 μs 4.373 μs
128 5.643 μs 12.570 μs 9.457 μs
1,024 25.644 μs 29.246 μs 35.977 μs
2,040 31.522 μs 59.574 μs 81.830 μs

Analysis:

Backward Compatibility

This proposal requires a hardfork, since the price of operations will be different. Normally this will not cause any problem to smart contract, but if some contract depends on the accurate number of gas price, it will be affected and need an upgrade.

Drawbacks and Potential Risks

  1. Partial Mitigation of DoS Threats: This proposal can not solve all types of DoS attacks in NeoVM, there could still be other attack surfaces.

  2. Increased Computational Overhead: Calculating the operation price at runtime introduces extra overhead, though it balances the cost and price.

  3. Potential New Attack Surfaces: Without careful implementation, the changes might expose new DoS attack surfaces. And we removed the GC that handles cyclic references, so we must ensure that cyclic references do not cause problems.

  4. Impact on Existing Smart Contracts: Smart contracts that depend on precise gas fees for execution will be affected and may need to be updated.

Expected Impact

  1. Predictable Execution Time: Execution time can be predicted. For any valid transaction, with the same transaction fee and with all its fee executed, it will be executed for approximately the same amount of time.

  2. Consensus Stability: We can set an upper bound for execution time by limiting the maximum gas fee in transactions or blocks, enhancing consensus stability.

  3. Reduced Transaction Fees for Users: The transaction fees for ordinary users will be lower, because the base fee for opcode operations will be reduced, while attack costs will increase significantly, making malicious activities uneconomical.

  4. Simplified VM Architecture: Removing the GC will simplify the VM architecture and potentially improve overall performance.

Further Considerations

Suggestion from Alibaba: We can use an increase-only counter for references or an increase-only counter for StackItem creation. With a higher counter value that is above a bar, the opcode price will be higher, which increases the cost of attacks while normal contract execution remains unaffected.

Jim8y commented 2 days ago

@neo-project/core As the first version caused outrage, thus i updated this new version.

EdgeDLT commented 2 days ago

I think these are important changes.

SOMP seems like it would be easier to implement with immediate advantages and no backwards compatibility issues, so for me it's an easy one to move forward on.

DOPP is harder because we don't truly know the impact on deployed contracts until we start building the solution. So there is an upfront cost, then a delayed delivery because it will likely need fine-tuning to minimize impact on deployed contracts.

So my question would be... where do these land in terms of priority? Are either or both more important than solving state root in 3.8?

However, I would not say that Base Fee Reduction needs to lower the base fee compared to nowadays. I would say that the current price is the base fee without the variable component. The rest is an adjustment to the lack of correct price charging on such items.

@vncoelho I think the backwards compatibility issue is why the base fee would need to be lowered. Keep it the same and then add variable pricing on top and now everything gets more expensive. We should try keep prices the same (or cheaper) for optimized apps, and only complex, poorly optimized, or malicious apps ones get more expensive.