onflow / cadence

Cadence, the resource-oriented smart contract programming language 🏃‍♂️
https://developers.flow.com/cadence
Apache License 2.0
531 stars 139 forks source link

Ability to cancel script execution #1435

Open m4ksio opened 2 years ago

m4ksio commented 2 years ago

Issue To Be Solved

It should be possible to cancel the execution of the script. Once usecase would be to time-out script execution on access nodes.

Suggested Solution

Add standard Go context ( https://pkg.go.dev/context ) to ExecuteScript method in Runtime

bluesign commented 2 years ago

@m4ksio Why would access node time-out my script execution ? Isn't gas limit better for that?

m4ksio commented 2 years ago

@m4ksio Why would access node time-out my script execution ? Isn't gas limit better for that?

Gas-based limits are useful when we care about fairness. But sometimes all we care about is execution time. Also our gas model doesn't really work that great yet.

turbolent commented 2 years ago

@bluesign Script execution is a feature of the protocol, but concrete availability depends on the node operators. As script execution is computation, and computation costs money, operators may limit the amount through various means, one which could be time based.

bluesign commented 2 years ago

@bluesign Script execution is a feature of the protocol, but concrete availability depends on the node operators. As script execution is computation, and computation costs money, operators may limit the amount through various means, one which could be time based.

This can have some bad consequences like:

I think script execution is almost as important as transaction execution, some reliability there is very important

m4ksio commented 2 years ago

what if an AN decides to not run scripts at all or geo block some IPs

ANs are required to be available for everyone, or even provide any service at all. External-facing API is not part of the protocol.

some reliability there is very important

Yes but Cadence is used far beyond Flow core protocol. Ability to cancel script execution based on external is just another option to control execution, this doesn't force it to be used where it should be fully deterministic - ie tx execution.

turbolent commented 2 years ago

cc @ramtinms

ramtinms commented 2 years ago

@bluesign I see your points and there are no doubts scripts are as important as transactions in flow ecosystem.

I think one important aspect is having a way to pass control signals from upper layers to cadence is really valuable and we could save resources when the output is going to be ignored, even if extra timeouts are not added by the AN or ENs, client GRPC calls always have a timeout or connections might always be dropped. A common case that has happened before is a grpc call gets a client timeout on a really short default timeout and the client constantly retries many times causing an unnecessary load on ENs (while the client would ignore the earlier calls anyway) or the attacker sends requests and drops the connection immediately.

Even if we decide to add a script execution timeout to the Execution nodes (which I think is outside of the scope of this request), we are aiming to cover 99.999% of scripts and mostly keep time outs as a safety measure for attacks that figure out a way to keep their gas usage low but has a huge time and resource impact on ENs.

bluesign commented 2 years ago

Thanks @ramtinms, I totally understand the technical reason, and it is fair enough.

but concrete availability depends on the node operators.

my problem is this one actually, on one hand I am hearing something as 'staked AN' terminology, then it is kinda conflicting in my brain with this statement.

we are aiming to cover 99.999% of scripts maybe you will, but what is the motivation for node operator?

But to be honest it will be so much clear, if you ( as Flow Team ) can share a bit about what is in your head in the end game? Script execution, execution data etc, how this will fit in the big picture, in a byzantine proof way, as I am unable to see.

Actually I am constantly asking some information about big picture ( end game ) for almost 1 year

As a note:

attacks that figure out a way to keep their gas usage low but has a huge time and resource impact on ENs

this is also not making sense for me, then something is wrong with gas calculation. ( Though this discussion is another parallel I am really trying to avoid )

ramtinms commented 2 years ago

I see your points, I totally understand the confusion.

For the bigger picture, I would try to find the right set of people to provide a better picture of long-term improvements plans, since my knowledge is only limited to the process of execution and verification. But here is my understanding...

There are two services provided under execution scope: execution of transactions, execution of scripts (which should abstract away data storage implementation details from the user perspective). Although these two services are both provided currently by the Execution node (with the caveat of running scripts against the older states), from the protocol perspective, execution nodes would play the role of executing transactions, reporting results such as events and state changes, and another node group (e.g. access) type would play the role of access to the historic data through running scripts, and any read-only operations that don't require execution and protocol state changes.

From the BFT perspective, both of these two services would need protection and adjudication in case of infractions (such as integrity attacks, data availability attacks) by the consensus nodes.

For the first service (tx execution): the results of collection execution (execution state root hash changes, state change proofs, event root hashes changes, etc) are checked and verified by a set of verification nodes and any infraction is reported to the consensus nodes for evaluation. From the protocol and performance perspective, execution nodes are only responsible to answer data requests from access nodes, consensus nodes and verification nodes.

For the second service: the results of the script execution or raw access to the registers can be verified with proofs and similar to the previous service infractions can be reported for evaluation to the consensus nodes. This requires access nodes holds stakes inside the protocol.

Though since the second service is a read-only operation, several layers of similar software might provide post-processed and more advanced data query and delivery (on top of access nodes) and do the verification on behalf of users, which I think previously had the very confusing name of unstacked access (and I think its called observer), which are layers of services run outside of protocol and talks to the access nodes (stakes) and provides services to users (think of it how in the past each dapp developer had to run a GETH node or trust services (e.g. alchemy, etc).

On the topic of spam prevention, while currently, we measure computation usage of the transaction, the more improvements and more advanced versions of measures such as memory allocation usage, network overhead and weighted execution fees calculation are currently under development. While this should prevent spam preventions and provides controls to the governance layer to adjust the fees accordingly to prevent any future problems, meanwhile we need to provide deterministic tooling for the nodes to prevent themself from potential resource attacks, which would be unnecessary when advanced transaction and script metering are in place.

hope my notes didn't make it more confusing, and I'll make sure there are more documentation and communication out there.

j1010001 commented 1 year ago

To do this we need to add some mechanism to make the interpreter stoppable.

j1010001 commented 1 year ago

Doable in the interpreter, might be easier to do if/when we have a VM.

bluesign commented 1 year ago

Doesn't AN already timeout script execution? Is this for some kind of clean stop?

turbolent commented 1 year ago

@bluesign Yes, it already does it, but indirectly through metering callbacks. This issue is about adding proper support for cancellation at any point.