
Improve off-chain worker logic #932

Open sameh-farouk opened 8 months ago

sameh-farouk commented 8 months ago

Describe the bug

https://github.com/threefoldtech/tfchain/blob/8dfb941263450b62c2b4eb5e3d51bc9f18ba9966/substrate-node/pallets/pallet-smart-contract/src/billing.rs#L66-L69

I believe that this method sometimes returns silently even though the worker runs on a validator that is supposed to author the next block.

I found that there have been instances where a validator had multiple Aura keys in its keystore (keys were rotated), which caused this issue since we use any_account(). To mitigate this, I asked ops to ensure that the nodes' local keystores contain only the relevant keys. Since then, things have improved.

However, we can also revise this part of the code, research whether there is a better way to select the appropriate key, and make sure is_next_block_author() is reliable. This issue also serves as a reference for a known issue in the current implementation.
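A minimal sketch of a more deterministic key selection, assuming the standard frame_system::offchain::Signer API (the function, `expected_key`, and the call fields are hypothetical, not the actual tfchain code):

```rust
use frame_system::offchain::{SendSignedTransaction, Signer};

// Inside the pallet (so `T: Config`, `Call`, and the off-chain worker context
// are in scope). `expected_key` is hypothetical: the one public key known to
// match the active Aura authority for this validator.
fn submit_billing_tx<T: Config>(expected_key: T::Public, contract_id: u64) {
    // Restrict the signer to that single key instead of letting
    // `any_account()` pick whichever keystore entry it finds first
    // (which may be a stale key left over from a rotation).
    let signer = Signer::<T, T::AuthorityId>::any_account()
        .with_filter(vec![expected_key]);

    let _ = signer.send_signed_transaction(|_account| {
        // Illustrative call shape; the real extrinsic fields may differ.
        Call::bill_contract_for_block { contract_id }
    });
}
```

With the filter in place, the signer either signs with the expected key or does nothing, instead of silently using the wrong key.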

sameh-farouk commented 8 months ago

Update: I see it happening on two devnet validators that have only one Aura key in the keystore, so it seems is_next_block_author() is misbehaving. I'll try to figure out what is wrong here to fix it, or possibly rewrite it.

sameh-farouk commented 8 months ago

Update: This happens because is_next_block_author() assumes that auraApi.authorities is the same as session.validators, which is not always true depending on how the keys were set up on the validators. If the condition doesn't pass, billing is skipped.

https://github.com/threefoldtech/tfchain/blob/8dfb941263450b62c2b4eb5e3d51bc9f18ba9966/substrate-node/pallets/pallet-smart-contract/src/billing.rs#L727-L738

Every validator account has an associated validator ID. In some cases this is just the same as the account ID; in other cases, the validator ID is the account ID of the controller.
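To make the failure mode concrete, here is a simplified reconstruction of the kind of check involved (the helper arguments are illustrative; the linked billing.rs lines have the real code):

```rust
use sp_runtime::AccountId32;

// Illustrative stand-ins for AuraApi::authorities(), Session::validators(),
// and the current slot; in the pallet these come from runtime state.
fn is_next_block_author(
    signer: &AccountId32,
    authorities_len: u64,
    validators: &[AccountId32],
    current_slot: u64,
) -> bool {
    // Aura assigns slots round-robin over `auraApi.authorities`...
    let author_index = ((current_slot + 1) % authorities_len) as usize;
    // ...but the comparison is made against `session.validators`. This only
    // works when both lists line up: after a key rotation, the validator at
    // `author_index` no longer corresponds to the Aura authority at that
    // index, the equality fails, and billing is silently skipped.
    validators.get(author_index) == Some(signer)
}
```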

I tracked the validators that have this issue (author account != validator account), marked below with **:

Devnet
auraApi.authorities: Vec<AuthorityId>
[
  5EvbAkU2fMiWXGrAmnUTi4zdVNXTa9cpamnhakuQcSMYSnpT
  5GgRLHTmSKBUhqE34DHVgBWZrRbG4xRWLxiCxTPeq37jDHax
  5FL6PmDmouGwKHq6nHJ5sFTS4RWHZ5SbPTA6i2gK6zBfbAH8                              
  5EJU9UKKqmLjbUTTPGWoDULUFvJ2jXsvkEz7KuAAJR4vYnvi                                          
  5CdTbqTUM8hQngLJ2awxDBE6WpxccUJcu9iWSEGtm9XKuz3f
]

session.validators: Vec<AccountId32>
[
  5EvbAkU2fMiWXGrAmnUTi4zdVNXTa9cpamnhakuQcSMYSnpT
  5GgRLHTmSKBUhqE34DHVgBWZrRbG4xRWLxiCxTPeq37jDHax
  **5DZVVFwJyEKEAUtrwmvPxkWzvjtFc73LybFK8xJse57eBP4X**                                              
  **5FXAK4MktAf5dQZURaC5sdRWZNgJPr8zB6iFr3Cbi1j4Svjv**                                                      
  5CdTbqTUM8hQngLJ2awxDBE6WpxccUJcu9iWSEGtm9XKuz3f
]

QAnet
auraApi.authorities: Vec<AuthorityId>
[
  5CJo3yK6pt5XRAcr2Rh3uYqej4NUQcoB6UVDByjdaCzs7Xxg
  5EEt6fHShfg7haY9TaeC4tAM9GavSxgy82637zqdSGNyLFvv
  5DaCgDYD4crAkttVYBHQvs9uHqBK9afiX3g4UkubfH18EkSw
  5E1euBd4m4zG47BdZTUH6UgLLY9ry59C97ZSwh7ghHcBk9dB
]

session.validators: Vec<AccountId32>
[
  5CJo3yK6pt5XRAcr2Rh3uYqej4NUQcoB6UVDByjdaCzs7Xxg
  5EEt6fHShfg7haY9TaeC4tAM9GavSxgy82637zqdSGNyLFvv
  5DaCgDYD4crAkttVYBHQvs9uHqBK9afiX3g4UkubfH18EkSw
  5E1euBd4m4zG47BdZTUH6UgLLY9ry59C97ZSwh7ghHcBk9dB
]

Testnet
auraApi.authorities: Vec<AuthorityId>
[
  5G3t9VaCQf5M2PawGz3iENYMwzpbMkaqZ1nxavnEaH2oPa3K
  5DnMaGMuM7Zp5yY2Cir8nzk8nmrk5nmTG4GKe7iK3PirVdEm
  5CWzZyAvuEDZrQ1y3ZmpE1XgNrd8Wtg8VEaC4nFfnimT8DZD                              
  5Ejk42UCKWuTqqfXkckyDgTggLXXgyDxGWEUQj7etHHuugjB                                  
  5DS9B4BeVXQ5rf7qn18zRXN7QjoysHp7tn8iBsQtXnNJ9FHA                                      
  5FYi5D3ZzfG3wEGzSHEXnkj85FK2ovNmnG4bBzYQZWfHAFM9                              
]

session.validators: Vec<AccountId32>
[
  5G3t9VaCQf5M2PawGz3iENYMwzpbMkaqZ1nxavnEaH2oPa3K
  5DnMaGMuM7Zp5yY2Cir8nzk8nmrk5nmTG4GKe7iK3PirVdEm
  **5GNEmtvjdh6cq8C7FMsWC1azqxKbiYZZLR6EL4MzzF7q5GWs**                                  
  **5H3EtZwEDK9p9v8Udr8uXW1LbpBHiNxDj5z6cnbtMDxX9sMb**                              
  **5DctbafPVjTWmDsfx85LHJVNd7r3dQX9mvWxmQHxZ8whdmjH**                      
  **5GEquvwY5SfsmW4psN94orfs1X1i7GG4gw4Lc4JYMGcJ6j2J**                                      
]

Mainnet
auraApi.authorities: Vec<AuthorityId>
[
  5FqHp29vhYLcAyou8nNGY9T4S8bPTyMQq5GwqkvXBgb81Qjs
  5DkuYHpdkAThvgLrt9zdPh2DATodb7cXkWihzx6Toh5bbLim
  5CkyTonPpVADmwCWJhSvS8vNwZ8QtwQFjSRA5ojLvA9EiFu4                                  
  5DkHjGE4b9SCyD6h1Jr3vhJsDvyLhjPreHzRf5CLHEidgMpZ                                              
  5EZkx6b5GUN7BGFwM8UHh7NRpgFXMvtidtHg5LqA6FzxmVuW                      
  5Gn5ceQKEeNN1XdhhhcnN1hhQLiHmX1Ymi5kgvH2bJpJtyd4
  5DiF4BSDP11Hit71AeN1HKx71BBMcz4f2QheYBc2RiaHgTCj                                          
  5Eo9uD6bpcQq9LYCHegnM2ZkbFRiM5QW9hv6wU97gyj9NKAn
  5H3JhEboYXaFxFMMBN26kurzZzmeGMjQceqjdhhUGuCwKy2x
]

session.validators: Vec<AccountId32>
[
  5FqHp29vhYLcAyou8nNGY9T4S8bPTyMQq5GwqkvXBgb81Qjs
  5DkuYHpdkAThvgLrt9zdPh2DATodb7cXkWihzx6Toh5bbLim
  **5EcBCAoBYkK7J83b7Fhe17uQJVFDvJMxUudyxnAsHCtea64N**                                  
  **5EZhyC4GR4nxGU9qs4oMgipX4UCr4VdGRK2co4fbah2oz9n2**                                  
  **5H6Ao7g57BZ2HaeikpDEbpMVMuZDSD64Qz9etuvU1bTsivgG**                      
  5Gn5ceQKEeNN1XdhhhcnN1hhQLiHmX1Ymi5kgvH2bJpJtyd4
  **5F6jW3dsjEx1TfJLmrtS2V9ftyCvPe5TjbHMoKisCWMyR2Qh**                      
  5Eo9uD6bpcQq9LYCHegnM2ZkbFRiM5QW9hv6wU97gyj9NKAn
  5H3JhEboYXaFxFMMBN26kurzZzmeGMjQceqjdhhUGuCwKy2x
]

solution 1: set new session keys on the affected validators so that the Aura key matches the validator (controller) account, restoring the assumption the current check relies on.

solution 2: revise the off-chain worker code so it no longer assumes that auraApi.authorities matches session.validators.

I already discussed solution 1 with @coesensbert, as it is the fastest one, given that about half of our mainnet validators skip billing at the moment.

Also, it would be good to communicate that there will be a spike in billing: all affected contracts will suddenly be charged the large amounts due once this gets fixed. @xmonader

I will also update the relevant operations ticket.

sameh-farouk commented 8 months ago

We went with solution 1 and set new session keys for the affected validators on devnet to meet the assumption that the Aura key is the same as the controller key, so the next author can be detected. Billing triggered successfully for the affected contracts, as expected. Testnet and mainnet will follow.

sameh-farouk commented 7 months ago

Update: All networks have been set up following the same process mentioned in the previous comment. This should be enough to make billing run fine, and the issue could be closed at this stage.

But I want to delve into some thoughts I have regarding this:

The off-chain API does not support checking whether the off-chain worker is running on the next block author. Dylan's code serves as a workaround (a good one indeed), but it's essential to be cautious regarding key setup:

Furthermore, even if we ensure that only the off-chain worker running on the next block author submits these transactions, (I think) there is still a possibility that they will not be included in the transaction set chosen by that validator for its block, especially in busy/congested network conditions where many pending transactions are already in the TX pool, so a transaction can theoretically end up in another block.

The main question is: what benefit do we gain by checking that this validator becomes the next block author? Currently, this remains unclear to me, and to determine the right approach I need clear requirements. Can you share your motivation behind this check, @DylanVerstraete?

Consider the following scenarios:

As soon as I get a clearer answer on why we initially decided to implement this check, I can revisit the issue to verify whether the check truly meets the requirements at this time, whether it is still valid, and how it can be further improved.

DylanVerstraete commented 7 months ago

What benefits do we gain by checking that this validator becomes the next block author? Currently, this remains unclear to me and to determine the right approach, I need clear requirements. Can you share your motivation behind this check @DylanVerstraete ?

The main motivation was that a validator submitting the extrinsics would get the fees back (as block author). Otherwise this could lead to the validator account being drained.

I don't believe this extrinsic can be made unsigned; there were some issues with that approach.

One that accepts unsigned transactions from validators (no fees).

=> there is no way to prevent this extrinsic from being abused by anyone other than validators; since the transaction is unsigned, you cannot know who is calling it

To me, this was only a temporary implementation anyway. If billing triggers could be implemented in Zos nodes, I think that would be better, so validators (read: the network) would only need to check the validity of that transaction. (This has some drawbacks, since IPv4 would not be billed periodically if a node is down.)

sameh-farouk commented 7 months ago

Main motivation was that a validator submitting the extrinsics would be returned the fees. Otherwise this could lead to the validator account being drained

If all validators send transactions and all fees go every time to one of them (the block author), how can the validators' accounts be drained?

there is no way this extrinsic cannot be abused by anyone else other than validators, since the transaction is unsigned you cannot know who is calling it

You can sign and submit a payload with the unsigned transaction and verify it in the runtime to know whether it was signed with a key belonging to a validator.
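A minimal sketch of that pattern, following the standard FRAME off-chain worker example (BillPayload and bill_contract_unsigned are hypothetical names, not the tfchain API):

```rust
use codec::{Decode, Encode};
use frame_system::offchain::{SendUnsignedTransaction, SignedPayload, Signer, SigningTypes};

// The payload the off-chain worker signs; the signature travels with the
// otherwise unsigned extrinsic.
#[derive(Encode, Decode, Clone, PartialEq, Eq, scale_info::TypeInfo)]
pub struct BillPayload<Public> {
    pub contract_id: u64,
    pub public: Public,
}

impl<T: SigningTypes> SignedPayload<T> for BillPayload<T::Public> {
    fn public(&self) -> T::Public {
        self.public.clone()
    }
}

// In the off-chain worker (inside the pallet, so `T: Config` is in scope):
//
// let _ = Signer::<T, T::AuthorityId>::any_account().send_unsigned_transaction(
//     |account| BillPayload { contract_id, public: account.public.clone() },
//     |payload, signature| Call::bill_contract_unsigned { payload, signature },
// );
//
// In ValidateUnsigned, verify the signature and check that `payload.public`
// maps to an account in Session::validators(); that way only validators can
// trigger billing even though the extrinsic carries no fee-paying origin.
```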

DylanVerstraete commented 7 months ago

If all validators send transactions and all fees go every time to one of them (block author) how the validators’ accounts can be drained?

The distribution of contracts to be billed is uneven

You can sign and submit a payload to the unsigned transaction and verify it on runtime to know if it was signed with a key belonging to a validator.

You can

sameh-farouk commented 7 months ago

Thank you, Dylan. It's much clearer now!

sameh-farouk commented 6 months ago

Update: After researching this, I found that relying on the next block author check in this case is unreliable and can cause more harm than good.

Let me explain why. First, the check happens in the off-chain worker, a process that starts on a validator every time it imports a new block and can run as long as it needs without affecting block production time, because it is decoupled from the executive module responsible for authoring the block; the two just run in parallel. This means the off-chain worker's execution can validly span multiple blocks, and the check will only be true for a subset of the contracts that should be billed by this validator (we iterate over the contract IDs in the current shard and pre-check on each iteration whether the condition holds before sending the tx). We hit this issue whenever the number of contracts in a billing shard exceeds what can be processed in the time the validator takes to craft its block, so billing contracts can become a hit-or-miss process.
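As a sketch (the helpers are hypothetical stand-ins for the shard iteration and author check described above, not the tfchain code):

```rust
// Illustrates the timing hazard only: because the author check is
// re-evaluated per contract, an off-chain worker whose execution outlives
// its block bills only a prefix of the shard.
fn offchain_billing(block_number: u64) {
    for contract_id in contracts_in_current_shard(block_number) {
        // If block authorship has moved on to another validator by the time
        // this iteration runs, the check turns false mid-loop and the
        // remaining contracts in the shard are silently skipped.
        if !is_next_block_author() {
            continue;
        }
        submit_bill_contract_tx(contract_id);
    }
}
```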

Second, Dylan mentioned that the reason for this check was to prevent validators from paying TX fees. However, submitting the transaction when the validator is the next block author does not guarantee that the same validator will pick it from the TX pool. In congested network conditions, many TXs are submitted to the validators' TX pools. As a result, a transaction submitted by the off-chain worker running on one validator can end up in another validator's block, resulting in TX fees being paid to that other validator.

Suggestion: It might be better to have all validators submit signed transactions and move the verification to the runtime. If the origin is any validator, fees would be waived. We should also use some sort of verification to prevent the call from being executed multiple times for the same contract in the same block unnecessarily.
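A hedged sketch of that flow (assuming ValidatorId == AccountId and that the pallet can read pallet_session; the error variant, call fields, and Self::bill_contract helper are illustrative):

```rust
// Inside #[pallet::call] of pallet-smart-contract.
#[pallet::call_index(0)]
#[pallet::weight(<T as Config>::WeightInfo::bill_contract_for_block())]
pub fn bill_contract_for_block(
    origin: OriginFor<T>,
    contract_id: u64,
) -> DispatchResultWithPostInfo {
    // A regular signed origin: the runtime, not the off-chain worker,
    // decides whether the caller is a validator.
    let who = ensure_signed(origin)?;
    ensure!(
        pallet_session::Pallet::<T>::validators().contains(&who),
        Error::<T>::NotAValidator
    );

    // Hypothetical internal helper that does the actual billing work.
    Self::bill_contract(contract_id)?;

    // Pays::No waives the fee, so validators are not charged for doing
    // the network's billing work, no matter whose block the tx lands in.
    Ok(Pays::No.into())
}
```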

Outcome for the suggested flow:

sameh-farouk commented 1 month ago

Update: The is_next_block_author function was removed, and the verification now happens at runtime. If the caller is a validator, the transaction fee is waived. This change allows any validator with off-chain workers enabled to send transactions, making the billing process more resilient. It no longer requires all validators to have off-chain workers enabled.

To prevent duplicate transactions, billed contracts are tracked in SeenContracts storage. This storage is cleared in on_finalize to prevent unnecessary use of storage space.
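Sketched roughly (the storage shape is assumed from the description above, not copied from the PR):

```rust
// Contracts billed in the current block, keyed by contract_id.
#[pallet::storage]
pub type SeenContracts<T: Config> = StorageMap<_, Blake2_128Concat, u64, ()>;

// At the start of the dispatchable:
//
// ensure!(
//     !SeenContracts::<T>::contains_key(contract_id),
//     Error::<T>::ContractAlreadyBilledInBlock
// );
// SeenContracts::<T>::insert(contract_id, ());

// In the pallet's Hooks implementation, wipe the set every block so the
// de-duplication data never accumulates in storage:
fn on_finalize(_n: BlockNumberFor<T>) {
    let _ = SeenContracts::<T>::clear(u32::MAX, None);
}
```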

Summary:

These changes ensure a more robust and efficient billing process. The PR is still WIP.

Here is an example that demonstrates the unreliability of the current approach: https://polkadot.js.org/apps/?rpc=wss://tfchain.grid.tf/ws#/explorer/query/13378163

You can see that different validators submitted the billContractForBlock transaction at different billing indexes and ended up in the same block. Unfortunately, only one of them belongs to the block author; the others incurred fees.

sameh-farouk commented 1 month ago

Update: I conducted some tests on the new approach, and what works well now is that all validators who call billContractForBlock are exempt from transaction fees, regardless of whether the transaction ended up in a block produced by the sender or not.

Also, as long as at least one validator is running an off-chain worker, billing should work fine.

A bit of a blocker that needs more research: the new de-duplication mechanism is not perfect. Under the new approach, if multiple transactions try to bill the same contract within the same block, only one of them will succeed. The others will fail because of the new SeenContracts check at the beginning of the runtime dispatch. This approach can help save node resources by stopping unnecessary work early. However, I think it is not ideal to fill blocks with failed transactions, especially when many off-chain workers are enabled.

I am currently looking into performing the validation before the execution phase, specifically at the transaction pool level. This validation occurs early, when the transaction is examined for factors such as a valid origin. If a transaction fails this validation, it will not be included in the transaction pool or shared with other nodes.

https://github.com/paritytech/polkadot-sdk/blob/master/substrate/primitives/runtime/src/transaction_validity.rs#L255

Researching this would extend the time required for this fix, but if successful, it would better meet our requirements.

sameh-farouk commented 1 month ago

Update: I observed weird behavior while testing: the off-chain worker transaction fails with a low-priority error every 600 blocks (once per era), so contracts that are supposed to be billed at index 599 are always skipped. Still investigating this.

sameh-farouk commented 1 month ago

Update:

I experimented with an alternative approach that introduces a SignedExtension in the pallet-smart-contract module, ensuring unique transaction processing for the bill_contract_for_block function by setting a provides tag equal to the contract_id.

This allows me to achieve the de-duplication early, at the TX pool level, instead of implementing it in the runtime dispatchable.
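A sketch of such an extension, assuming the call shape and the extension name (this is not the final PR code; it matches on the outer RuntimeCall via IsSubType, as FRAME extensions typically do):

```rust
use codec::{Decode, Encode};
use core::marker::PhantomData;
use frame_support::traits::IsSubType;
use scale_info::TypeInfo;
use sp_runtime::{
    traits::{DispatchInfoOf, SignedExtension},
    transaction_validity::{TransactionValidity, TransactionValidityError, ValidTransaction},
};

// Inside pallet-smart-contract, so `Call<T>` and `Config` are in scope.
#[derive(Encode, Decode, Clone, Eq, PartialEq, TypeInfo)]
#[scale_info(skip_type_params(T))]
pub struct CheckBillingDuplicates<T: Config + Send + Sync>(PhantomData<T>);

impl<T: Config + Send + Sync> SignedExtension for CheckBillingDuplicates<T>
where
    <T as frame_system::Config>::RuntimeCall: IsSubType<Call<T>>,
{
    const IDENTIFIER: &'static str = "CheckBillingDuplicates";
    type AccountId = T::AccountId;
    type Call = <T as frame_system::Config>::RuntimeCall;
    type AdditionalSigned = ();
    type Pre = ();

    fn additional_signed(&self) -> Result<(), TransactionValidityError> {
        Ok(())
    }

    fn validate(
        &self,
        _who: &Self::AccountId,
        call: &Self::Call,
        _info: &DispatchInfoOf<Self::Call>,
        _len: usize,
    ) -> TransactionValidity {
        if let Some(Call::bill_contract_for_block { contract_id, .. }) = call.is_sub_type() {
            // The pool accepts only one in-flight transaction providing a
            // given tag, so a second bill for the same contract is rejected
            // at pool admission instead of failing inside a block.
            return ValidTransaction::with_tag_prefix("SmartContractBilling")
                .and_provides(*contract_id)
                .build();
        }
        Ok(ValidTransaction::default())
    }

    fn pre_dispatch(
        self,
        who: &Self::AccountId,
        call: &Self::Call,
        info: &DispatchInfoOf<Self::Call>,
        len: usize,
    ) -> Result<Self::Pre, TransactionValidityError> {
        self.validate(who, call, info, len).map(|_| ())
    }
}
```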

Key Changes

  1. New SignedExtension Implementation:
  2. Modification to Transaction Validation:

Benefits