Update:
I see this happening on two validators on devnet that have one Aura key type in the keystore, so it seems `is_next_block_author()` is misbehaving.
I'll try to figure out what is wrong here and fix it, or possibly rewrite it.
Update:
This happens because `is_next_block_author()` assumes that `auraApi.authorities` is the same as `session.validators`, which is not always true depending on how the keys were set up on the validators. If the condition doesn't pass, billing is skipped.
Every validator account has an associated validator ID; in some cases this is just the same as the account ID, while in other cases the validator ID is the account ID of the controller.
I tracked the validators that have this issue (author account != validator account), marked below with **:
Devnet
auraApi.authorities: Vec<AuthorityId>
[
5EvbAkU2fMiWXGrAmnUTi4zdVNXTa9cpamnhakuQcSMYSnpT
5GgRLHTmSKBUhqE34DHVgBWZrRbG4xRWLxiCxTPeq37jDHax
5FL6PmDmouGwKHq6nHJ5sFTS4RWHZ5SbPTA6i2gK6zBfbAH8
5EJU9UKKqmLjbUTTPGWoDULUFvJ2jXsvkEz7KuAAJR4vYnvi
5CdTbqTUM8hQngLJ2awxDBE6WpxccUJcu9iWSEGtm9XKuz3f
]
session.validators: Vec<AccountId32>
[
5EvbAkU2fMiWXGrAmnUTi4zdVNXTa9cpamnhakuQcSMYSnpT
5GgRLHTmSKBUhqE34DHVgBWZrRbG4xRWLxiCxTPeq37jDHax
**5DZVVFwJyEKEAUtrwmvPxkWzvjtFc73LybFK8xJse57eBP4X**
**5FXAK4MktAf5dQZURaC5sdRWZNgJPr8zB6iFr3Cbi1j4Svjv**
5CdTbqTUM8hQngLJ2awxDBE6WpxccUJcu9iWSEGtm9XKuz3f
]
QAnet
auraApi.authorities: Vec<AuthorityId>
[
5CJo3yK6pt5XRAcr2Rh3uYqej4NUQcoB6UVDByjdaCzs7Xxg
5EEt6fHShfg7haY9TaeC4tAM9GavSxgy82637zqdSGNyLFvv
5DaCgDYD4crAkttVYBHQvs9uHqBK9afiX3g4UkubfH18EkSw
5E1euBd4m4zG47BdZTUH6UgLLY9ry59C97ZSwh7ghHcBk9dB
]
session.validators: Vec<AccountId32>
[
5CJo3yK6pt5XRAcr2Rh3uYqej4NUQcoB6UVDByjdaCzs7Xxg
5EEt6fHShfg7haY9TaeC4tAM9GavSxgy82637zqdSGNyLFvv
5DaCgDYD4crAkttVYBHQvs9uHqBK9afiX3g4UkubfH18EkSw
5E1euBd4m4zG47BdZTUH6UgLLY9ry59C97ZSwh7ghHcBk9dB
]
Testnet
auraApi.authorities: Vec<AuthorityId>
[
5G3t9VaCQf5M2PawGz3iENYMwzpbMkaqZ1nxavnEaH2oPa3K
5DnMaGMuM7Zp5yY2Cir8nzk8nmrk5nmTG4GKe7iK3PirVdEm
5CWzZyAvuEDZrQ1y3ZmpE1XgNrd8Wtg8VEaC4nFfnimT8DZD
5Ejk42UCKWuTqqfXkckyDgTggLXXgyDxGWEUQj7etHHuugjB
5DS9B4BeVXQ5rf7qn18zRXN7QjoysHp7tn8iBsQtXnNJ9FHA
5FYi5D3ZzfG3wEGzSHEXnkj85FK2ovNmnG4bBzYQZWfHAFM9
]
session.validators: Vec<AccountId32>
[
5G3t9VaCQf5M2PawGz3iENYMwzpbMkaqZ1nxavnEaH2oPa3K
5DnMaGMuM7Zp5yY2Cir8nzk8nmrk5nmTG4GKe7iK3PirVdEm
**5GNEmtvjdh6cq8C7FMsWC1azqxKbiYZZLR6EL4MzzF7q5GWs**
**5H3EtZwEDK9p9v8Udr8uXW1LbpBHiNxDj5z6cnbtMDxX9sMb**
**5DctbafPVjTWmDsfx85LHJVNd7r3dQX9mvWxmQHxZ8whdmjH**
**5GEquvwY5SfsmW4psN94orfs1X1i7GG4gw4Lc4JYMGcJ6j2J**
]
Mainnet
auraApi.authorities: Vec<AuthorityId>
[
5FqHp29vhYLcAyou8nNGY9T4S8bPTyMQq5GwqkvXBgb81Qjs
5DkuYHpdkAThvgLrt9zdPh2DATodb7cXkWihzx6Toh5bbLim
5CkyTonPpVADmwCWJhSvS8vNwZ8QtwQFjSRA5ojLvA9EiFu4
5DkHjGE4b9SCyD6h1Jr3vhJsDvyLhjPreHzRf5CLHEidgMpZ
5EZkx6b5GUN7BGFwM8UHh7NRpgFXMvtidtHg5LqA6FzxmVuW
5Gn5ceQKEeNN1XdhhhcnN1hhQLiHmX1Ymi5kgvH2bJpJtyd4
5DiF4BSDP11Hit71AeN1HKx71BBMcz4f2QheYBc2RiaHgTCj
5Eo9uD6bpcQq9LYCHegnM2ZkbFRiM5QW9hv6wU97gyj9NKAn
5H3JhEboYXaFxFMMBN26kurzZzmeGMjQceqjdhhUGuCwKy2x
]
session.validators: Vec<AccountId32>
[
5FqHp29vhYLcAyou8nNGY9T4S8bPTyMQq5GwqkvXBgb81Qjs
5DkuYHpdkAThvgLrt9zdPh2DATodb7cXkWihzx6Toh5bbLim
**5EcBCAoBYkK7J83b7Fhe17uQJVFDvJMxUudyxnAsHCtea64N**
**5EZhyC4GR4nxGU9qs4oMgipX4UCr4VdGRK2co4fbah2oz9n2**
**5H6Ao7g57BZ2HaeikpDEbpMVMuZDSD64Qz9etuvU1bTsivgG**
5Gn5ceQKEeNN1XdhhhcnN1hhQLiHmX1Ymi5kgvH2bJpJtyd4
**5F6jW3dsjEx1TfJLmrtS2V9ftyCvPe5TjbHMoKisCWMyR2Qh**
5Eo9uD6bpcQq9LYCHegnM2ZkbFRiM5QW9hv6wU97gyj9NKAn
5H3JhEboYXaFxFMMBN26kurzZzmeGMjQceqjdhhUGuCwKy2x
]
Solution 1:
Solution 2:
I already discussed solution 1 with @coesensbert as it is the fastest one, given that about half of our mainnet validators are skipping billing ATM.
Also, should we communicate that there will be a spike in billing? All affected contracts will be charged the large due amounts at once when this gets fixed @xmonader
I will also update the relevant operations ticket.
We went with solution 1 and set a new session key for the affected validators on devnet to satisfy the assumption that the Aura key is the same as the controller key used to detect the next author. Billing then triggered successfully for the affected contracts as expected. Testnet and Mainnet will follow.
Update: All networks have been set up following the same process mentioned in the previous comment. This should be enough to make billing run fine, and the issue can be closed at this stage.
But I want to delve into some thoughts I have regarding this:
The offchain API does not support checking whether the off-chain worker is running on the next block author. Dylan's code serves as a workaround (a good one indeed), but it's essential to be cautious regarding the key setup.
Furthermore, even if we ensure that only the off-chain worker running on the next block author submits these transactions, (I think) there's still a possibility that they will not be included in the transaction set chosen by that validator for its block, especially in busy/congested network conditions where there are already many pending transactions in the TX pool, so they could theoretically end up in another block.
The main question is: What benefits do we gain by checking that this validator becomes the next block author? Currently, this remains unclear to me, and to determine the right approach I need clear requirements. Can you share your motivation behind this check @DylanVerstraete ?
Consider the following scenarios:
As soon as I get a clearer response on why we initially decided to implement this check, I can revisit the issue to verify whether this check truly meets the requirements at this time, whether it's still valid, and how it can be further improved.
> What benefits do we gain by checking that this validator becomes the next block author? Currently, this remains unclear to me, and to determine the right approach I need clear requirements. Can you share your motivation behind this check @DylanVerstraete ?
Main motivation was that a validator submitting the extrinsics would be returned the fees. Otherwise this could lead to the validator account being drained.
I don't believe this extrinsic can be made unsigned, there were some issues with that approach.
> One that accepts unsigned transactions from validators (no fees).
=> there is no way to prevent this extrinsic from being abused by anyone other than validators, since the transaction is unsigned and you cannot know who is calling it
To me, this was only a temporary implementation anyway. If billing triggers could be implemented in Zos nodes, I think that would be better, so validators (read: the network) would only need to check the validity of that transaction. (This has some drawbacks, since ip4 would not be billed periodically if a node is down.)
> Main motivation was that a validator submitting the extrinsics would be returned the fees. Otherwise this could lead to the validator account being drained
If all validators send transactions and all the fees always go to one of them (the block author), how can the validators' accounts be drained?
> there is no way to prevent this extrinsic from being abused by anyone other than validators, since the transaction is unsigned and you cannot know who is calling it
You can sign and submit a payload with the unsigned transaction and verify it in the runtime to know whether it was signed with a key belonging to a validator.
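For reference, a minimal sketch of that pattern, following the signed-payload flow from Substrate's off-chain worker example pallet; `BillPayload`, `bill_contract_unsigned`, and `is_validator_key` are hypothetical names, not existing tfchain code:

```rust
use frame_support::pallet_prelude::*;
use frame_system::offchain::{SignedPayload, SigningTypes};
use sp_runtime::{
    traits::ValidateUnsigned,
    transaction_validity::{InvalidTransaction, TransactionSource, TransactionValidity, ValidTransaction},
};

// Hypothetical payload carried inside the unsigned transaction and signed by the
// off-chain worker with a validator key.
#[derive(Encode, Decode, Clone, PartialEq, Eq, RuntimeDebug, TypeInfo)]
pub struct BillPayload<Public> {
    pub contract_id: u64,
    pub public: Public,
}

impl<T: SigningTypes> SignedPayload<T> for BillPayload<T::Public> {
    fn public(&self) -> T::Public {
        self.public.clone()
    }
}

#[pallet::validate_unsigned]
impl<T: Config> ValidateUnsigned for Pallet<T> {
    type Call = Call<T>;

    fn validate_unsigned(_source: TransactionSource, call: &Self::Call) -> TransactionValidity {
        if let Call::bill_contract_unsigned { payload, signature } = call {
            // 1. The payload must be signed by the key it embeds.
            if !SignedPayload::<T>::verify::<T::AuthorityId>(payload, signature.clone()) {
                return InvalidTransaction::BadProof.into();
            }
            // 2. That key must belong to a current validator (assumed helper).
            if !Pallet::<T>::is_validator_key(&payload.public) {
                return InvalidTransaction::Custom(1).into();
            }
            ValidTransaction::with_tag_prefix("SmartContractBilling")
                .and_provides(payload.contract_id)
                .longevity(3)
                .propagate(true)
                .build()
        } else {
            InvalidTransaction::Call.into()
        }
    }
}
```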
> If all validators send transactions and all the fees always go to one of them (the block author), how can the validators' accounts be drained?
The distribution of contracts to be billed is uneven
> You can sign and submit a payload with the unsigned transaction and verify it in the runtime to know whether it was signed with a key belonging to a validator.
You can
Thank you, Dylan. It's much clearer now!
Update: After researching this, I found that relying on the next-block-author check in this case is unreliable and can cause more harm than good.
Let me explain why. First, the check happens in the off-chain worker, which is a process that starts on the validators every time a new block is imported and can run as long as it needs without affecting block production time, because it is decoupled from the executive module that is responsible for authoring the block; they just run in parallel. This means it is possible (and valid) for the off-chain worker's execution to span multiple blocks, in which case the check will only be true for a subset of the contracts that should be billed by this validator (we iterate over the contract IDs in the current shard and pre-check on each iteration whether the condition is true before sending the TX). We would hit this issue whenever billing the contracts in a shard takes longer than the time the validator spends authoring its block, so billing contracts can be a hit-or-miss process.
Second, Dylan mentioned that the reason for this check was to prevent validators from paying TX fees. However, submitting the transaction when the validator is the next block author does not guarantee that the same validator will select it from the TX pool. In congested network conditions, many TXs are submitted to the validators' TX pool. As a result, the transaction submitted by the off-chain worker running on one validator can end up in another validator's block, resulting in paying TX fees to that other validator.
Suggestion: It might be better to have all validators submit signed transactions and move the verification to the runtime. If the origin is any validator, fees would be skipped. We should also use some sort of verification to prevent the call from being executed multiple times for the same contract in the same block unnecessarily.
Outcome for the suggested flow:
It ensures that we bill all contracts that exist in any billing shard, regardless of how many.
It ensures that transaction fees are waived for any validator, even under heavy network utilization.
It ensures that contracts in any billing shard will be processed, even if one or more validators have the off-chain worker disabled (redundancy).
Update: The `is_next_block_author` function was removed, and the verification now happens at runtime. If the caller is a validator, the transaction fee is waived. This change allows any validator with off-chain workers enabled to send transactions, making the billing process more resilient; it no longer requires all validators to have off-chain workers enabled.
To prevent duplicate transactions, billed contracts are tracked in `SeenContracts` storage. This storage is cleared in `on_finalize` to prevent unnecessary use of storage space.
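A minimal sketch of that flow, assuming the names used in this comment (`SeenContracts`, `bill_contract_for_block`) plus hypothetical helpers `do_bill_contract` and `is_validator` and a hypothetical error variant; the actual PR code may differ:

```rust
use frame_support::{dispatch::Pays, ensure, pallet_prelude::*};
use frame_system::pallet_prelude::*;
use sp_std::vec::Vec;

// Contracts already billed in the current block; cleared again in `on_finalize`.
#[pallet::storage]
pub type SeenContracts<T: Config> = StorageValue<_, Vec<u64>, ValueQuery>;

#[pallet::hooks]
impl<T: Config> Hooks<BlockNumberFor<T>> for Pallet<T> {
    fn on_finalize(_n: BlockNumberFor<T>) {
        // Drop the per-block de-duplication state so it never accumulates.
        SeenContracts::<T>::kill();
    }
}

#[pallet::call]
impl<T: Config> Pallet<T> {
    #[pallet::weight(Weight::from_parts(10_000, 0))] // placeholder weight
    pub fn bill_contract_for_block(
        origin: OriginFor<T>,
        contract_id: u64,
    ) -> DispatchResultWithPostInfo {
        let who = ensure_signed(origin)?;

        // Reject a contract that was already billed earlier in this block.
        ensure!(
            !SeenContracts::<T>::get().contains(&contract_id),
            Error::<T>::ContractAlreadyBilledInBlock
        );
        SeenContracts::<T>::append(contract_id);

        Self::do_bill_contract(contract_id)?;

        // Waive the transaction fee when the caller is a current validator.
        if Self::is_validator(&who) {
            return Ok(Pays::No.into());
        }
        Ok(().into())
    }
}
```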
Summary:
These changes ensure a more robust and efficient billing process. The PR is still WIP.
Here is an example that demonstrates the unreliability of the current approach: https://polkadot.js.org/apps/?rpc=wss://tfchain.grid.tf/ws#/explorer/query/13378163
You can see that different validators submitted the billContractForBlock transaction at different billing indexes and they ended up in the same block. Unfortunately, only one of them was submitted by the current block author; the others incurred fees.
Update: I conducted some tests on the new approach, and what works well now is that all validators who call billContractForBlock are exempt from transaction fees, regardless of whether the transaction ended up in a block produced by the sender or not.
Also, as long as at least one validator is running an off-chain worker, billing should work fine.
Kind of a blocker that needs more research: the new de-duplication mechanism is not perfect. Under the new approach, if multiple transactions try to bill the same contract within the same block, only one of them will succeed. The others will fail because of the new `SeenContracts` check at the beginning of the runtime dispatch. This approach can help save node resources by stopping unnecessary work early. However, I think it is not ideal to fill blocks with failed transactions, especially when there are a lot of off-chain workers enabled.
I am currently looking into implementing the validation before the execution phase, specifically at the stage of transaction pool checks. This validation occurs early on, when the transaction is examined for factors such as a valid origin. If the transaction fails this validation, it will not be included in the transaction pool or shared with other nodes.
Researching this would extend the time required for this fix, but if successful, it would better meet our requirements.
Update: I observed weird behavior while testing: the off-chain worker transaction fails with a low-priority error every 600 blocks (once per era), so contracts that are supposed to be billed at index 599 are always skipped. Still investigating this.
I experimented with an alternative approach that introduces a `SignedExtension` to the pallet-smart-contract module that ensures unique transaction processing for the `bill_contract_for_block` function by setting a `provides` tag equal to the `contract_id`.
This allows me to achieve the de-duplication early at the TX pool level, instead of implementing it in the runtime dispatchable.
- A `SignedExtension` named `ContractIdProvides` is introduced.
- It is applied to the `bill_contract_for_block` function. The `provides` tag is set to the `contract_id` of the transaction, ensuring it is processed only once per block.
- The transaction validation now uses the `provides` tag based on the `contract_id`, preventing duplicate transactions with the same `contract_id` from being included in the same block.
- The purpose of the `provides` tag is to prevent the same transaction (or a set of transactions with the same `contract_id`) from being included multiple times in a single block.
- The `provides` tag is used by Substrate's transaction pool to determine the uniqueness of a transaction. By setting `provides` to the `contract_id`, the transaction pool ensures that only one transaction with a particular `contract_id` is included in any given block. If another transaction with the same `provides` tag is submitted, it will be rejected as a duplicate.
- The `ContractIdProvides` extension is integrated into the runtime's transaction validation pipeline.
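For illustration, a rough sketch of the shape such an extension can take, based on Substrate's `SignedExtension` trait; the call matching, tag prefix, and trait bounds here are illustrative and may differ from the actual change:

```rust
use codec::{Decode, Encode};
use frame_support::traits::IsSubType;
use scale_info::TypeInfo;
use sp_runtime::{
    traits::{DispatchInfoOf, SignedExtension},
    transaction_validity::{TransactionValidity, TransactionValidityError, ValidTransaction},
};
use sp_std::marker::PhantomData;

/// Tags `bill_contract_for_block` transactions with `provides = contract_id` so the
/// transaction pool rejects a second transaction for the same contract in one block.
#[derive(Encode, Decode, Clone, Eq, PartialEq, TypeInfo)]
#[scale_info(skip_type_params(T))]
pub struct ContractIdProvides<T: Config + Send + Sync>(PhantomData<T>);

impl<T: Config + Send + Sync> core::fmt::Debug for ContractIdProvides<T> {
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
        write!(f, "ContractIdProvides")
    }
}

impl<T: Config + Send + Sync> SignedExtension for ContractIdProvides<T>
where
    <T as frame_system::Config>::RuntimeCall: IsSubType<Call<T>>,
{
    const IDENTIFIER: &'static str = "ContractIdProvides";
    type AccountId = T::AccountId;
    type Call = <T as frame_system::Config>::RuntimeCall;
    type AdditionalSigned = ();
    type Pre = ();

    fn additional_signed(&self) -> Result<(), TransactionValidityError> {
        Ok(())
    }

    fn validate(
        &self,
        _who: &Self::AccountId,
        call: &Self::Call,
        _info: &DispatchInfoOf<Self::Call>,
        _len: usize,
    ) -> TransactionValidity {
        // Only tag the billing call; every other call passes through untouched.
        if let Some(Call::bill_contract_for_block { contract_id, .. }) = call.is_sub_type() {
            return ValidTransaction::with_tag_prefix("ContractIdProvides")
                .and_provides(contract_id)
                .build();
        }
        Ok(ValidTransaction::default())
    }

    fn pre_dispatch(
        self,
        who: &Self::AccountId,
        call: &Self::Call,
        info: &DispatchInfoOf<Self::Call>,
        len: usize,
    ) -> Result<Self::Pre, TransactionValidityError> {
        self.validate(who, call, info, len).map(|_| ())
    }
}
```

In the runtime, such an extension would be appended to the existing `SignedExtra` tuple so it runs alongside the system extensions, which is what "integrated into the runtime's transaction validation pipeline" refers to above.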
Describe the bug
https://github.com/threefoldtech/tfchain/blob/8dfb941263450b62c2b4eb5e3d51bc9f18ba9966/substrate-node/pallets/pallet-smart-contract/src/billing.rs#L66-L69
I believe that this method sometimes returns silently even though the worker runs on a validator that is supposed to author the next block.
I found that there have been instances where a validator has multiple Aura keys in the keystore (keys were rotated), which caused this issue since we use `any_account()`. To mitigate this, I requested that ops ensure that the nodes' local keystore has only the relevant keys. Since then, things have improved.
However, we can also revise this part of the code and research if there is a better way to select the appropriate key and also make sure `is_next_block_author()` is reliable enough. This issue would also serve as a reference to a known issue in the current implementation.
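As a possible starting point for that research, a hedged sketch of one option: keep the signed-transaction flow but narrow the signer down with `with_filter` instead of letting `any_account()` pick whatever Aura key happens to be in the keystore. `expected_local_keys` is an assumed helper, and the call arguments may not match the real `bill_contract_for_block` signature:

```rust
use frame_system::offchain::{SendSignedTransaction, Signer};
use sp_std::vec::Vec;

impl<T: Config> Pallet<T> {
    fn submit_billing_tx(contract_id: u64) {
        // Assumed helper: the key(s) this node is expected to author with,
        // e.g. derived from the on-chain Aura authority set.
        let expected: Vec<T::Public> = Self::expected_local_keys();

        // Restrict the signer to those keys instead of any key in the keystore.
        let signer = Signer::<T, T::AuthorityId>::any_account().with_filter(expected);

        let result = signer.send_signed_transaction(|_account| Call::bill_contract_for_block {
            contract_id,
        });

        if let Some((account, res)) = result {
            log::info!(
                "billing tx for contract {} sent by {:?}: {:?}",
                contract_id,
                account.id,
                res
            );
        }
    }
}
```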