the chain shouldn't bill contracts on offline nodes

rawdaGastan commented 9 months ago

There shouldn't be a contract billing for a deployment against node of state down

https://github.com/threefoldtech/terraform-provider-grid/issues/763

renauter commented 9 months ago

Edit: OUTDATED

I had a look in code and, indeed, node state is not taken into account while billing (node state is only checked to prevent creating node/rent contracts while a node is down).

Since billing cycle is 1h, what we could do is considering to discard billing for this cycle if node state is down at end of cycle. So there is a potential "less than 1h" billing amount loss at this moment which will be compensated by a "less than 1h" billing amount win when node comes up. Are we OK with that? @LeeSmet @muhamadazmy @xmonader @sameh-farouk

Another important point would be to update lock_updated field in ContractLock object each time a cycle is not billed because it is used to calculate the elapsed time to bill for (see here). If not, after node comes up, user will still pay for a period that will include the down period.

sameh-farouk commented 9 months ago

Thanks for the report.

In this kind of issues, it’s essential to include the contract type, ID, node ID, and network information. the origin issue related to Terraform plugin so you can also attach any relevant logs or screenshots that might help us to diagnose the issue.

Now, given that we are talking about a node contract, To deploy a workload on grid node, a user needs to perform the following steps:

User create a contract on TFChain.
TFChain validates the contract, user, funds, and locks up the required public IPs from the farm object.
Once a contract is created, the contract ID and deployment are sent to the node (over RMB).
The node reads the contract data from TFChain, validates and applies them.
Deployment status can then be queried directly from the node (over RMB).
The Node periodically sends consumption reports back to TFChain for each deployed contract every x minutes.
TFChain periodically (based on the billing frequency of the contract) computes how much is being used and will bill the user based on the pricing policy configured in the farmer storage.

Now with this implementation in mind, can the chain bill a contract while a node is offline? Even a node is offline, all node contracts will be checked periodically, the chain can also bill a contract. However, it will only be billed for previous consumption that the node sent while it was up if it wasn't billed yet which totally valid. Also any previous locked amount will be periodically deducted every 24 hours from the user account and this also can be happens during the downtime of the node.

This does not mean that the user will pay for the service during downtime. The node is responsible for updating the chain with consumption reports, and no new reports can be sent while the node is down. Therefore, no further charges will be incurred.

Another guess is that node wasn't actually down, but not reachable over RMB due to this issue? in this case you can't communicate with node over RMB but node still report up-time. so how you observed that node state was down? and how you observed that the contract was billing specifically during the downtime?

To be able to look into this further, I will need more information on how you arrived at this conclusion.

Also I want to request @renauter input here.

LeeSmet commented 9 months ago

The chain doesn't track if a node is online or not, and it shouldn't since it doesn't care. The only thing provided on chain is the ability to submit uptime reports, so external tools can check if a node was online at some point in the past (and even then this is technically only an approximation).

As it stands, it should be noted that "a node" is a trusted actor on chain (as per the grid design), and thus we assume it submits valid usage reports for workloads. Other than that, a node being down is billed if it is a dedicated node being rented, or public IP's are being rented. In case of the former, it has been decided when the feature is implemented that this is client responsibility to cancel the contract in this case. In case of the latter, the billing is correct as the IP is still held on chain so nobody else can use it.

Unless there is some case where the chain bills contracts without proper trigger (in which case please clarify that and provide the required details to investigate that), this is all working as intended.

renauter commented 9 months ago

Thank you all for your inputs. To sum up, not checking if node is down during the billing is done on purpose in all contexts:

Node contract: since consumption reports (SU/CU/NU) are not sent when node is down we only bill for reserved public ips if any
Rent contract: assuming it is user responsibility to cancel contract if node id down, user will continue to be billed for resources and extra fee if any.

So we are all good

s-areal commented 9 months ago

This does not mean that the user will pay for the service during downtime.

Can you please confirm that user will not be billed for downtime nodes in node contracts. I feel somehow confused after reading @renauter and @sameh-farouk anwsers.

sameh-farouk commented 9 months ago

@s-areal Yes, no payment will be made for services or workloads on that node under node contracts (for any node resources), and I think @renauter shares my view.

The IPs are not tied to any specific node and are allocated from the farmer IPs pool when the contract is created, and they are returned to the pool only when the contract is terminated. Therefore, we keep charging for them as long as they are reserved by the contract, if any, regardless of the node status.

sameh-farouk commented 8 months ago

Closing for lacking essential info needed to diagnose the issue.

sameh-farouk commented 6 months ago

I realized that we made a mistake in thinking that this design only affects rent contracts. Node contracts are also impacted. We assumed that node contracts are billed only when there is actual consumption of IP/SU/CU/NU, which is true from tfchain’s perspective. However, from zos’s perspective (@muhamadazmy can correct me), SU/CU consumption are recorded only on workload deployment, changing and deletion, regardless of the actual up time. This is the part that we overlooked.

sameh-farouk commented 6 months ago

related to issue

LeeSmet commented 4 months ago

To the best of my knowledge, the (relevant) flow for node contracts should be like this:

contract is created and pub ip is allocated
contract is deployed by node
node sets the used CU/SU
node periodically runs the add_nru_reports extrinsic (in 1 hour intervals)
the actual uptime of the contract is calculated from the nru_consumption report
off chain validators call the bill_contract extrinsic, which properly calculates the cost and bills it. this account for the uptime to calculate how much SU/CU costs, adds NRU cost, and IP cost
repeat from step 4 until contract is cancelled.

This flow should only bill offline nodes for public IP's in use, which is correct since they are reserved on chain. NRU can't be billed since the offline node can't report NRU, and SU/CU should also not be billed. Of course, it is possible that there is a bug somewhere which causes SU/CU to be billed even if the node is down.

sameh-farouk commented 4 months ago

Of course, it is possible that there is a bug somewhere which causes SU/CU to be billed even if the node is down.

@LeeSmet This argument (called it a bug) is debatable. Based on the code and code comments, it appears that this particular concern was not taken into account during the implementation phase. Originally, contracts were meant to be billed for any incurred SU/CU, NRU, or IP costs.

IP costs: based on IP reservation.
NRU costs: based on network utilization reported by the node.
SU/CU costs: based on the deployment spec reported from node when it provision the workload, and the seconds elapsed.

https://github.com/threefoldtech/tfchain/blob/a0d30d08c5208b9d298ca7cd43e534cda3ce7d0a/substrate-node/pallets/pallet-smart-contract/src/billing.rs#L40-L62

Also about your suggested flow

the actual uptime of the contract is calculated from the nru_consumption report

I proposed a similar approach, but utilizing the node uptime reports, which I believe would be more pertinent. Why do you think node uptime should be inferred from NRU reports instead?

The obstacle lies in the fact that the chain runtime is unaware of the node’s online status and the actual node's uptime because we do not store this information in the chain state. Instead, we only pass it through events to the GraphQL.

So I can think of two options: 1- Store relevant info about the node uptime on chain to help chain runtime decide whether to bill SU/CU or not based on node actual uptime (curenntly it is done only based on seconds elapsed). Information needed to be stored should planed carefully as the billing and reporting done separately with different time window.

2- off load this to zos service and make it responsible for trigger the billing for all contracts deployed on it and provide the number of seconds to be billed taking into consideration last report and uptime.

LeeSmet commented 4 months ago

Also about your suggested flow

I don't know how it is actually implemented, only what was discussed and agreed. In this regard, if the implementation deviates from the agreed plan than I do consider that a bug. Also, I am not aware of any fundamental changes which were considered to this flow since it was agreed on.

The reason to use NRU reports is because they are a part of the billing flow regardless, and the way they track time is different. If uptime reports would be used, a contract would need to know exactly at what uptime it was deployed, when it was last billed, and need to handle node restarts (as those reset uptime). Uptime is also not tracked on chain as it stands, so that also needs to be added for this to work. On the other hand, in nru reports, the time reported is the time since last bill (capped to 1 hour iirc). This means that if a node can't submit a bill report on time, you can't get billed (since in general not submitting a report would indicate a network error, which means your workload is generally also not reachable/offline).

Offloading the billing to nodes has been considered in the past, but it was generally considered against because:

It adds functionality to the zos node which is technically should not really have. The job of the node in this flow is to provide capacity and report on its usage, not to decide how much a user should pay.
While zos nodes are currently trusted actors in the system, the aim is to reduce this as much as possible/eliminate this outright. Which leads to the following point
The current way provides some transparency. If a large bill is done on a contract, there is a chain of extrinsics/events which can be examined to verify how the amount was calculated. If all this logic moves to zos, it becomes a black box, and if a bill is disputed there is generally not really a way to figure out how that came to be (is it correct or a bug)

sameh-farouk commented 2 months ago

@LeeSmet Some considerations regarding NRU reports approach.

On the other hand, in nru reports, the time reported is the time since last bill (capped to 1 hour iirc). This means that if a node can't submit a bill report on time, you can't get billed (since in general not submitting a report would indicate a network error, which means your workload is generally also not reachable/offline).

ZOS only reports NRU when there is a consumption. if consumption is 0 then no report is sent. https://github.com/threefoldtech/zos/blob/33dd510e468d94390253319e1aae6c4935eee029/cmds/modules/provisiond/reporter.go#L401

ZOS also does not count any private NRU, only the traffic over public IP is reported. If a contract/workload doesn't have a public IP (e.g. uses Ygg, Wg, etc), the NU consumption will be always 0 and no NRU reports will be submitted for this contract. https://github.com/threefoldtech/zos/blob/33dd510e468d94390253319e1aae6c4935eee029/cmds/modules/provisiond/reporter.go#L414

For this approach to work, serious changes and decisions would need to be made to how ZOS submits reports and tracks metrics. Otherwise, workloads that do not have public IP addresses or have not sent or received any packets within a given time window would be considered offline, and would not be billed regardless of their online status.

threefoldtech / tfchain

the chain shouldn't bill contracts on offline nodes #873