xmonader commented 2 years ago

With current energy prices, we need to find a way to turn off nodes while still avoiding abuse.

The currently favored solution is Wake-on-LAN. However, this imposes some requirements: the farms need to be location based, with nodes physically in the same LAN, and the farms need to keep some hot capacity always available for provisioning. The remainder can be cold capacity that is subject to random power on/off procedures.

issues

delandtj commented 2 years ago

Well I was going to create the issue here, but I'll add comments here

delandtj commented 2 years ago

Ability to poweroff nodes in the Grid

Preamble

The energy cost of a node always has to be taken into account, but nowadays, and especially in Europe, it is of vital importance for the viability of the grid itself.
Apart from networking overhead, the cost of running a node is now basically the highest investment over 5 years, more so than the investment in the hardware itself. While it doesn't seem like much when calculated per month, future prices, even when 'the new normal' sets in with a new equilibrium, will be nowhere in the vicinity of the prices set a year ago.

The pursuit of dramatically lowering the energy consumption of the Grid in order to be green(er) now has another incentive: money.

While we all wish all nodes have their cpus, memory and storage maxed out and the world IT community is knocking on all our doors for more, we're not there yet.

Some farms are 'just' online. No workloads. Just generating tokens for the Farmer. In principle, these nodes represent the farmer's investment in the size of the Grid, and while some costs are involved, like housing and networking, there should be no need for nodes that have no workloads running to be powered on in the first place.

Powering nodes in function of necessity

Requirements

Powering off a node should be straightforward: When a node has no workload at all, it can be shut down.

Question:

Do we allow for a grace period before shutting down?
Who shuts the node down (zos or 'the grid')?
How many nodes in a farm before we shut down nodes?

Powering on a node will need to be handled by a smarter provisioning scheme that depends on the size of a farm.

We only support powering off nodes in farms that have 1+n nodes in the same network.

One node in a farm will always be on, hosting the poweron service.

For powered off nodes to be off as long as possible, finding a node to deploy a workload on becomes a bit more hairy, and we'll need a lot of verification that a recently powered on node has started properly and is capable of hosting workloads.

Powered off nodes will generate the same amount of tokens as if powered on, but need to be regularly (randomly?) powered on to formally acknowledge their existence.

Given the sheer number of interfaces that PDU brands have, it would be virtually impossible to support them all, so powering off/on should not be done with a PDU. Moreover, we can't assume that a farmer always has a PDU for powering his nodes. So no PDU.

Technical

Poweroff

For a node to be able to be powered off, we'll need:
- acpid in ZOS (sorely missing since forever)
- depending on whether the shutdown is done by zos itself (zos sees no workloads, thus the node shuts down), adaptations for zos and some daemon to tell zos to go to sleep.

Poweron

Powering on a node will be done with WoL (Wake-on-LAN). 95% of motherboards support that feature; for nodes that don't, we'll need a kernel command line option (nopoweroff) to exclude them from the poweroff/on scheme. So to be able to WoL a node, we'll need:
- the BIOS to support it and have it enabled
- a daemon that can receive messages from 'the grid' to power on a node, running on all nodes that are live in a farm on the same network.
- provisioning that selects nodes to start in function of required workload
- probably different remuneration when a user selects a farmer and imperatively wants to have a node booted.

Finding nodes for provisioning in function of farm

Provisioning is now set by the user, who selects a node to deploy a workload on. We should be able to specify a farm instead of a node. That way, the provisioning can select a node that still has resources available and is powered on. Only when all resources of the powered on nodes are utilized do we power on another node to make more resources available. Powering on a node by specifying a farmer who still has powered off nodes, without first searching for farmers that have powered on resources, should be more costly. (is that part of requirements?)

Verification

Nodes that are eligible for powering off, and consequently are powered off, should be powered on each day and report the time during which they were effectively powered off. Alternatively, we could ask the WoL daemon in the farm (every powered node in a farm has one) to boot a specific node at random, to alleviate the pressure on the hub that starting all nodes at the same time would cause.

Token remuneration

Nodes that are powered off are in essence the same as powered on nodes in terms of uptime, so we'll need to come up with a solution for that.

delandtj commented 2 years ago

I added 'Technical' that people can fill in implementation details and issues in the concerned repos

maxux commented 2 years ago

Just checked, acpid is not needed, we can handle that via zinit directly :)

muhamadazmy commented 2 years ago

Suggestion on how this can work on ZOS.

Farmer (or the chain) has to elect a single node to be the power master. This node will never shut down.
- Once a node is elected, the power daemon on this node will run in a special manager mode.
Other nodes will run in a follower mode and just listen for shutdown events.
The idea is that the manager listens to the chain (state). The power management decision needs to be taken from the chain to avoid races (a node that has an active contract can't be turned off, even if the node itself hasn't received the deployment yet).
Once the daemon knows it needs to shut a node down, it can query the node's local IP by sending the proper RMB command to get its network setup. The idea is that we can tell whether a node lives in the same LAN, and not elsewhere, by checking that it has the same subnet.
Once the local IP is known, we can send an HTTP request to power off. This request has to carry the following information:
- the signature of the power manager node, so that a node can reject fake requests initiated by unknown sources.
- the ID of the target node, to make sure we don't land on the wrong node (if we have multiple power managers in 2 different collision domains that use the same subnet)
If the request succeeds:
- The node can prepare to power off by sending the proper state and its last uptime to the chain before powering off.
- The node can set its state to PowerOff() so we know exactly who triggered the power off of the node.
Power managers that powered off nodes are responsible for waking them up:
- 1) randomly, to validate that the nodes still exist
- 2) when the chain requires the target state of the node to be UP.
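
The acceptance checks a follower runs on an incoming power-off request could be sketched roughly as below. This is only an illustration: `PowerOffRequest`, `Rejection`, and `validate_power_off` are hypothetical names, and the signature scheme is not specified in this thread, so verification is abstracted behind a caller-supplied closure.

```rust
/// Hypothetical power-off request sent by the power manager over HTTP.
pub struct PowerOffRequest {
    pub manager_id: u32,
    pub target_id: u32,
    pub signature: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum Rejection {
    /// The request was meant for another node (e.g. same IP, other segment).
    WrongTarget,
    /// The signature does not match the claimed power manager.
    BadSignature,
}

/// Validates a request on the target node.
/// `verify(signer_id, message, signature)` is an assumed stand-in for the
/// real signature check against the manager's public key.
pub fn validate_power_off<F>(
    my_id: u32,
    req: &PowerOffRequest,
    verify: F,
) -> Result<(), Rejection>
where
    F: Fn(u32, &[u8], &[u8]) -> bool,
{
    if req.target_id != my_id {
        return Err(Rejection::WrongTarget);
    }
    // Assumption: the signed message is simply the target node ID.
    let message = req.target_id.to_le_bytes();
    if !verify(req.manager_id, &message, &req.signature) {
        return Err(Rejection::BadSignature);
    }
    Ok(())
}
```

Checking the target ID before the signature keeps the cheap check first; a real implementation would probably also sign a timestamp or nonce to prevent replays, which is not discussed here.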

note about chain state:

I think nodes can have power management attribute attached to it that can have 2 attributes:
- TargetState: Up or Down
- CurrentState: Up, Down(requester, time)

If a node is woken up and finds that its target state is Down, it can simply send the uptime report and go back to sleep automatically. This will make it easier for the power manager to randomly nudge nodes to prove their existence.
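
A minimal sketch of that boot-time decision, assuming the node can read its own TargetState from the chain. The type and function names (`TargetState`, `WakeAction`, `on_boot`) are made up for illustration and are not existing zos code.

```rust
/// Target power state as stored on chain (hypothetical representation).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TargetState {
    Up,
    Down,
}

/// What a freshly booted node should do next.
#[derive(Debug, PartialEq)]
pub enum WakeAction {
    /// Send the uptime report to the chain, then power off again.
    ReportUptimeAndSleep,
    /// Mark CurrentState as Up and resume normal operation.
    StayUp,
}

/// Decides the action after a wake-up based only on the target state.
pub fn on_boot(target: TargetState) -> WakeAction {
    match target {
        TargetState::Down => WakeAction::ReportUptimeAndSleep,
        TargetState::Up => WakeAction::StayUp,
    }
}
```

Because the decision depends only on the chain state, a random nudge from the power manager needs no extra message: the node wakes, reports, and goes back to sleep on its own.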

muhamadazmy commented 2 years ago

The idea behind having the power manager send the power off decision to the node, although the node could just check its own target state, is that this validates that the target node is reachable by the power manager and hence can be woken up again.

This solution also works if you have multiple LANs that join the same farm. A farmer can then have a power manager selected in each LAN with no issue.

DylanVerstraete commented 2 years ago

I think we also need to rework the way deployments are created. I think the user needs to have an agreement with a farm rather than a node, since a user can only use online nodes in a farm and doesn't really know in advance which nodes these are. If we keep supporting NodeContract(nodeID), a user can create a node contract for a sleeping node in a farm and never have its workload deployed.

I think the managing node should also act as provisioning manager. The user should be able to create a contract with a Farm and the managing node should see this contract being created and redirect the contract to a node that is able to accept the workload.

Maybe if we keep the NodeContract the chain can actually check if the node is up or down and return an error to the user in case the node is down.

muhamadazmy commented 2 years ago

The user can know the state of the node from the TargetState of the node (Up or Down), so he is free to choose a node that is UP from the start. If a user chooses a node that is down, the chain can then bring the state of the node up. This will of course take some time to bring the node up fully. Hence the user needs to know he has to wait until the node is up before talking to the node directly.

DylanVerstraete commented 2 years ago

So per discussion with azmy;

We can extend the code on the chain so that create_node_contract takes the following into account:

It will check if the contract can fit on the specified node, if not it will trigger a random "down" node in a farm to boot up and set the node id of the contract to the booted up node. This is only possible if the user specifies the used resources on contract creation!
If the user tries to deploy on a node that is marked as "down" he will get an error that the node is unavailable at this time.
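
The two rules above could look roughly like this on the chain side. This is a sketch, not tfchain code: the types, field names, and error variants are hypothetical, and the "random" choice of a down node is replaced by "first down node that fits" to keep the example deterministic.

```rust
/// Hypothetical resource vector; field names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Resources {
    pub cpu: u64,
    pub memory: u64,
    pub storage: u64,
}

impl Resources {
    /// True when `req` fits inside `self` on every dimension.
    pub fn fits(&self, req: &Resources) -> bool {
        req.cpu <= self.cpu && req.memory <= self.memory && req.storage <= self.storage
    }
}

pub struct Node {
    pub id: u32,
    pub up: bool,
    pub free: Resources,
}

#[derive(Debug, PartialEq)]
pub enum Placement {
    /// The contract fits on the node the user asked for.
    OnRequested(u32),
    /// The given "down" node must be booted and gets the contract.
    WakeAndUse(u32),
}

#[derive(Debug, PartialEq)]
pub enum ContractError {
    UnknownNode,
    NodeUnavailable,
    NoCapacity,
}

/// Sketch of the extended contract creation check.
pub fn create_node_contract(
    farm: &[Node],
    node_id: u32,
    req: Resources,
) -> Result<Placement, ContractError> {
    let node = farm
        .iter()
        .find(|n| n.id == node_id)
        .ok_or(ContractError::UnknownNode)?;
    // Deploying directly on a node marked "down" is an error.
    if !node.up {
        return Err(ContractError::NodeUnavailable);
    }
    if node.free.fits(&req) {
        return Ok(Placement::OnRequested(node.id));
    }
    // Otherwise boot a down node of the same farm that can host the request.
    farm.iter()
        .find(|n| !n.up && n.free.fits(&req))
        .map(|n| Placement::WakeAndUse(n.id))
        .ok_or(ContractError::NoCapacity)
}
```

Note that the fit check is only possible because `req` is passed in at creation time, which is the point made above about the user specifying the used resources up front.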

Questions

This brings up the question if we actually need to specify a node id on contract creation or actually a farm ID. If we provide a farm ID and Resources to contract creation the chain can select the node for the user.

This also is in contrast with the proposed solution for capacity planning here: https://github.com/threefoldtech/home/issues/1304#issuecomment-1245225199

LeeSmet commented 2 years ago

The main problem I have with a farmer-elected manager is that this introduces a single point of failure in the design. On the contrary, if the logic to select nodes to power off is idempotent, no single central manager is required. Depending on farm size, multiple nodes can be left operational, which can then use a slot-based leadership system to decide who will handle which events.

DylanVerstraete commented 2 years ago

@LeeSmet what about the capacity planning? What are your thoughts on above comment?

delandtj commented 2 years ago

There are two types of events:

Poweroff

For poweroff, I'd say that an event on the chain should be monitored for. Having a node generate an http/rpc/whatevvah to another node in the same network is just adding to complexity. Simplicity is key here. Documenting the fact that nodes in a farm need to live on the same LAN is more than good enough. So for poweroff there is only the grid that decides, keeping at least one live.

Poweron

Simplicity applies here too. Powering on is just sending a magic packet to a MAC address. That's it. So if we need farm-based provisioning for this to work, all powered on nodes can listen for an event specific to that farm, and just send out the magic packet. No master election, no slots, no nothing. All live nodes barf out a few magic packets for that MAC and the node will boot. Heck, do that even for a few minutes and verify if at a certain point the node that is woken up replies to something or registers something on the chain.
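
To show how little is involved, here is a sketch of building and broadcasting a magic packet with only the Rust standard library. The packet layout (6 bytes of 0xFF followed by the target MAC repeated 16 times) is the standard Wake-on-LAN format; the function names, the limited broadcast address, and UDP port 9 are choices made for this example, not something decided in this thread.

```rust
use std::net::UdpSocket;

/// Builds a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times.
pub fn magic_packet(mac: [u8; 6]) -> [u8; 102] {
    let mut packet = [0xFFu8; 102];
    // Bytes 6..102 are 16 consecutive copies of the 6-byte MAC address.
    for chunk in packet[6..].chunks_exact_mut(6) {
        chunk.copy_from_slice(&mac);
    }
    packet
}

/// Broadcasts the magic packet on the local network.
/// Port 9 (discard) is a common convention; the NIC only looks at the payload.
pub fn wake(mac: [u8; 6]) -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.set_broadcast(true)?;
    socket.send_to(&magic_packet(mac), "255.255.255.255:9")?;
    Ok(())
}
```

Since sending the packet to an already running machine has no effect, every live node in the farm can call `wake` for the same MAC without coordination.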

Another thing: naming is important: we already have down as not reachable, or not available in any way. Shouldn't we call it 'sleeping' or something like that ?

delandtj commented 2 years ago

Contracts per farm, indeed, that way no-one can generate workloads like Network Resources just to start all nodes in a farm

muhamadazmy commented 2 years ago

@delandtj

If we enforce the rule that a single farm needs to exist on the same LAN, then indeed we can drop most of the complexity. Nodes can listen for their own power off signal, which makes the grid the sole manager of the power management. Bringing a node up should then generate an event that can be picked up by all nodes in the same farm, hence they can all generate the magic packet to wake up their sleeping friend.

We then still need to discuss how the grid is going to decide which nodes need to go to sleep, and under what conditions it can bring them up again.

Also regarding having the deployment contract with the farm itself, and not the node: the grid then still needs to select a node and assign it to the contract (and possibly bring it up), which means capacity planning entirely has to happen on the chain (which I don't mind if we already have all the data). Once a node is selected and assigned to a contract, the user then needs to "wait" until the node status is fully up before he can contact the node to actually deploy his stuff.

Those changes combined (imho) are a major change to the grid (hence a new major version?)

scottyeager commented 2 years ago

> Only nodes that support TPM will be able to be powered off.

What's the thinking behind this requirement @delandtj?

muhamadazmy commented 2 years ago

@delandtj @DylanVerstraete and @LeeSmet we really need to agree on the final approach to be able to create the related (technical) issues. Could you please read my previous comment, and comment on whether this is technically good?

DylanVerstraete commented 2 years ago

Looks good yes. I only think the user experience will get worse with this power management feature. If the user wants to deploy on a farm that needs to boot a node in order to host his workload, then he will possibly have to wait for like 5-10 minutes.

delandtj commented 2 years ago

So in a nutshell: what do we agree on here? I mean, we also need to set in stone what the implementation details will be.

delandtj commented 2 years ago

> Looks good yes. I only think the user experience will get worse with this power management feature. If the user wants to deploy on a farm that needs to boot a node in order to host his workload, then he will possibly have to wait for like 5-10 minutes.

This can be messaged

muhamadazmy commented 2 years ago

Okay, I will try to write down a dump of all changes that are required, based on our meeting regarding capacity planning with power management. Since nodes will be sleeping, a user cannot choose a node to deploy on; it's up to the grid to find the most suitable node, with the option of bringing nodes up if needed.

Creating a contract

The user creates a contract with a (farm id), not a (node id). It's not up to the user to choose the node; it's decided by the farm (explained later).
The user needs to specify the required capacity on contract creation. This capacity is given in compute, memory, and storage units (CU, SU, HU, and MU).
- Later, a deployment can only utilize up to the given capacity, not exceed it. Bills are computed based on this capacity (plus NU usage reported by the nodes at a later stage).
Once the contract is created, the user has to wait until a node is assigned.

Node Assigning

Once a contract is created, all of the farm's up nodes will receive an event that a contract was created that needs a node.
Nodes will then use all information from the grid (which nodes are in the farm, and which are up or down) to compute the best node to deploy this contract on.
If no more capacity is available, a "sleeping" node needs to be woken up (maybe choose the one with available capacity and the lowest id).
Update the contract with the node id:
- Since multiple nodes can try to do this update, the grid needs to reject contract updates after the first node is selected and set. The grid needs to validate that the update call indeed comes from one of the farm nodes.
- (Maybe later a synchronization mechanism can be implemented so that not all farm nodes do the same computation every time a contract is created.)
- Once the node id is set, the grid needs to make sure the node power status is set to up.
- If the selector node (the one that updated the contract with the target node id) has succeeded in setting the contract node id (this can be the same node), it can send a Wake-on-LAN signal to bring the node up anyway. This is safe to send regardless of the power status of the node.
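
A rough sketch of the two halves of node assigning: the selection every up node computes, and the "first update wins" rule the grid enforces. Everything here is hypothetical (names, a single `free_units` number instead of CU/SU/HU/MU, and "lowest id" as the tie-breaker for up nodes as well); it only illustrates why concurrent selectors are harmless.

```rust
/// Hypothetical view of a farm node as seen by a selector node.
pub struct NodeInfo {
    pub id: u32,
    pub up: bool,
    pub free_units: u64,
}

/// Picks a node for `required` capacity.
/// Returns (node id, needs_wake): up nodes are preferred, then sleeping
/// ones; within each group the lowest id wins so all selectors agree.
pub fn pick_node(nodes: &[NodeInfo], required: u64) -> Option<(u32, bool)> {
    let lowest = |up: bool| {
        nodes
            .iter()
            .filter(|n| n.up == up && n.free_units >= required)
            .map(|n| n.id)
            .min()
    };
    lowest(true)
        .map(|id| (id, false))
        .or_else(|| lowest(false).map(|id| (id, true)))
}

/// Hypothetical contract record on the grid.
pub struct Contract {
    pub farm_id: u32,
    pub node_id: Option<u32>,
}

#[derive(Debug, PartialEq)]
pub enum AssignError {
    CallerNotInFarm,
    TargetNotInFarm,
    AlreadyAssigned,
}

/// Grid-side update: only farm nodes may call it, and only the first
/// successful call sets the node id; later calls are rejected.
pub fn assign_node(
    contract: &mut Contract,
    farm_nodes: &[u32],
    caller: u32,
    target: u32,
) -> Result<(), AssignError> {
    if !farm_nodes.contains(&caller) {
        return Err(AssignError::CallerNotInFarm);
    }
    if !farm_nodes.contains(&target) {
        return Err(AssignError::TargetNotInFarm);
    }
    if contract.node_id.is_some() {
        return Err(AssignError::AlreadyAssigned);
    }
    contract.node_id = Some(target);
    Ok(())
}
```

Because `pick_node` is deterministic, nodes that see the same grid state propose the same target, and the rejection in `assign_node` only matters when their views differ.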

Deployment

Once the contract node id is set, the user is ready to contact the node to deploy his contract as usual.

Notes

Nodes no longer need to update the contract with the capacity used by the deployment, since this is selected by the user from the beginning.
If a deployment requires more capacity than what is assigned to the contract, the node can simply return an error on deploy.
Billing is now done against the reserved contract capacity, even if the actually deployed capacity is less than what was reserved.

Changes related to capacity management

Changes needed to `zos`.

[ ] Validation of deployment required capacity against contract on the chain
[ ] A more robust events mechanism. Right now we rely on the ws connection, which means events can be lost on a connection drop. (This was fine before because the node resyncs on connection restoration.) We should do a block scan instead (how heavy is that?)
[ ] Build handler to the contract creation event to find the best suitable node

Changes needed to chain

[ ] Contract type changes to have farm-id, optional node-id, and reserved capacity
[ ] A map to list nodes by farm id (currently this information is available on graphql only, but we need to have it on the chain for decentralization)
[ ] @DylanVerstraete please add more ?

Note, those changes are related to capacity planning only and not the entire power management story.

muhamadazmy commented 2 years ago

After a little discussion with @DylanVerstraete we agreed on the following: to improve events processing, we will also keep a map (per farm) of created contracts that still need a node-id. This means that if the events stream is interrupted, the node can still check whether the state of the map has changed. Contracts that get their node id are removed from the map.
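
One way to picture that map, as a sketch only: the `PendingContracts` type and its methods are hypothetical, and the real thing would live in chain storage rather than in a local struct.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Contracts per farm that have been created but have no node id yet.
#[derive(Default)]
pub struct PendingContracts {
    by_farm: BTreeMap<u32, BTreeSet<u64>>,
}

impl PendingContracts {
    /// Called when a contract is created for a farm.
    pub fn contract_created(&mut self, farm_id: u32, contract_id: u64) {
        self.by_farm.entry(farm_id).or_default().insert(contract_id);
    }

    /// Called once a contract gets its node id; it leaves the map.
    pub fn node_assigned(&mut self, farm_id: u32, contract_id: u64) {
        let now_empty = match self.by_farm.get_mut(&farm_id) {
            Some(set) => {
                set.remove(&contract_id);
                set.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.by_farm.remove(&farm_id);
        }
    }

    /// What a node polls after a dropped event stream: everything in its
    /// farm that still needs a node.
    pub fn pending(&self, farm_id: u32) -> Vec<u64> {
        self.by_farm
            .get(&farm_id)
            .map(|set| set.iter().copied().collect())
            .unwrap_or_default()
    }
}
```

A node that reconnects only has to call something like `pending(farm_id)` to recover the contracts it missed, instead of replaying every event.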

AhmedHanafy725 commented 2 years ago

On the node assigning: IMNSHO we will need to be able to deploy on different nodes for something like Kubernetes clusters (their VMs shouldn't all be deployed on the same node).

muhamadazmy commented 2 years ago

@AhmedHanafy725 yes, you are right. @rkhamis brought this up during the meeting and I forgot to document it here in the issue. My suggestion is to create a special type of contract, which could be called ClusterContract. It is basically a set of contracts plus a policy. Once created, the capacity planning process will know (based on the policy) that those contracts cannot be deployed on the same node, so each sub-contract is assigned a new node.

The process can go like this

Create a cluster contract with a policy (say the default policy is "mutually-exclusive", not sure if that's the best name), which means sub-contracts can't be deployed on the same node. This cluster contract itself is free and does not reserve any capacity.
Then you continue by creating Node contracts normally, but you set their cluster ID to your cluster contract ID.
On each contract creation, the capacity planner will make sure the node is not already used by any other contract in the same cluster.

muhamadazmy commented 1 year ago

We had discussions regarding real-life use cases (k8s cluster, and separate network workloads). A contract object will have this new attribute:

policy: an enum with the following values:
- any: it's up to the capacity planner to find a suitable node, no restrictions except the required resource capacity
- join(contract-id): this contract must use the same node id as the given contract id. In other words, they have to be deployed on the same node. If the requested capacity can't be satisfied by the given node, contract creation fails with an error.
- exclusive(group-id): where group-id is the id of a group object. A group object only has an id and an owner (twin-id); the id is used to group contracts. When using this policy, contracts that use the same group cannot share a node-id, so each contract in the group needs to have a different node-id.
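
As a sketch of how a capacity planner might apply these three values when considering a candidate node (capacity checks left out). The `Policy`, `Contract`, and `node_allowed` names and the id types are assumptions for this example.

```rust
/// Placement policy attached to a contract (sketch of the enum above).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Policy {
    Any,
    /// Must land on the same node as the given contract id.
    Join(u64),
    /// Must not share a node with other contracts of the given group id.
    Exclusive(u64),
}

/// Hypothetical, already-created contract as seen by the planner.
pub struct Contract {
    pub id: u64,
    pub node_id: Option<u32>,
    pub policy: Policy,
}

/// Returns true if `candidate` is an acceptable node for a new contract
/// with `policy`, given the contracts that already exist.
pub fn node_allowed(policy: Policy, candidate: u32, existing: &[Contract]) -> bool {
    match policy {
        Policy::Any => true,
        // Only the node already assigned to the joined contract is valid.
        Policy::Join(contract_id) => existing
            .iter()
            .any(|c| c.id == contract_id && c.node_id == Some(candidate)),
        // No other contract of the same group may already use this node.
        Policy::Exclusive(group_id) => !existing
            .iter()
            .any(|c| c.policy == Policy::Exclusive(group_id) && c.node_id == Some(candidate)),
    }
}
```

With this shape, `Join` on a contract that has no node yet rejects every candidate, which matches the idea that VM contracts are assigned first and network contracts join them afterwards.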

Use case:

On deploying k8s start by:

Create a group object
Batch create contracts for the VMs with the right capacity with an exclusive(group.id) policy.
Once all VMs are assigned a node id, build all network workloads, where each has a join policy to the corresponding VM contract (you need a network next to each VM). Note that no deployments are done yet.
Since your vm contracts already have a node-id you can pre-calculate your network workload config.
Once network contracts are created, deployment can start.

xmonader commented 1 year ago

According to Jan, the booting time can be between 2 and 10 minutes, which is .. bad. I guess that means the power manager will be the main node to provision resources on, and it needs to automatically boot other nodes when it reaches a specific threshold. But that's also quite cumbersome: e.g. if someone wants a node with a GPU and the power manager has no GPU, the user may end up waiting 2-10 minutes for the VM to boot.

Also, not all nodes are created equal, some could be specialized for cpu, ram, storage, or gpu, some sort of tagging notation might be needed to wakeup the right node(s)

muhamadazmy commented 1 year ago

Iteration over the power management

We need to assume that a single farm can span multiple LANs. This is an iteration over this comment.

Node

each node object has

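// Power state the grid wants the node to be in.
enum PowerTarget {
    Up,
    Down,
}

// Power state the node actually reports.
enum PowerState {
    Up,
    // ID of the leader that requested the power off (u32 is assumed here).
    Down(u32),
}

struct Power {
    target: PowerTarget,
    state: PowerState,
}

struct Node {
    power: Power,
    // ... other node fields
}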

General case

Let's assume a Farm F that lives in multiple LANs (collision domains). as follows:
segment 1 S1 has node [N1, N2, N3, N4]
segment 2 S2 has nodes [N5, N6]
segment 3 S3 has nodes [N7]

Nodes can find out about all their direct neighbor nodes by simply getting information about all nodes in the farm, then trying to reach them over the local zos IP. This needs an HTTP service that is only available on the local zos interface. The service needs to return a signed response; this way we can guarantee a node is exactly what it claims to be (to avoid a situation where nodes on different segments have the same private IP). In the example above, N7 and N4 can for example have the same private IP.

This way each node can learn about its immediate neighbors that live on the same segment. For each segment at least a single power manager is elected. Election is very simple:

Any node(s) with public-config become the leader
Otherwise, the node with the lowest ID

Hence in the example above:

S1: -> N4 is selected because it has public-config
S2: -> N5 is selected because it has the lowest ID in that segment.
S3: -> N7 is selected because it's the only node in the segment.
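
The election rule is small enough to sketch directly. The `Node` struct and `elect_leaders` function below are illustrative names only; the function returns a list because more than one node in a segment may have public-config.

```rust
/// Minimal view of a node for election purposes (hypothetical).
pub struct Node {
    pub id: u32,
    pub has_public_config: bool,
}

/// Elects the leader(s) of one segment:
/// every node with public-config, otherwise the node with the lowest ID.
pub fn elect_leaders(segment: &[Node]) -> Vec<u32> {
    let mut public: Vec<u32> = segment
        .iter()
        .filter(|n| n.has_public_config)
        .map(|n| n.id)
        .collect();
    if !public.is_empty() {
        public.sort_unstable();
        return public;
    }
    // No public node: fall back to the lowest ID (empty for an empty segment).
    segment.iter().map(|n| n.id).min().into_iter().collect()
}
```

Run against the three segments of the example, this yields N4 for S1, N5 for S2, and N7 for S3. Every node can compute the result locally from the same inputs, so no messages are needed to agree on the leader.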

Then:

A leader node will never shut down, even if the grid sets its power.target to Down.
If a leader receives a power down event for a neighbor node, it will use the local HTTP API on the target node to communicate the request to power off. This will work only if:
- The node is still reachable on this segment
- The node is indeed the right node (validation of signature)
- The target node will accept a power off request only from a neighbor node (verified by signature) that is indeed a node from the same farm.
- If all verifications pass, the node will set its state on chain to Down(leader), where leader is the ID of the node that requested the power off.
If a leader receives a power up event for a node and the node state is set to Down(id), where id is the leader's own ID, it means this leader is the node that requested the power off for that node. Hence it can then send the WOL packet.
On receiving a WOL, once the node is fully booted, it does the following:
- If the target power state is still Down, the node can send an uptime report and power itself off again. Nothing changes. It's done like this to handle random power nudges for capacity validation.
- If the target power state is set to Up, the node updates its power.state to Up and continues normal operation.
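
The leader-side half of these rules can be sketched as a single match. The names below (`PowerState`, `LeaderAction`, `on_power_up_event`) are assumptions for illustration and u32 is assumed as the node ID type.

```rust
/// Reported power state of a node; Down carries the requesting leader's ID.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PowerState {
    Up,
    Down(u32),
}

#[derive(Debug, PartialEq)]
pub enum LeaderAction {
    /// This leader powered the node off, so it sends the WOL packet.
    SendWol,
    /// Node is already up, or another leader is responsible for it.
    Ignore,
}

/// What a leader with ID `my_id` does on a power up event for a node
/// whose current on-chain state is `node_state`.
pub fn on_power_up_event(my_id: u32, node_state: PowerState) -> LeaderAction {
    match node_state {
        PowerState::Down(leader) if leader == my_id => LeaderAction::SendWol,
        _ => LeaderAction::Ignore,
    }
}
```

Recording the leader ID in the Down state is what lets leaders in other segments ignore the event without probing the network.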

Now back to the example above. Let's assume this farm is completely free of workloads. The grid will decide that it can power off all nodes except the public node (N4). So let's say it sets the target state of all nodes to Down, except for N4 (the public node).

If you follow logic above we will end up with following state:

N1{power.target = down, power.state: down(N4)}
N2{power.target = down, power.state: down(N4)}
N3{power.target = down, power.state: down(N4)}
N4{power.target = up, power.state: up}
N5{power.target = down, power.state: up} <- while target is down the node will never shutdown because it's the only leader in its segment
N6{power.target = down, power.state: down(N5)}
N7{power.target = down, power.state: up} <- while target is down the node will never shutdown because it's the only node in its segment.
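
The table above can be reproduced with a few lines, assuming every power-off request succeeds and each segment has exactly one leader. `Target`, `State`, and `settle` are hypothetical names used only for this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Target {
    Up,
    Down,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Up,
    /// Powered off on request of the given leader.
    Down(u32),
}

/// Computes the resulting state of each node in one segment.
/// `segment` lists (node id, target); `leader` is the segment's leader.
pub fn settle(segment: &[(u32, Target)], leader: u32) -> Vec<(u32, State)> {
    segment
        .iter()
        .map(|&(id, target)| {
            // The leader stays up regardless of its target.
            let state = if target == Target::Down && id != leader {
                State::Down(leader)
            } else {
                State::Up
            };
            (id, state)
        })
        .collect()
}
```

For segment S2, for instance, `settle(&[(5, Target::Down), (6, Target::Down)], 5)` gives N5 as Up and N6 as Down(5), matching the listing.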

Notes

Nodes will keep discovering their neighbors, but even with nodes powering down the leaders will mostly never change, because they will keep having the lowest IDs.
I really think the farmer should have the power to set a node target to UP (but never down), to be able to fix a situation where a node is moved to another network or where the leader node is broken forever.

brandonpille commented 1 year ago

I have a couple of questions:

So the target is the state that the node should be moved into and the state is the actual state it is currently in?
When and how will the current state be modified to the target state?
When do we accept deployment contracts?
When do we start billing? From the moment it is up or from the moment we create the capacity reservation contract?

muhamadazmy commented 1 year ago

@brandonpille

Yes
That's up to the nodes to do, via an extrinsic that is only called by the node. A node might decide not to go into the target state at all if there is no way to bring it up again.
Everything regarding deployment contracts stays as we discussed before (with capacity reservation and such), hence the grid can decide to set the target state to UP at any moment.
Billing is done on reserved capacity

This does not change the contract reservation and billing cycle. This is solely related to the node power cycle. Nothing much changes in the grid except for the "target" and "current" state, and the function for the node to set its current state.

DylanVerstraete commented 1 year ago

@muhamadazmy I think Brandon asks if the billing should trigger even if the node is still down (if it for some reason could not be brought up)

brandonpille commented 1 year ago

Yes I meant do we bill even if the node is still down or whenever the change state extrinsic is called?
Do we accept creating the deployment contract if the node is still down (aka the extrinsic was not yet called to set the state of the node on UP)?

muhamadazmy commented 1 year ago

Billing is related to capacity reservation, which should not exist unless a node's target power is up. If a node's "current" power never gets to the "up" state, it means something is wrong, and billing probably needs to stop.

muhamadazmy commented 1 year ago

@brandonpille I think yes, the grid should accept creation of a capacity reservation as long as the node target state is up. Normally the current state should follow within a few minutes (once the node has actually booted). Maybe billing should not be done during this time?

DylanVerstraete commented 1 year ago

@muhamadazmy how are these segments defined?

brandonpille commented 1 year ago

> Billing is related to capacity reservation, which should not exist unless a node's target power is up. If a node's "current" power never gets to the "up" state, it means something is wrong, and billing probably needs to stop.

So we only start billing if the power is set to UP?

DylanVerstraete commented 1 year ago

Billing will only trigger 1 hour after creation so it doesn't matter, if the node is still down by then, something is wrong.

brandonpille commented 1 year ago

booted

It would only be fair in my opinion to the user to only start billing when the node is actually UP. One more question. What do we do if the trigger UP event never comes? Do we add a timeout on it?

brandonpille commented 1 year ago

> @brandonpille I think yes, the grid should accept creation of a capacity reservation as long as the node target state is up. Normally the current state should follow within a few minutes (once the node has actually booted). Maybe billing should not be done during this time?

I was talking about the deployment contract. Do we accept it whenever the capacity reservation is created, no matter the state of the node or when the node got UP?

muhamadazmy commented 1 year ago

I think yes. until there is a good reason not to.

despiegk commented 1 year ago

> > Only nodes that support TPM will be able to be powered off.
>
> What's the thinking behind this requirement @delandtj?

This was a mistake, TPM has nothing to do with WOL I believe

Nelson361 commented 1 year ago

Those of us with a home datacenter will not be able to tolerate servers randomly starting up throughout the night. Nothing is louder than a server during startup. Please do checkups ONLY during daylight hours. Obviously startups can occur for deployments at any time, that is ok.

xmonader commented 1 year ago

The code on chain should be finished by 23-11. We need a couple more days on zos to integrate it, and will start the client updates as soon as possible.

xmonader commented 1 year ago

falling behind: requires more reworking https://github.com/threefoldtech/tfchain/issues/536

deadline will be updated after the engineering call today

scottyeager commented 1 year ago

Do we have an updated timeline, @xmonader?

xmonader commented 1 year ago

> Do we have an updated timeline, @xmonader?

We are aiming for the chain deployment on devnet to happen next Tuesday. Most of the clients are almost code complete, but they need to be tested against real environments.

despiegk commented 1 year ago

Close all linked issues; we need a new power mgmt story.

threefoldtech / home

Power management (NEEDS TO BE DELETED) #1303
