farmerbot for power management

despiegk commented 1 year ago

features version 1

farmers can define their configuration in an easy markdown format
through WOL machines can be turned on and off
the provisioning scripts need to be adjusted to wake a node when required
the farming manager checks which nodes can be turned off

technical requirements

use vlang and the rmb components built into crystallib
use actor based paradigm (so it's very easy to extend)
all configuration done as in https://github.com/threefoldtech/farmerbot/tree/development/example_data
support routeros switches to check if node is alive or down

generic requirements phase 2

defensive mode: we check that WOL works using multiple interfaces, e.g. network switch, ping, ...
- when doing WOL on/off, talk to a supported switch to see if the network link is active, and also ping to see whether the node comes back up or is down
it's easy for farmers with scripting experience to expand the tool to support other PDUs (power management devices)
we support racktivity devices for power measurement and power on/off

DylanVerstraete commented 1 year ago

the provisioning scripts need to be adjusted to wake a node when required

I don't think this is the right approach for this project. The deployment tools should contact the farmerbot over the relay and the farmerbot should reply with a specific nodeID from its farm. This way the bot can do internal capacity planning, and a user cannot brute-force all nodes in a farm into waking up.

It also prevents waiting times for users who want to deploy on a farm which has a farmerbot.

DylanVerstraete commented 1 year ago

What are "routeros switches"? The farmerbot can simply keep in state which nodes are up and down.

When the farmerbot starts, it needs to check which nodes can go down and which nodes need to stay up. There are 2 ways we can approach this:

First we need to know which nodes are attached to the farm; this can be done by querying graphql or gridproxy.

Then

Query every node in the farm via the relay and check if it hosts deployments; if not, send a message to shut it down
Query the chain for active contracts on a node (this is not supported in V), but maybe the gridproxy can be queried instead.
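
A minimal sketch of the first approach in V, assuming the farmerbot has already collected one record per node. `NodeInfo`, its fields, and `nodes_to_shut_down` are hypothetical names; in practice the data would come from the relay or the gridproxy rather than from the literal list used here.

```v
module main

// NodeInfo is a hypothetical record of what the farmerbot learned about a node.
struct NodeInfo {
	node_id     u32
	online      bool
	deployments int // number of deployments the node reported hosting
}

// nodes_to_shut_down returns the ids of online nodes that host no deployments.
fn nodes_to_shut_down(nodes []NodeInfo) []u32 {
	mut ids := []u32{}
	for node in nodes {
		if node.online && node.deployments == 0 {
			ids << node.node_id
		}
	}
	return ids
}

fn main() {
	nodes := [
		NodeInfo{node_id: 1, online: true, deployments: 2},
		NodeInfo{node_id: 2, online: true, deployments: 0},
		NodeInfo{node_id: 3, online: false, deployments: 0},
	]
	println(nodes_to_shut_down(nodes))
}
```

In this example only node 2 qualifies: node 1 hosts deployments and node 3 is already offline.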

brandonpille commented 1 year ago

From a call with Kristof this morning: The farmerbot should use actor based paradigm. This means there should be actors who can execute tasks (or jobs) on behalf of other actors. The communication between actors should happen through messages. Doing that allows us to implement a specific actor in any language we want. It also minimizes the work we have to do in case we change the implementation of a piece of code.

Coming back to the farmerbot. The farmerbot will be an actor communicating with other actors. It will receive messages from the JavaScript client asking for a node on which the deployment can happen. So the task to execute will be "Find me a node that can fit these resources". When executing that task it should power on or shutdown nodes if needed.
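
To illustrate the idea of actors that execute tasks in response to messages, here is a rough V sketch. The `Message` envelope, the `Actor` interface, and the `NodeManager` actor are all hypothetical names chosen for this example, and the payload is left as an opaque string; the real message structure still has to be defined.

```v
module main

// Message is a hypothetical envelope exchanged between actors.
struct Message {
	action  string // name of the task, e.g. 'findnode'
	payload string // task arguments, left opaque in this sketch
}

// Actor is anything that can execute a task described by a message.
interface Actor {
	handle(msg Message) !string
}

// NodeManager is a hypothetical actor that only understands 'findnode'.
struct NodeManager {}

fn (n NodeManager) handle(msg Message) !string {
	if msg.action == 'findnode' {
		return 'node lookup requested with: ${msg.payload}'
	}
	return error('unknown action: ${msg.action}')
}

fn main() {
	actor := Actor(NodeManager{})
	reply := actor.handle(Message{action: 'findnode', payload: 'cru=2'}) or {
		println('failed: ${err}')
		return
	}
	println(reply)
}
```

Because callers only depend on the message shape, the actor behind the interface could be replaced, or reimplemented in another language behind a transport, without changing the sender.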

Here are the first steps I will start working on:

Define the structure of the messages that the JavaScript client will send (and also receive)
Look into WOL from V (@muhamadazmy might know more)
Implement key-value store in NATS
Check the state of the nodes and shut down the nodes when possible

Things that are clear:

The JavaScript client should use the farmerbot if there is one alive (@xmonader how are we going to do that?)
The farmerbot will get requests to find space on a node, to which it will send back a response
The farmerbot will allow the owner to bring nodes up when he wants to (comment in issue 1371)
The farmerbot is optional: we should not assume its existence except for asking it to find a node where to deploy on
We can fetch the information about the nodes every 5 minutes and keep that information in memory for now or save it in a key value store in NATS

Things that are not clear yet (feel free to add your thoughts):

How do we get information about the nodes (same as @DylanVerstraete asked in the question above)
What if the farmerbot shuts down some nodes and then crashes? Do we expect the farmer to bring them back up as fast as possible? How do we handle this?
What are routeros switches? Any existing code yet regarding it?
How do we make sure that only the farmer can bring nodes up? By executing that task only if it was submitted to the farmerbot through commandline/redis-queue itself?

@despiegk please tell me if I misunderstood something?

LeeSmet commented 1 year ago

One immediate problem is that for WOL, you need to have access to the host network. AFAIK, the farmerbot will be deployed in a container (actually a VM), and should thus be isolated from this network. Hence the farmerbot itself can't send the WOL packet.

To work around this, the code to send an actual WOL packet will need to be part of zos, and the farmerbot will then need to instruct a live zos node (can be any node in the farm, possibly the one it is running on, although the farmerbot could run on a node on a different farm or even outside of the grid) to send the packet to the target node. Note that the mac address of the target (and its internal IP) is registered on the chain (actually this is the zos mac/IP, but since that is a bridge with the physical NIC attached it will be the same (we set it the same)).
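
For reference, the payload of a WOL "magic packet" is 6 bytes of 0xff followed by the target MAC address repeated 16 times (102 bytes in total), typically sent as a UDP broadcast. The V sketch below only builds that payload to show the layout; `magic_packet` is a hypothetical helper, sending is left out, and as noted above the real sender would live in zos rather than in the farmerbot.

```v
module main

// magic_packet builds the Wake-on-LAN payload for a 6-byte MAC address:
// 6 bytes of 0xff followed by the MAC repeated 16 times.
fn magic_packet(mac []u8) ![]u8 {
	if mac.len != 6 {
		return error('expected a 6-byte MAC address, got ${mac.len} bytes')
	}
	mut packet := []u8{len: 6, init: 0xff}
	for _ in 0 .. 16 {
		packet << mac
	}
	return packet
}

fn main() {
	// Example MAC address, made up for illustration.
	mac := [u8(0x00), 0x11, 0x22, 0x33, 0x44, 0x55]
	packet := magic_packet(mac) or {
		println('invalid MAC: ${err}')
		return
	}
	println(packet.len)
}
```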

LeeSmet commented 1 year ago

How do we make sure that only the farmer can bring nodes up? By executing that task only if it was submitted to the farmerbot through commandline/redis-queue itself?

In the past we assumed the user has the farmer twin, which basically boils down to "the user has the seed". This is still the case (since the seed is injected). So for now I'd say we verify that messages to the farmerbot are sent by its own identity (sender.twin_id == self.twin_id). In the future we can look into expanding the system by adding some type of access lists.
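
The check itself is small. A sketch in V, where `Envelope`, `FarmerBot`, and `is_authorized` are hypothetical names and the envelope is a stripped-down stand-in for whatever the incoming message actually looks like:

```v
module main

// Envelope is a hypothetical, stripped-down view of an incoming message.
struct Envelope {
	sender_twin_id u32
	command        string
}

// FarmerBot holds the twin id the bot itself runs as.
struct FarmerBot {
	twin_id u32
}

// is_authorized accepts a message only when it was sent by the bot's own twin.
fn (b FarmerBot) is_authorized(msg Envelope) bool {
	return msg.sender_twin_id == b.twin_id
}

fn main() {
	bot := FarmerBot{twin_id: 42}
	println(bot.is_authorized(Envelope{sender_twin_id: 42, command: 'poweron'}))
	println(bot.is_authorized(Envelope{sender_twin_id: 7, command: 'poweron'}))
}
```

An access list could later replace the equality test with a membership check without changing the callers.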

DylanVerstraete commented 1 year ago

I would suggest that the farmer runs the farmerbot on a specific node in its farm. This node becomes a "delegator" in the network: every message to turn a node off/on can be sent to this node, and this node passes it to the target node.

This way we also ensure that the "delegator" node will never be shut down, because the farmerbot is running on it and uses capacity, hence a shutdown will never be triggered for it.

brandonpille commented 1 year ago

Proposed strategy for the farmerbot

Shutting down nodes

The farmer will have to configure his nodes in a markdown file (telling the farmerbot how to connect to them, etc)
The farmerbot should contact ZOS nodes through RMB (ask for used resources etc)
Based on the info it will shut down nodes if needed. It will change the powertarget of the node on tfchain, which will emit an event that the node will catch and act upon.
Maybe keep in memory that the node is shutting down

TSClient jobs

When the TSClient wants to deploy on a node it will create a job configuration and send it to the farmerbot
The TSClient will send the job to the farmerbot via RMB

Handling jobs (aka bringing nodes up)

The jobs will tell the farmerbot that a client wants to deploy a job on a node
It will bring up the node if it is offline: it will change the powertarget of the node on tfchain which will emit an event that the other nodes in the farm will catch. They will then send the WOL packet to that node.
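
One way to keep track of nodes that are in the middle of shutting down or waking up is a small in-memory state per node. The V sketch below is only an illustration under that assumption: `PowerState`, `Node`, and the two `request_*` methods are hypothetical, and the actual change of the powertarget on tfchain is not shown.

```v
module main

// PowerState is a hypothetical in-memory view of a node's power status.
enum PowerState {
	online
	shutting_down
	offline
	waking_up
}

struct Node {
	node_id u32
mut:
	state PowerState
}

// request_power_off marks an online node as shutting down.
// Changing the powertarget on tfchain would happen here; it is omitted.
fn (mut n Node) request_power_off() bool {
	if n.state != .online {
		return false
	}
	n.state = .shutting_down
	return true
}

// request_power_on marks an offline node as waking up (same omission applies).
fn (mut n Node) request_power_on() bool {
	if n.state != .offline {
		return false
	}
	n.state = .waking_up
	return true
}

fn main() {
	mut node := Node{node_id: 5, state: .offline}
	println(node.request_power_on())
	println(node.state)
}
```

Rejecting a request while a node is in a transitional state avoids sending a second power command before the first one has taken effect.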

brandonpille commented 1 year ago

The farmerbot will be an actor that can execute jobs. Those jobs will be sent by the TSClient. There will be only one job at the beginning: finding a node to deploy on. All jobs have the format shown below (defined in baobab: https://github.com/freeflowuniverse/baobab/blob/main/baobab/jobs/model_json.v) and will be converted to json when sent via RMB.

pub struct ActionJobPublic {
pub mut:
    guid         string     
    twinid       u32        //twinid of the farmerbot
    action       string     //farmerbot.*
    args         params.Params
    result       params.Params
    state        string
    start        i64        //epoch
    end          i64        //epoch
    grace_period u32        //wait till next run, in seconds
    error        string     //string description of what went wrong
    timeout      u32        //time in seconds, 2h is maximum
    src_twinid   u32        //which twin was sending the job, 0 if local
    src_action   string     //unique actor path, runs on top of twin
    dependencies []string   //list of guids we need to wait on
}

The action attribute should have the format "domain.actor.method" and for our action it will be "farmerbot.nodemanager.findnode". The args attribute should be:

pub struct FindNode {
    // if provided we want a node that has at least these resources free (optional)
    required_hru u64 = 0
    required_sru u64 = 0
    required_cru u64 = 0
    required_mru u64 = 0
    // if not empty we want nodes that are not in this group (optional)
    node_exclude []u32 = []
    // if true we want a full node (optional)
    dedicated bool = false
    // give us a node with publicip (optional)
    public_config bool = false
    // give us a node and reserve that many public ips from the farm
    public_ips u32 = 0
    // give us a certified node (optional)
    certified bool = false
}
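
As an illustration of how such a request could be matched against the nodes of a farm, here is a hedged V sketch. It repeats only some of the `FindNode` fields so that it is self-contained; `FreeNode` and `find_node` are hypothetical names, and powering on offline nodes, dedicated nodes, and public IPs are left out.

```v
module main

// FindNode repeats only the request fields that this sketch uses.
struct FindNode {
	required_hru u64
	required_sru u64
	required_cru u64
	required_mru u64
	node_exclude []u32
	certified    bool
}

// FreeNode is a hypothetical record of a node's free resources.
struct FreeNode {
	node_id   u32
	free_hru  u64
	free_sru  u64
	free_cru  u64
	free_mru  u64
	certified bool
}

// find_node returns the id of the first node that satisfies the request.
fn find_node(req FindNode, nodes []FreeNode) !u32 {
	for node in nodes {
		if node.node_id in req.node_exclude {
			continue
		}
		if req.certified && !node.certified {
			continue
		}
		if node.free_hru >= req.required_hru && node.free_sru >= req.required_sru
			&& node.free_cru >= req.required_cru && node.free_mru >= req.required_mru {
			return node.node_id
		}
	}
	return error('no node fits the requested resources')
}

fn main() {
	nodes := [
		FreeNode{node_id: 1, free_cru: 2, free_mru: 4},
		FreeNode{node_id: 2, free_cru: 8, free_mru: 16, certified: true},
	]
	req := FindNode{required_cru: 4, required_mru: 8, node_exclude: [u32(1)]}
	node_id := find_node(req, nodes) or {
		println(err)
		return
	}
	println(node_id)
}
```

Here node 1 is excluded by the request, so node 2 is selected. Taking the first match keeps the sketch short; a real implementation would likely prefer nodes that are already powered on.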

A-Harby commented 1 year ago

Verified. Using the instructions provided at https://www2.manual.grid.tf/farmerbot/farmerbot.html, a farmer can easily add the farmerbot to his farm and configure it as he likes, and the farm can switch nodes on/off when needed, depending on the threshold.
