Closed despiegk closed 1 year ago
the provisioning scripts need to be adjusted to wake a node when required
I don't think this is the right approach for this project. The deployment tools should contact the farmerbot over the relay, and the farmerbot should reply with a specific nodeID from its farm. This way the bot can do internal capacity planning, and a user cannot brute-force all nodes in a farm to wake up.
It also prevents waiting times for users who want to deploy on a farm that has a farmerbot.
What are "routeros switches"? The farmerbot can simply keep track, in its own state, of which nodes are up and which are down.
When the farmerbot starts, it needs to check which nodes can go down and which nodes need to stay up. There are 2 ways we can approach this:
First we need to know which nodes are attached to the farm. This can be done by querying graphql or gridproxy.
Then
From a call with Kristof this morning: The farmerbot should use an actor-based paradigm. This means there should be actors who can execute tasks (or jobs) on behalf of other actors. The communication between actors should happen through messages. Doing that allows us to implement a specific actor in any language we want. It also minimizes the work we have to do in case we change the implementation of a piece of code. Coming back to the farmerbot: it will be an actor communicating with other actors. It will receive messages from the JavaScript client asking for a node on which the deployment can happen. So the task to execute will be "Find me a node that can fit these resources". When executing that task it should power on or shut down nodes if needed.
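To make the actor idea a bit more concrete, here is a minimal Python sketch of an actor with a mailbox that dispatches incoming messages to handlers by action name. It is only an illustration of the paradigm, not the farmerbot or baobab implementation; the `Message` and `Actor` classes and the shape of the reply are assumptions made up for this example.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Message:
    # Hypothetical message shape: an action name plus free-form arguments.
    action: str
    args: dict = field(default_factory=dict)


class Actor:
    """Toy actor: other actors talk to it only by putting messages in its mailbox."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self.handlers = {}

    def register(self, action, handler):
        # Map an action name to the function that executes it.
        self.handlers[action] = handler

    def send(self, message):
        # Senders never call handlers directly; they only enqueue messages.
        self.mailbox.put(message)

    def process_one(self):
        # Take one message from the mailbox and run the matching handler.
        message = self.mailbox.get_nowait()
        handler = self.handlers.get(message.action)
        if handler is None:
            return {"state": "error", "error": f"unknown action {message.action}"}
        return {"state": "done", "result": handler(message.args)}
```

Because the sender only needs to know the message format, the handler behind an action can be rewritten (or moved to another language) without touching the clients.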
Here are the first steps I will start working on:
Things that are clear:
Things that are not clear yet (feel free to add your thoughts):
@despiegk please tell me if I misunderstood something?
One immediate problem is that for WOL, you need to have access to the host network. AFAIK, the farmerbot will be deployed in a container (actually a VM), and should thus be isolated from this network. Hence the farmerbot itself can't send the WOL packet.
To work around this, the code to send an actual WOL packet will need to be part of zos, and the farmerbot will then need to instruct a live zos node to send the packet to the target node. This can be any node in the farm, possibly the one it is running on, although the farmerbot could also run on a node in a different farm or even outside of the grid. Note that the MAC address of the target (and its internal IP) is registered on the chain. Strictly speaking this is the zos MAC/IP, but since zos is a bridge with the physical NIC attached, and we set both to the same value, it is the same address.
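For reference, this is roughly what "sending a WOL packet" involves. A magic packet is 6 bytes of 0xFF followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast on the local network, which is why the sender needs access to the host network. The sketch below is in Python for illustration only and is not the zos implementation; the function names, the broadcast address and the port (9 is a common choice) are assumptions.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"invalid MAC address: {mac}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # 6 x 0xFF as the synchronisation stream, then the MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (assumed address and port)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

The packet only reaches the target if the sender sits on the same layer-2 network, so a live node in the farm has to send it on the farmerbot's behalf.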
How do we make sure that only the farmer can bring nodes up? By executing that task only if it was submitted to the farmerbot through commandline/redis-queue itself?
In the past we assumed the user has the farmer twin, which basically boils down to "the user has the seed". This is still the case (since the seed is injected). So for now I'd say we verify that messages to the farmerbot are sent by its own identity (sender.twin_id == self.twin_id). In the future we can look into expanding the system by adding some type of access list.
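A minimal sketch of that check, in Python for illustration. The `PowerRequest` type, its field names and the reply format are hypothetical; the only rule taken from the discussion above is that a power request is executed only when the sender's twin ID equals the farmerbot's own twin ID.

```python
from dataclasses import dataclass


@dataclass
class PowerRequest:
    # Hypothetical request shape, for illustration only.
    sender_twin_id: int
    node_id: int
    power_on: bool


def is_authorized(sender_twin_id: int, own_twin_id: int) -> bool:
    # Only messages signed by the farmer's own twin are accepted for now.
    return sender_twin_id == own_twin_id


def handle_power_request(request: PowerRequest, own_twin_id: int) -> dict:
    if not is_authorized(request.sender_twin_id, own_twin_id):
        return {"state": "error", "error": "sender is not the farmer twin"}
    # The actual power change is out of scope for this sketch.
    target_state = "on" if request.power_on else "off"
    return {"state": "done", "node_id": request.node_id, "target": target_state}
```

An access list could later replace the equality check inside `is_authorized` without changing the callers.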
I would suggest that the farmer runs the farmerbot on a specific node in its farm. This node becomes a "delegator" in the network: every message to turn a node off or on can be sent to this node, which passes it on to the target node.
This way we also ensure that the "delegator" node will never be shut down: the farmerbot runs there and uses capacity, so a shutdown will never be triggered for it.
Shutting down nodes
TSClient jobs
Handling jobs (aka bringing nodes up)
The farmerbot will be an actor that can execute jobs. Those jobs will be sent by the TSClient. There will be only one job at the beginning: finding a node to deploy on. All jobs have the format shown below (defined in baobab: https://github.com/freeflowuniverse/baobab/blob/main/baobab/jobs/model_json.v) and will be converted to JSON when sent via RMB.
pub struct ActionJobPublic {
pub mut:
    guid         string
    twinid       u32      // twinid of the farmerbot
    action       string   // farmerbot.*
    args         params.Params
    result       params.Params
    state        string
    start        i64      // epoch
    end          i64      // epoch
    grace_period u32      // wait till next run, in seconds
    error        string   // string description of what went wrong
    timeout      u32      // time in seconds, 2h is maximum
    src_twinid   u32      // which twin was sending the job, 0 if local
    src_action   string   // unique actor path, runs on top of twin
    dependencies []string // list of guids we need to wait on
}
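As an illustration of what such a job could look like once converted to JSON, here is a small Python sketch that fills in the fields of the struct above. Several things are assumptions and not taken from baobab: that `args` and `result` serialize as plain JSON objects, the initial `state` value `"init"`, the default timeout, and the helper name `new_job`. Only the field names and the 2h timeout limit come from the struct.

```python
import json
import time
import uuid

MAX_TIMEOUT_SECONDS = 2 * 60 * 60  # "2h is maximum" per the struct comment


def new_job(farmerbot_twinid: int, src_twinid: int, action: str, args: dict,
            timeout: int = 600) -> dict:
    """Build a job dict mirroring the ActionJobPublic fields (illustrative only)."""
    if not 0 < timeout <= MAX_TIMEOUT_SECONDS:
        raise ValueError("timeout must be between 1 second and 2 hours")
    return {
        "guid": str(uuid.uuid4()),
        "twinid": farmerbot_twinid,   # twin of the farmerbot that executes the job
        "action": action,
        "args": args,                 # assumed to serialize as a plain object
        "result": {},
        "state": "init",              # assumed initial state name
        "start": int(time.time()),    # epoch seconds
        "end": 0,
        "grace_period": 0,
        "error": "",
        "timeout": timeout,
        "src_twinid": src_twinid,     # 0 would mean a locally submitted job
        "src_action": "",
        "dependencies": [],
    }


def to_json(job: dict) -> str:
    return json.dumps(job)
```

A client would build the job, serialize it with `to_json` and hand the resulting string to RMB; the transport itself is not shown here.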
The action attribute should have the format "domain.actor.method" and for our action it will be "farmerbot.nodemanager.findnode". The args attribute should be:
pub struct FindNode {
    // if provided we want a node that has at least these resources free (optional)
    required_hru u64 = 0
    required_sru u64 = 0
    required_cru u64 = 0
    required_mru u64 = 0
    // if not empty we want nodes that are not in this group (optional)
    node_exclude []u32 = []
    // if true we want a full node (optional)
    dedicated bool = false
    // give us a node with publicip (optional)
    public_config bool = false
    // give us a node and reserve that many public ips from the farm
    public_ips u32 = 0
    // give us a certified node (optional)
    certified bool = false
}
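To show how these arguments could drive node selection, here is a hedged Python sketch. The `FindNode` fields mirror the struct above; everything else is an assumption for illustration: the `Node` type and its fields, treating "dedicated" as "no workloads on the node", and preferring nodes that are already powered on so that a wake-up is only needed when nothing else fits.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FindNode:
    required_hru: int = 0
    required_sru: int = 0
    required_cru: int = 0
    required_mru: int = 0
    node_exclude: List[int] = field(default_factory=list)
    dedicated: bool = False
    public_config: bool = False
    public_ips: int = 0
    certified: bool = False


@dataclass
class Node:
    # Hypothetical view of a node as the farmerbot might track it.
    id: int
    free_hru: int = 0
    free_sru: int = 0
    free_cru: int = 0
    free_mru: int = 0
    has_public_config: bool = False
    certified: bool = False
    has_workloads: bool = False
    powered_on: bool = True


def node_fits(node: Node, req: FindNode) -> bool:
    if node.id in req.node_exclude:
        return False
    if req.certified and not node.certified:
        return False
    if req.public_config and not node.has_public_config:
        return False
    if req.dedicated and node.has_workloads:  # assumption: a full node means an empty node
        return False
    return (node.free_hru >= req.required_hru and node.free_sru >= req.required_sru
            and node.free_cru >= req.required_cru and node.free_mru >= req.required_mru)


def find_node(nodes: List[Node], req: FindNode, free_public_ips: int = 0) -> Optional[Node]:
    if req.public_ips > free_public_ips:
        return None  # the farm cannot reserve that many public IPs
    candidates = [n for n in nodes if node_fits(n, req)]
    # Stable sort: powered-on nodes first, so waking a node is the fallback.
    candidates.sort(key=lambda n: not n.powered_on)
    return candidates[0] if candidates else None
```

If the returned node has `powered_on` set to false, the caller would have to wake it before replying with its ID.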
Verified. Using the instructions provided at https://www2.manual.grid.tf/farmerbot/farmerbot.html, a farmer can easily add the farmerbot to his farm and configure it as he likes, and the farm can switch nodes on/off when needed, depending on the threshold.
features version 1
technical requirements
generic requirements phase 2