threefoldtech / home

Starting point for the threefoldtech organization
https://threefold.io

tfrobot: easy mass deployer for VMs #1504

Closed. despiegk closed this issue 4 months ago.

despiegk commented 6 months ago

As a system administrator, I want to deploy a large number of virtual machines across different node groups with specific hardware and network configurations.

Acceptance Criteria

Node Group Creation:

The system should allow the creation of node groups with specified attributes such as the number of nodes, minimum cores, memory, SSD and HDD storage, IP availability (both IPv4 and IPv6), and region. Each node group should also be able to specify whether the machines are dedicated and certified, the minimum number of nodes in the farm, and the minimum number of nodes that must satisfy the same rules. The system should be able to handle a specified bandwidth for each node group (see the example spec below).

Virtual Machine Deployment:

VMs should be deployable within the created node groups. Each VM specification should include the number of VMs, the associated node group, CPU cores, memory, SSD capacity and mount points, HDD attachment, public IP (IPv4 and IPv6), flist (file list), root size, and the associated SSH key. There should be flexibility in VM deployment, such as specifying how many VMs to create from a single specification.

SSH Key Management:

SSH keys should be manageable, allowing the specification of a key name and public key.

Output Specification:

Upon successful deployment, the system should output a YAML file with details of the deployed node groups and VMs, including their names and public IP addresses. In case of errors, the output file should include the details of the affected node group or VM and the corresponding error message.

Parallel Processing:

The deployment process should be optimized for parallel execution, using batch calls, to ensure efficient and quick deployment across multiple node groups and VMs.
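
For illustration, here is a minimal Go sketch of the parallel fan-out only (the batching of grid calls itself is not shown); deployNodeGroup, deployVM and the types are simplified stand-ins and do not come from the actual tfrobot code:

package main

import (
	"fmt"
	"sync"
)

// Hypothetical, simplified types; the real tfrobot structures will differ.
type NodeGroup struct{ Name string }

type VMSpec struct {
	Name      string
	NodeGroup string
	Count     int
}

// Stand-ins for the real grid deployment calls.
func deployNodeGroup(g NodeGroup) error {
	fmt.Println("deploying node group", g.Name)
	return nil
}

func deployVM(v VMSpec, nr int) error {
	fmt.Printf("deploying %s_%d in %s\n", v.Name, nr, v.NodeGroup)
	return nil
}

// deployAll fans the work out over goroutines so node groups, and then their
// VMs, are deployed in parallel; errors are collected instead of aborting.
func deployAll(groups []NodeGroup, vms []VMSpec) []error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	record := func(err error) {
		if err != nil {
			mu.Lock()
			errs = append(errs, err)
			mu.Unlock()
		}
	}
	for _, g := range groups {
		wg.Add(1)
		go func(g NodeGroup) { defer wg.Done(); record(deployNodeGroup(g)) }(g)
	}
	wg.Wait() // node groups must exist before VMs are placed on them
	for _, v := range vms {
		for nr := 0; nr < v.Count; nr++ {
			wg.Add(1)
			go func(v VMSpec, nr int) { defer wg.Done(); record(deployVM(v, nr)) }(v, nr)
		}
	}
	wg.Wait()
	return errs
}

func main() {
	errs := deployAll(
		[]NodeGroup{{Name: "group_a"}},
		[]VMSpec{{Name: "mymachine", NodeGroup: "group_a", Count: 4}},
	)
	fmt.Println("errors:", errs)
}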

Error Handling:

The system should robustly handle errors, providing meaningful error messages in the output file for any issues encountered during the deployment process.

Example Scenario:

A user inputs the YAML or JSON configuration for deploying several VMs across different regions with specific hardware and network requirements. The system processes this configuration, setting up the node groups and VMs as specified. Once the deployment is complete, the system outputs a YAML file containing details of each deployed VM and node group, including their public IP addresses. In case of any errors, the system provides detailed error messages to facilitate troubleshooting. The entire process runs efficiently in parallel to minimize deployment time.

Example spec for mass deployment:

- nodegroup:
    - name: 'group_a'
      # number of nodes to be found
      nrnodes: 5
      # cores = logical cores
      nrcores_min: 10
      # GB of memory
      mem_min: 32
      ssd_min: 2000
      hdd_min: 30000
      # full machine capacity available, can it be made dedicated
      dedicated: true
      pubip4: true
      pubip6: true
      # comma-separated list of regions
      # list see: https://apps.who.int/gho/data/node.searo-metadata.UNREGION?lang=en
      region: "UN_Africa,UN_Eastern_Asia"
      certified: true
      #min nr of nodes in farm
      min_nodes_farm: 5
      # nr of nodes in the same farm to which the rules above apply
      min_nodes_apply_rules: 3
      #bandwidth which can be achieved, do we know this?
      min_bw: 100
- sshkey:
    - name: 'despiegk'
      pubkey: ''
- vms:
    - name: 'mymachine_${nr}'
      nrvm: 4
      nodegroup: 'group_a'
      #logical cores of machine
      nrcpu: 10
      # MB of memory
      mem: 2000
      #ssd capacity in GB
      ssd: 
        -   capacity: 200
            mount: '/mydata'
      # means all HDDs are passed raw to the VM
      hdd_attached: true
      pubip4: true
      pubip6: true
      flist: ...
      rootsize: 200
      sshkey: 'despiegk'
    - name: 'mydb'
      nodegroup: 'group_a'
      # all_capacity means the VM gets access to all SSDs, all memory, all
      # cores and all HDDs; for the SSDs we create dirs with no limits, one
      # per SSD, and expose them as /data/1... (one mount per SSD)
      all_capacity: true
      pubip4: true
      pubip6: true
      flist: ...
      sshkey: 'despiegk'
      rootsize: 100

The spec can be given in YAML or JSON.
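
For illustration, the node group and SSH key entries above translate one-to-one into JSON, keeping the same top-level list structure (the vms entry is omitted for brevity):

[
  {
    "nodegroup": [
      {
        "name": "group_a",
        "nrnodes": 5,
        "nrcores_min": 10,
        "mem_min": 32,
        "ssd_min": 2000,
        "hdd_min": 30000,
        "dedicated": true,
        "pubip4": true,
        "pubip6": true,
        "region": "UN_Africa,UN_Eastern_Asia",
        "certified": true,
        "min_nodes_farm": 5,
        "min_nodes_apply_rules": 3,
        "min_bw": 100
      }
    ]
  },
  {
    "sshkey": [
      { "name": "despiegk", "pubkey": "" }
    ]
  }
]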

The result is YAML or JSON and gives the following info:

- ok:
    - name: 'group_a'
      pubip4: '333.333.333.333'
      pubip6: '...'
- error:
    - name: 'group_a'
      msg: ''
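
For illustration, a minimal Go sketch of how such an output file could be produced, assuming hypothetical struct names and gopkg.in/yaml.v3 for marshalling (not necessarily what tfrobot uses); the layout is simplified to a plain mapping rather than the list form above, and the IP address and error message are placeholders:

package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// Hypothetical output types mirroring the example above; the real tfrobot
// types and field names will differ.
type OkEntry struct {
	Name   string `yaml:"name"`
	PubIP4 string `yaml:"pubip4,omitempty"`
	PubIP6 string `yaml:"pubip6,omitempty"`
}

type ErrEntry struct {
	Name string `yaml:"name"`
	Msg  string `yaml:"msg"`
}

// Result is what gets written to the output file after a run.
type Result struct {
	Ok    []OkEntry  `yaml:"ok"`
	Error []ErrEntry `yaml:"error,omitempty"`
}

func main() {
	res := Result{
		// placeholder values, purely illustrative
		Ok:    []OkEntry{{Name: "group_a", PubIP4: "203.0.113.10"}},
		Error: []ErrEntry{{Name: "group_a", Msg: "no nodes matched the filter"}},
	}
	out, err := yaml.Marshal(res)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := os.WriteFile("result.yaml", out, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(string(out))
}
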
despiegk commented 6 months ago

Maybe we need to add something for an uptime requirement on the nodes.
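
For illustration, this could be an extra node group attribute; the field name below is hypothetical:

- nodegroup:
    - name: 'group_a'
      # hypothetical attribute: minimum required node uptime in percent
      min_uptime: 95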

despiegk commented 5 months ago

We need to be able to delete what we deployed as well.
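
For illustration, a minimal Go sketch of a delete pass, assuming the deployment output also records contract IDs per node group and using a hypothetical cancelContract stand-in for the real grid call (none of this is the actual tfrobot or grid SDK API):

package main

import "fmt"

// Hypothetical record of what a deployment created, kept alongside the
// result file, so the same data can later drive a delete/cancel run.
type DeployedGroup struct {
	Name        string
	ContractIDs []uint64
}

// cancelContract stands in for the real grid cancellation call.
func cancelContract(id uint64) error {
	fmt.Println("cancelling contract", id)
	return nil
}

// cancelAll walks every recorded node group and cancels its contracts,
// collecting failures instead of stopping at the first one.
func cancelAll(groups []DeployedGroup) []error {
	var errs []error
	for _, g := range groups {
		for _, id := range g.ContractIDs {
			if err := cancelContract(id); err != nil {
				errs = append(errs, fmt.Errorf("group %s, contract %d: %w", g.Name, id, err))
			}
		}
	}
	return errs
}

func main() {
	errs := cancelAll([]DeployedGroup{{Name: "group_a", ContractIDs: []uint64{101, 102}}})
	fmt.Println("errors:", errs)
}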