The networking ranges and network location of the Hegel metadata service are user-configurable. In addition, network and hardware isolation is not guaranteed in Tinkerbell environments. It is therefore not possible to know, or to recommend, a single Hegel address that works in all environments, whether public, private, or link-local.
The user must be able to define the addresses for the Hegel service. How is this address configured today? Where is that information stored? How does this address make its way into templates and workflows?
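For context, the sandbox walkthrough later in this issue simply hard-codes the provisioner address (192.168.1.1) into the template's metadata_urls. A minimal sketch of where that address lives in a sandbox deployment, assuming the TINKERBELL_HOST_IP variable in the generated .env is still the source of truth:
grep TINKERBELL_HOST_IP /vagrant/.env
# Hegel answers on that address at port 50061 in the walkthrough below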
Tinkerbell does not provide DNS services. Can mDNS be used in Tinkerbell environments? Could Hegel then be addressed with a well-known name (configurable per cluster), such as "metadata.local"? What benefits would this provide, what are the limitations and criteria for this to be feasible?
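If mDNS proved workable, one way to prototype the idea on the provisioner would be publishing an address record with avahi (the metadata.local name and the use of avahi-utils are assumptions for illustration, not an existing Tinkerbell feature):
avahi-publish -a metadata.local 169.254.169.254   # on the provisioner
avahi-resolve -n metadata.local                   # on a worker with mDNS resolution enabled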
Ignition (Afterburn) is currently Packet aware. Could this work be extended to support Tinkerbell? What are the key differences in the spec or access methods (and location)?
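For reference, Ignition/Afterburn select their provider from a kernel argument rather than by probing the network, so as I understand it, Tinkerbell support would largely mean teaching Afterburn a new platform ID plus the Hegel metadata layout. A hedged sketch (the tinkerbell value is hypothetical):
# today, on Equinix Metal (Packet) images:
ignition.platform.id=packet
# a hypothetical Tinkerbell variant, if Afterburn gained support:
ignition.platform.id=tinkerbell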
Two of the main benefits of cloud-init are network configuration and userdata retrieval.
Userdata would need to be obtained through the metadata service.
Does Tinkerbell benefit from cloud-init for network discovery purposes? DHCP is currently provided, but DHCP has the limitation of a single address per interface. Do Tinkerbell and Hegel currently provide the means to define network information more granularly than that, such that network information from the metadata service would be beneficial?
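To make the question concrete, this is roughly the per-interface detail a metadata-driven network config could express that DHCP alone cannot; a sketch in cloud-init network config (v2) form with invented values:
version: 2
ethernets:
  eth0:
    match:
      macaddress: "08:00:27:00:00:01"
    addresses:
      - 192.168.1.5/29
      - 2001:db8::5/64    # a second address on the same interface, which DHCP cannot express
    gateway4: 192.168.1.1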
Cloud-init benefits from ds-identify detection of the environment through local means. This is typically done through DMI (dmidecode). For a given environment, well-known DMI fields are populated with platform-identifiable values.
For example:
System Information
Manufacturer: Packet
Product Name: c3.small.x86
Version: R1.00
Serial Number: D5S0R8000047
UUID: 00000000-0000-0000-0000-d05099f0314c
Wake-up Type: Power Switch
SKU Number: To Be Filled By O.E.M.
Family: To Be Filled By O.E.M.
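For reference, cloud-init's ds-identify reads these fields through the kernel's DMI export rather than by running dmidecode; checking them on a running machine looks like this:
cat /sys/class/dmi/id/sys_vendor       # "Packet" in the example above
cat /sys/class/dmi/id/product_serial
sudo dmidecode -s system-manufacturer  # equivalent queries via dmidecode
sudo dmidecode -s system-serial-number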
Can or should Tinkerbell express the opinion that DMI should be updated on each device? When would this happen in the enrolling or workflow process? What values would be used? Can a user opt-out of this? Is it technically possible to support this across unknown hardware (using common software)?
Is it possible to use the network at Layer2 for platform detection or to report the metadata address, through LLDP, perhaps? (@invidian)
Barring network and local hardware modifications, are we left with only kernel command-line arguments for identification (ds=tinkerbell, for example)?
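If kernel arguments do end up being the mechanism, the booted OS can at least confirm what ds-identify would see (ds=tinkerbell remains hypothetical until such a datasource exists):
grep -o 'ds=[^ ]*' /proc/cmdline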
I've been able to get things working at a basic level by using sandbox/vagrant/libvirt, adding the link-local address 169.254.169.254/16 to the provisioner host, configuring user-data in the host definition, and injecting a datasource configuration into the host image using a workflow.
vagrant up provisioner --no-destroy-on-error
vagrant ssh provisioner
# workaround for https://github.com/tinkerbell/sandbox/issues/62
sudo curl -L "https://github.com/docker/compose/releases/download/1.26.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
exit
vagrant provision provisioner
pushd $IMAGE_BUILDER_DIR/images/capi # from https://github.com/kubernetes-sigs/image-builder/pull/547
make build-raw-all
cp output/ubuntu-1804-kube-v1.18.15.gz $SANDBOX_DIR/deploy/state/webroot/
popd
vagrant ssh provisioner
cd /vagrant && source .env && cd deploy
docker-compose up -d
# TODO: add 169.254.169.254 link-local address to provisioner machine
# TODO: figure out how we can incorporate this into sandbox
# TODO: will this cause issues in EM deployments?
# edit /etc/netplan/eth1.yaml
# add 169.254.169.254/16 to the addresses
# netplan apply
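# sketch of the netplan edit described above; the file name and the 192.168.1.1/24 entry are
# assumptions based on the sandbox defaults, the only actual change is adding 169.254.169.254/16:
#   network:
#     version: 2
#     ethernets:
#       eth1:
#         addresses:
#           - 192.168.1.1/24
#           - 169.254.169.254/16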
# setup hook as a replacement for OSIE (https://github.com/tinkerbell/hook#the-manual-way)
pushd /vagrant/deploy/state/webroot/misc/osie
mv current current-bak
mkdir current
wget http://s.gianarb.it/tinkie/tinkie-master.tar.gz
tar xzv -C ./current -f tinkie-master.tar.gz
popd
# TODO: follow up on not needing to pull/tag/push images to internal registry for actions
# TODO: requires changes to tink-worker to avoid internal registry use
docker pull quay.io/tinkerbell-actions/image2disk:v1.0.0
docker tag quay.io/tinkerbell-actions/image2disk:v1.0.0 192.168.1.1/image2disk:v1.0.0
docker push 192.168.1.1/image2disk:v1.0.0
docker pull quay.io/tinkerbell-actions/writefile:v1.0.0
docker tag quay.io/tinkerbell-actions/writefile:v1.0.0 192.168.1.1/writefile:v1.0.0
docker push 192.168.1.1/writefile:v1.0.0
docker pull quay.io/tinkerbell-actions/kexec:v1.0.0
docker tag quay.io/tinkerbell-actions/kexec:v1.0.0 192.168.1.1/kexec:v1.0.0
docker push 192.168.1.1/kexec:v1.0.0
# TODO: investigate hegel metadata not returning proper values for 2009-04-04/meta-data/{public,local}-ipv{4,6}, currently trying to return values from hw.metadata.instance.network.addresses[] instead of hw.network.interfaces[]
# TODO: should hegel (or tink) automatically populate fields from root sources, for example metadata.instance.id from id
# public/local ip addresses from network.addresses, etc?
# TODO: automatic hardware detection to avoid needing to manually populate metadata.instance.storage.disks[].device
cat > hardware-data-worker-1.json <<EOF
{
  "id": "ce2e62ed-826f-4485-a39f-a82bb74338e2",
  "metadata": {
    "facility": {
      "facility_code": "onprem"
    },
    "userdata": "#cloud-config\nssh_authorized_keys:\n- ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCZaw/MNLTa1M93IbrpklSqm/AreHmLSauFvGJ1Q5OV5/pfyeusNoxDaOQlk3BzG3InmhWX4tk73GOBHO36ugpeorGg/fC4m+5rL42z2BND1o98Borb6x2pAGF11IcEM9m7c8k0gg9lP2OR4mDAq2BFrmJq8h77zk9LtpWEvFJfASx9iqv0s7uHdWjc3ERQ/fcgl8Lor/GYzSbvATO6StrwrLs/HusA5k9vDKyEGfGbxADMmxnnzaukqhuk8+SXf+Ni4kKReGkqjFI8uUeOLU/4sG5X5afTlW6+7KPZUhLSkZh6/bVY8m5B9AsV8M6yHEan48+258Q78lsu8lWhoscUYV49nyA61RveiBUExZYhi45jI3LUmGX3hHpVwfRMfgh0RjtrkCX8I6eSLCUX//Xu4WKkVMgQur2TLT+Nmpf4dwJgDX72nQmgbu/CHC4u2Y5FTWnHpeNLicOWecsHXxqs8U1K7rWguOfCiD/qtRhqp5Sz3m37/h/aGjGqvsa/DIc= detiber@loggerhead.local.detiberus.net",
    "instance": {
      "id": "ce2e62ed-826f-4485-a39f-a82bb74338e2",
      "hostname": "test-instance",
      "storage": {
        "disks": [{"device": "/dev/vda"}]
      }
    },
    "state": ""
  },
  "network": {
    "interfaces": [
      {
        "dhcp": {
          "arch": "x86_64",
          "ip": {
            "address": "192.168.1.5",
            "gateway": "192.168.1.1",
            "netmask": "255.255.255.248"
          },
          "mac": "08:00:27:00:00:01",
          "uefi": false
        },
        "netboot": {
          "allow_pxe": true,
          "allow_workflow": true
        }
      }
    ]
  }
}
EOF
docker exec -i deploy_tink-cli_1 tink hardware push < ./hardware-data-worker-1.json
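# optional sanity checks; the tink subcommand and Hegel paths below are my assumptions based on the 0.x tooling
docker exec -i deploy_tink-cli_1 tink hardware mac 08:00:27:00:00:01
curl http://192.168.1.1:50061/2009-04-04/meta-data/instance-id
curl http://192.168.1.1:50061/2009-04-04/user-data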
cat > capi-stream-template.yml <<EOF
version: "0.1"
name: capi_provisioning
global_timeout: 6000
tasks:
  - name: "os-installation"
    worker: "{{.device_1}}"
    volumes:
      - /dev:/dev
      - /dev/console:/dev/console
      - /lib/firmware:/lib/firmware:ro
    environment:
      MIRROR_HOST: 192.168.1.1
    actions:
      - name: "stream-image"
        image: image2disk:v1.0.0
        timeout: 90
        environment:
          IMG_URL: http://192.168.1.1:8080/ubuntu-1804-kube-v1.18.15.gz
          DEST_DISK: /dev/vda
          COMPRESSED: true
      - name: "add-tink-cloud-init-config"
        image: writefile:v1.0.0
        timeout: 90
        environment:
          DEST_DISK: /dev/vda1
          FS_TYPE: ext4
          DEST_PATH: /etc/cloud/cloud.cfg.d/10_tinkerbell.cfg
          UID: 0
          GID: 0
          MODE: 0600
          DIRMODE: 0700
          CONTENTS: |
            datasource:
              Ec2:
                metadata_urls: ["http://192.168.1.1:50061", "http://169.254.169.254:50061"]
            system_info:
              default_user:
                name: tink
                groups: [wheel, adm]
                sudo: ["ALL=(ALL) NOPASSWD:ALL"]
                shell: /bin/bash
      - name: "kexec-image"
        image: kexec:v1.0.0
        timeout: 90
        pid: host
        environment:
          BLOCK_DEVICE: /dev/vda1
          FS_TYPE: ext4
EOF
docker exec -i deploy_tink-cli_1 tink template create < ./capi-stream-template.yml
docker exec -i deploy_tink-cli_1 tink workflow create -t <TEMPLATE ID> -r '{"device_1":"08:00:27:00:00:01"}'
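# the id printed by "tink template create" is what replaces <TEMPLATE ID> above; once the worker
# boots, progress can be followed with the workflow id (subcommand names as I recall them from the
# 0.x tink CLI, so treat these as assumptions):
docker exec -i deploy_tink-cli_1 tink workflow events <WORKFLOW ID>
docker exec -i deploy_tink-cli_1 tink workflow state <WORKFLOW ID>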
That's excellent, @detiber!
Do you suppose we can close this issue given this success or are we dependent on unreleased features? Are there any additional or supporting features to investigate? Do we need examples of other OSes taking advantage of this? (Ignition, Kickstart, other)? Should we include these steps in Tinkerbell documentation?
I definitely think we need to add some documentation, quite a bit of it isn't quite intuitive, such as:
@detiber Is the link-local address really needed? Shouldn't cloud-init just pull the metadata from 192.168.1.1:50061, because that's the IP listed in metadata_urls?
@cursedclock What address should the device use to access the metadata and how will that address be determined?
Link-local solves this problem with self-assigned addresses. It also suggests that the metadata service should use a well-known address like 169.254.169.254, which cloud-init uses as the default for various ds= values. hegel provides basic ds=ec2 compatibility (2009-04-04), and using this address avoids the need for additional kernel command line arguments.
On the other hand, if we have to manipulate kernel command line arguments, we can likely provide the IP address in the same way.
This becomes more advantageous with direct Tinkerbell support in cloud-init.
@displague I see, that means there would be no need to add an action that modifies the contents of /etc/cloud/cloud.cfg.d, right? Since the worker machine is expected to use the "default" address for pulling configuration metadata.
Kernel args ds=ec2;metadata_urls=http://ip:port should work too. This is for cloud-init; kickstart/ignition take different arguments.
@nshalman has made a PoC to do cloud-init based installs. Can you comment if this issue can be closed?
Alas, that was a hack using a nocloud partition on disk. And the code that we used is not currently upstream. Issue is still valid and open.
Amazon is using Hegel to provision with cloud-init and it's seemingly working. What makes us think this isn't working?
I'm verifying a few bits with cloud-init manually, so once I have that data I'll include it here.
Some further investigation in https://github.com/tinkerbell/hegel/issues/61#issuecomment-1120483426 found disparities that need fixing.
Perhaps this issue can be closed in favor of discussion over there about redesign?
We can close this. We can open another issue if there is more interest in introducing cloud-init support for ds=tinkerbell (or hegel) as a unique flavor of metadata distinct from the EC2 flavor.
This issue is an open investigation into what features Hegel (and perhaps other components of tinkerbell) would need in order to support cloud-init.
Cloud-init is a provisioning service that runs within many distributions, including Debian and Ubuntu. Cloud-init has awareness of various cloud providers (or provisioning environments, like OpenStack).
Cloud-init:
For raw disk support, with images composed of partitions (GPT, DOS, etc.), LVM, or unknown or encrypted filesystems, the current approach of stamping a Docker-image-based filesystem with a file is not sufficient. These raw disks must be pristine and trusted, and cannot be manipulated externally (by Tinkerbell) without disturbing that trust.
Tinkerbell-provisioned nodes should be able to rely on pre-installed software, such as Cloud-init or Ignition (Afterburn), and kernel arguments to access the metadata service provided by Hegel.
What changes to Hegel are required to provide this? What non-Tinkerbell / external changes are needed?
After some initial input and consideration, this issue should either be closed as a non-goal or result in one or more Tinkerbell proposals to address any limitations; external cloud-init issues should also be raised.