This has been tried before. You need to test the case of 3 servers all at the start. There are issues with the fact being correctly propagated to the other two servers: the bring-up on all the nodes happens asynchronously, so you cannot guarantee (and when I tried to implement this it failed) that the fact will even exist for the other 2 servers to see and join with.
Maybe I do not understand how Ansible works then. Doesn't the task "Init first server node" from roles/k3s_server/tasks/main.yml run first and terminate before "Start other server if any and verify status" runs? The former task will save the token, and it will always be available for the others being set up in the latter, or at least that is how I thought it would work.
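For illustration, here is roughly the ordering I'm assuming (a simplified sketch, not the repo's actual task bodies): with Ansible's default linear strategy, every host finishes a task before the play moves on, so a fact set on the first server should be visible to the other hosts via hostvars in later tasks of the same run.

```yaml
# Simplified sketch of the assumed ordering; the real tasks in
# roles/k3s_server/tasks/main.yml do more than this.
- name: Init first server node
  ansible.builtin.set_fact:
    # 'generated_token' is a hypothetical variable for this sketch
    token: "{{ generated_token }}"
  when: inventory_hostname == groups['server'][0]

# With the default linear strategy, this task only starts after the one
# above has finished on every host, so the fact should already exist.
- name: Start other server if any and verify status
  ansible.builtin.set_fact:
    token: "{{ hostvars[groups['server'][0]]['token'] }}"
  when: inventory_hostname != groups['server'][0]
```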
Setting up three servers all at once was actually the first test I ran, although it is possible that it was a fluke that it worked.
If you got it working, that's great! I'm gonna pull down your PR and check it out sometime later today or Monday.
The CNCF requires that all commits be signed. Just follow the instructions: https://github.com/k3s-io/k3s-ansible/pull/375/checks?check_run_id=32818692853
When testing with the Vagrantfile, I see the following error:
TASK [k3s_agent : Get the token from the first server] *************************
fatal: [agent-0]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'token'\n\nThe error appears to be in '/home/derek/rancher/ansible-k3s/roles/k3s_agent/tasks/main.yml': line 38, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Get the token from the first server\n ^ here\n"}
It's possible that the Vagrant Ansible provisioner works differently than a regular ansible-playbook deployment. I'm testing with my local Pi cluster. Will update.
So, interesting results. For the 3-Pi cluster, the first time I tested with 3 servers it installed fine. Then I ran the reset playbook and tried to run site.yaml again. This time it also failed with:
TASK [k3s_server : Get the token from the first server] *******************************************************************************************************************************
fatal: [192.168.1.91]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'token'\n\nThe error appears to be in '/home/derek/rancher/ansible-k3s/roles/k3s_server/tasks/main.yml': line 208, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n block:\n - name: Get the token from the first server\n ^ here\n"}
fatal: [192.168.1.92]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'token'\n\nThe error appears to be in '/home/derek/rancher/ansible-k3s/roles/k3s_server/tasks/main.yml': line 208, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n block:\n - name: Get the token from the first server\n ^ here\n"}
PLAY RECAP ****************************************************************************************************************************************************************************
192.168.1.90 : ok=21 changed=3 unreachable=0 failed=1 skipped=45 rescued=0 ignored=1
192.168.1.91 : ok=21 changed=3 unreachable=0 failed=1 skipped=61 rescued=0 ignored=1
192.168.1.92 : ok=21 changed=3 unreachable=0 failed=1 skipped=60 rescued=0 ignored=1
This is the exact same issue I ran into the first time I attempted to implement auto-generating tokens.
If you run a server + agent inventory on the Raspberry Pi cluster, the playbook works because those are separate roles, so they run sequentially (i.e. the server role gets executed, then the agent role). But the Vagrant provisioner just runs everything in parallel, so this system will never work.
I'm less concerned whether the Vagrantfile works; that can just be noted in the Vagrantfile as "requires token". But the above errors on regular SSH nodes are a blocker on this PR.
You might want to look into https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html#restricting-execution-with-throttle or other ways to control execution on nodes. It's possible there is some way of achieving: if no token exists, run the next task throttled/sequentially to ensure the other nodes can find the token var.
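As a rough sketch of that idea (assuming the default k3s token path; these are not the PR's actual tasks), one run_once-based alternative to throttling is to read the token once from the first server and publish it as a fact on every host, so later tasks never hit an undefined hostvars attribute:

```yaml
# Hypothetical sketch: fetch the node token from the first server once
# and distribute it as a fact to all hosts in the play.
- name: Read the token from the first server
  ansible.builtin.slurp:
    src: /var/lib/rancher/k3s/server/node-token  # default k3s token path
  register: node_token_b64
  run_once: true
  delegate_to: "{{ groups['server'][0] }}"

- name: Publish the token as a fact on every host
  ansible.builtin.set_fact:
    token: "{{ node_token_b64.content | b64decode | trim }}"
  run_once: true
  delegate_to: "{{ item }}"
  delegate_facts: true
  loop: "{{ ansible_play_hosts }}"
```

Note this only helps within a single playbook run; if Vagrant launches a separate run per machine, the fact still won't cross runs.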
Do you still have the complete log of the playbook execution that you can attach here?
Okay, nvm, I just read my own error logs. Let me fix it.
I seem to have found a separate issue around "Copy K3s service file" needing extra_server_args to be defined. I had stripped down my inventory.yaml to be super simple. I will open a separate PR to address this issue.
OK, I shall push another commit to address the new batch of lint errors.
If a token is not explicitly provided, let the first server generate a random one. The token is saved on the first server, and the playbook can retrieve it from there and store it as a fact. All other servers and agents can then use that token to join the cluster. It will be saved into their environment file as usual.
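A minimal sketch of the generation step, assuming the password lookup for randomness (task name and details are illustrative, not necessarily what the PR does):

```yaml
# Illustrative only: generate a random token on the first server when the
# user did not supply one; later tasks can read it back via hostvars.
- name: Generate a random token if none was provided
  ansible.builtin.set_fact:
    token: "{{ lookup('ansible.builtin.password', '/dev/null', length=64, chars=['ascii_letters', 'digits']) }}"
  when:
    - token is not defined
    - inventory_hostname == groups['server'][0]
```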
I tested this by creating a cluster of one server and then adding two more servers and one agent. Please let me know if I should try some other tests as well.
Changes
Linked Issues
https://github.com/k3s-io/k3s-ansible/issues/307