k3s-io / k3s-ansible


Generate token #375

Closed: anon-software closed this 2 weeks ago

anon-software commented 2 weeks ago

If a token is not explicitly provided, let the first server generate a random one. Such a token is saved on the first server, and the playbook can retrieve it from there and store it as a fact. All other servers and agents can use that token later to join the cluster. It will be saved into their environment file as usual.

I tested this by creating a cluster of one server and then adding two more servers and one agent. Please let me know if I should try some other tests as well.
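For context, a minimal sketch of the idea described above. The task names, the `server` group name, and the token path are assumptions for illustration, not the PR's actual tasks:

```yaml
# Hedged sketch only: read the token the first server generated and store it
# as a fact on that host, so later plays can reference it via hostvars.
- name: Read the generated token on the first server
  ansible.builtin.slurp:
    src: /var/lib/rancher/k3s/server/node-token  # assumed default k3s token path
  register: generated_token
  when: inventory_hostname == groups['server'][0]

- name: Save the token as a fact on the first server
  ansible.builtin.set_fact:
    token: "{{ generated_token.content | b64decode | trim }}"
  when: inventory_hostname == groups['server'][0]

# Other servers and agents would then reference it as:
#   {{ hostvars[groups['server'][0]]['token'] }}
```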

Changes

Linked Issues

https://github.com/k3s-io/k3s-ansible/issues/307

dereknola commented 2 weeks ago

This has been tried before. You need to test the case of 3 servers all at the start. There are issues with the fact being propagated correctly to the other two servers: bring-up on all the nodes happens asynchronously, so you cannot guarantee (and when I tried to implement this, it failed) that the fact will even exist for the other 2 servers to see and join with.

anon-software commented 2 weeks ago

Maybe I do not understand how Ansible works then. Doesn't the task "Init first server node" from roles/k3s_server/tasks/main.yml run first and terminate before "Start other server if any and verify status" runs? The former task will save the token, so it will always be available for the others being set up in the latter, or at least that is how I thought it would work.

Setting up three servers all at once was actually the first test I ran, although it is possible that it was a fluke that it worked.

dereknola commented 2 weeks ago

If you got it working, that's great! I'm gonna pull down your PR and check it out sometime later today or Monday.

dereknola commented 2 weeks ago

The CNCF requires that all commits be signed. Just follow the instructions: https://github.com/k3s-io/k3s-ansible/pull/375/checks?check_run_id=32818692853

dereknola commented 2 weeks ago

When testing with the Vagrantfile, I see the following error:

TASK [k3s_agent : Get the token from the first server] *************************
fatal: [agent-0]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'token'\n\nThe error appears to be in '/home/derek/rancher/ansible-k3s/roles/k3s_agent/tasks/main.yml': line 38, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Get the token from the first server\n  ^ here\n"}

It's possible that the Vagrant Ansible provisioner works differently than a regular ansible-playbook deployment. I'm testing with my local Pi cluster. Will update.
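As a hedged aside (the `server` group and `token` variable names are assumptions, not necessarily the PR's exact task), this is one plausible shape of the task that produces the error above. A hostvars lookup like this only succeeds if the first server has already set the `token` fact in the same ansible-playbook run, which per-machine provisioner runs do not guarantee:

```yaml
# Illustration of the failure mode: if the first server never ran its
# set_fact task in this playbook run, hostvars has no 'token' attribute.
- name: Get the token from the first server
  ansible.builtin.set_fact:
    token: "{{ hostvars[groups['server'][0]]['token'] }}"
```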

dereknola commented 2 weeks ago

So, interesting results. For the 3-Pi cluster, the first time I tested with 3 servers it installed fine. Then I ran the reset playbook and tried to run site.yaml again. This time it also failed with:

TASK [k3s_server : Get the token from the first server] *******************************************************************************************************************************
fatal: [192.168.1.91]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'token'\n\nThe error appears to be in '/home/derek/rancher/ansible-k3s/roles/k3s_server/tasks/main.yml': line 208, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n  block:\n    - name: Get the token from the first server\n      ^ here\n"}
fatal: [192.168.1.92]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'token'\n\nThe error appears to be in '/home/derek/rancher/ansible-k3s/roles/k3s_server/tasks/main.yml': line 208, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n  block:\n    - name: Get the token from the first server\n      ^ here\n"}

PLAY RECAP ****************************************************************************************************************************************************************************
192.168.1.90               : ok=21   changed=3    unreachable=0    failed=1    skipped=45   rescued=0    ignored=1   
192.168.1.91               : ok=21   changed=3    unreachable=0    failed=1    skipped=61   rescued=0    ignored=1   
192.168.1.92               : ok=21   changed=3    unreachable=0    failed=1    skipped=60   rescued=0    ignored=1   

This is the exact same issue I ran into the first time I attempted to implement auto generating tokens.

dereknola commented 2 weeks ago

Running a server + agent inventory on the Raspberry Pi cluster, the playbook works because those are separate roles, so they run sequentially (i.e. the server role gets executed, then the agent role). But the Vagrant provisioner just runs everything in parallel, so this system will never work there.

I'm less concerned about whether the Vagrantfile works; that can just be noted in the Vagrantfile as "requires token". But the above errors on regular SSH nodes are a blocker on this PR.

You might want to look into https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html#restricting-execution-with-throttle or other ways to control execution on nodes. It's possible there is some way of achieving this: if no token exists, run the next task throttled/sequentially to ensure the other nodes can find the token var.
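One hedged sketch of that direction, under the same assumptions as before (the `server` group name and token path are illustrative, not the PR's code): delegate the token read to the first server and mark it `run_once`, since Ansible applies the results of a run_once task to every host in the current batch, so the joining nodes can set the fact without reaching into hostvars.

```yaml
# Rough sketch: run the read exactly once, on the first server, and let the
# registered result be shared with all hosts in the play batch.
- name: Read the cluster token from the first server
  ansible.builtin.slurp:
    src: /var/lib/rancher/k3s/server/node-token  # assumed k3s token path
  delegate_to: "{{ groups['server'][0] }}"
  run_once: true
  register: first_server_token

- name: Expose the token as a fact on every joining node
  ansible.builtin.set_fact:
    token: "{{ first_server_token.content | b64decode | trim }}"
```

Within a single ansible-playbook run this sidesteps the ordering problem; it still cannot help the Vagrant provisioner case, where each VM is provisioned by its own separate playbook run.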

anon-software commented 2 weeks ago

Do you still have the complete log of the playbook execution that you can attach here?

dereknola commented 2 weeks ago

cluster.log

dereknola commented 2 weeks ago

Okay, never mind, I just read my own error logs. Let me fix it.

dereknola commented 2 weeks ago

I seem to have found a separate issue around "Copy K3s service file" needing extra_server_args to be defined. I had stripped down my inventory.yaml to be super simple. I will open a separate PR to address this issue.

anon-software commented 2 weeks ago

OK, I shall push another commit to address the new batch of Lint errors.