Closed: dgarner-cg closed this issue 9 months ago.
This is insanely frustrating.
I've added the following to /roles/.../download_files.yml
- name: Download_file | Download item
  block:
    - name: Download file
      get_url:
        url: "{{ valid_mirror_urls | random }}"
        dest: "{{ file_path_cached if download_force_cache else download.dest }}"
        owner: "{{ omit if download_localhost else (download.owner | default(omit)) }}"
        mode: "{{ omit if download_localhost else (download.mode | default(omit)) }}"
        checksum: "{{ 'sha256:' + download.sha256 if download.sha256 else omit }}"
        validate_certs: "{{ download_validate_certs }}"
        url_username: "{{ download.username | default(omit) }}"
        url_password: "{{ download.password | default(omit) }}"
        force_basic_auth: "{{ download.force_basic_auth | default(omit) }}"
        timeout: "{{ download.timeout | default(omit) }}"
      delegate_to: "{{ download_delegate if download_force_cache else inventory_hostname }}"
      run_once: "{{ download_force_cache }}"
      register: get_url_result
      become: "{{ not download_localhost }}"
      environment: "{{ proxy_env }}"
      no_log: "{{ not (unsafe_show_logs | bool) }}"

    - name: Handle Download Errors
      fail:
        msg: "Download failed: {{ get_url_result.msg }}"
      when: get_url_result.failed

  rescue:
    - name: Retry on failure
      debug:
        msg: "Retrying download..."
      register: retry_debug_result
      until: "'OK' in get_url_result.msg or 'file already exists' in get_url_result.msg"
      retries: "{{ download_retries }}"
      delay: "{{ retry_stagger | default(5) }}"
      when: retry_debug_result is not defined or retry_debug_result.failed

  always:
    - name: Print Results
      debug:
        var: get_url_result
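(Side note for anyone copying this snippet: the until/retries in the rescue: section sit on a debug task, so they only re-print the message and never re-run get_url. A minimal sketch of attaching the retry to the download task itself, reusing the same variables as above:)

- name: Download file (retry the download task itself)
  get_url:
    url: "{{ valid_mirror_urls | random }}"
    dest: "{{ file_path_cached if download_force_cache else download.dest }}"
    checksum: "{{ 'sha256:' + download.sha256 if download.sha256 else omit }}"
    validate_certs: "{{ download_validate_certs }}"
    timeout: "{{ download.timeout | default(omit) }}"
  register: get_url_result
  # until/retries on this task re-run get_url until it succeeds,
  # unlike a retry loop placed on a separate debug task in rescue:
  until: get_url_result is succeeded
  retries: "{{ download_retries }}"
  delay: "{{ retry_stagger | default(5) }}"
  environment: "{{ proxy_env }}"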
And below is further output ...
https://gist.github.com/dgarner-cg/064541f36bbac6b3ea49590f759989b0
Bro, same ish on all Ubuntu systems.. what the f.
Hi. I have the same issue on Ubuntu 22.04 LTS Kubespray Release 2.23.1
I am .. making progress, I have literally been working on this for a week.
I experience very similar problems; DNS issues come up all the time. Currently I'm trying to add a new node and a cp node using cluster.yml and scale.yml, and it always results in the servers not being able to download anything, because kubespray has already updated their /etc/systemd/resolved.conf to resolve via coredns while they don't have access to coredns yet :(
I'm very happy that I'm only running a test cluster.
Thanks for your feedback. I want to say that all the nodes are reachable via valid DNS, but I will check.. I know my outside-of-cluster installer controller and 2 k8s-controller nodes all have valid DNS from here to Google, but I also had no idea about the CoreDNS issue either ..
I am looking, as I have time, at ensuring Cilium is used across all files and at using a local repo, but work has picked up going into the holidays; I just got off a 7-week straight, 24/7 on-call stretch. :D
I will take a look at this again in a few moments and hope to knock it out.
I finally managed to add the new node. When I saw in the ansible output that it had just updated the /etc/systemd/resolved.conf file, I quickly opened it on the new node and changed the line:
DNS=10.233.0...
to
DNS=1.1.1.1
and ran:
systemctl restart systemd-resolved.service
This way the node managed to finish all the downloads executed by ansible, and by the end resolved.conf had already been changed back to use the coredns service as a resolver.
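(If that race shows up again, the same manual workaround can be expressed as a couple of throwaway tasks against the new node. This is just a sketch, with 1.1.1.1 as a stand-in upstream resolver; Kubespray rewrites resolved.conf back to CoreDNS later in the run, as described above, so nothing needs reverting by hand:)

- name: Temporarily point systemd-resolved at an upstream resolver (stand-in: 1.1.1.1)
  ansible.builtin.lineinfile:
    path: /etc/systemd/resolved.conf
    regexp: '^#?DNS='
    line: 'DNS=1.1.1.1'
  become: true

- name: Restart systemd-resolved so the temporary DNS setting takes effect
  ansible.builtin.systemd:
    name: systemd-resolved
    state: restarted
  become: true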
Btw, I also had to set enable_nodelocaldns to false yesterday, because I had a similar resolving problem while rolling out some changes using kubespray. At one point the nodes couldn't resolve anything, probably because the nodelocaldns iptables rules weren't ready yet.
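(For reference, that toggle lives in the cluster group vars; the path below matches the sample inventory layout and may differ in your own inventory:)

# inventory/<your-cluster>/group_vars/k8s_cluster/k8s-cluster.yml
enable_nodelocaldns: false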
So DNS feels generally very fragile with Kubespray.
No idea if it's new but GitHub now gives a 401 Forbidden for me when validating mirrors in Kubespray
See:
curl -vJL -X HEAD https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.24.0/crictl-v1.24.0-linux-amd64.tar.gz
I think I'm suffering from the same. Have been at it for hours and hours.
Some logs from when I removed no_log: true:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: TypeError: HTTPSConnection.__init__() got an unexpected keyword argument 'cert_file'
failed: [fcos-test20-w2] (item=https://github.com/opencontainers/runc/releases/download/v1.1.9/runc.amd64) => {"ansible_loop_var": "mirror", "attempts": 4, "changed": false, "elapsed": 0, "mirror": "https://github.com/opencontainers/runc/releases/download/v1.1.9/runc.amd64", "msg": "Status code was -1 and not [200]: An unknown error occurred: HTTPSConnection.__init__() got an unexpected keyword argument 'cert_file'", "redirected": false, "status": -1, "url": "https://github.com/opencontainers/runc/releases/download/v1.1.9/runc.amd64"}
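(That cert_file TypeError is usually a sign that the Python interpreter Ansible is using for that host is Python 3.12, which removed the cert_file argument from HTTPSConnection that older ansible-core still passes; upgrading ansible-core, or pointing ansible_python_interpreter at an older Python, is the usual fix. A quick throwaway play to confirm which interpreter is in use, with the host name taken from the log above:)

- name: Check which Python interpreter Ansible uses on the failing host
  hosts: fcos-test20-w2
  gather_facts: true
  tasks:
    - name: Show the discovered interpreter and its version
      ansible.builtin.debug:
        msg: "{{ ansible_python.executable }} is {{ ansible_python_version }}"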
As a follow-on: it does seem pretty well isolated to GitHub (from what I can see). It could also be DNS, but I downloaded the files specified (runc, crictl) that come from GitHub, served them up real quick with python -m http.server on my ansible host, and edited roles/download/defaults/main/main.yml to point to my host, and it seemed to work without issue. The other files (i.e. crio, etc.) that are on storage.googleapis.com and elsewhere seem to resolve fine for me.
I'll keep plugging away when I have time.
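(For anyone wanting to reproduce that local-mirror workaround without editing the role defaults in place, the same variables can be overridden from group vars. The variable names below assume the usual *_download_url entries from the download role defaults; the host and port are stand-ins for an ansible host running python -m http.server 8000:)

# group_vars override pointing the GitHub-hosted artifacts at a local mirror
runc_download_url: "http://192.0.2.10:8000/runc.{{ image_arch }}"
crictl_download_url: "http://192.0.2.10:8000/crictl-{{ crictl_version }}-linux-{{ image_arch }}.tar.gz"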
Similar issue here- #10571
I get a 200 again today. Still modified the download role to check with GET instead of HEAD in order to deploy.
diff --git a/roles/download/tasks/download_file.yml b/roles/download/tasks/download_file.yml
index 376a15e8a..88f83c8cb 100644
--- a/roles/download/tasks/download_file.yml
+++ b/roles/download/tasks/download_file.yml
@@ -55,7 +55,7 @@
 - name: download_file | Validate mirrors
   uri:
     url: "{{ mirror }}"
-    method: HEAD
+    method: GET
     validate_certs: "{{ download_validate_certs }}"
     url_username: "{{ download.username | default(omit) }}"
     url_password: "{{ download.password | default(omit) }}"
Just in case it's a random bug in GitHub's cache system or something.
It looks like I am having a similar issue; this is my output when running with the block above and outputting get_url_result:
ok: [workernode-3] => {
    "get_url_result": {
        "attempts": 4,
        "changed": false,
        "checksum_dest": null,
        "checksum_src": "d11d2f438da1892c8b1bdfc638ddb6764dbd0e2c",
        "dest": "/tmp/releases/runc-v1.1.9.arm64",
        "elapsed": 0,
        "failed": true,
        "msg": "Destination /tmp/releases does not exist",
        "src": "/home/mb/.ansible/tmp/ansible-tmp-1702614795.2398012-24550-25178834028643/tmpr_hbccf9",
        "url": "https://github.com/opencontainers/runc/releases/download/v1.1.9/runc.arm64"
    }
}
Please note: "Destination /tmp/releases does not exist" is not the issue, as it fails with the same msg even after adding an explicit file task beforehand to create the directory.
Edit: There is no checksum issue.
I will try v2.22.1 and other versions and investigate the difference if I get it to work.
Never mind me, TIL --check has major limitations. In my defense, this is my first Ansible playbook outside of tutorials.
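(For future readers hitting the same "Destination /tmp/releases does not exist": under --check the earlier tasks that create /tmp/releases only report what they would do, so nothing is actually on disk when the download task validates its dest. A task can be forced to run for real even during a dry run; this is illustrative only, not Kubespray's own code:)

- name: Ensure the download directory exists even under --check (illustrative)
  ansible.builtin.file:
    path: /tmp/releases
    state: directory
    mode: "0755"
  check_mode: false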
Kubespray version (commit) (git rev-parse --short HEAD): 22f58a5
I can't find this commit in the repository. From your gist:
{{ etcd_supported_versions[kube_major_version] }}: 'dict object' has no attribute 'v1.28'. 'dict object' has no attribute 'v1.28'. {{ etcd_supported_versions[kube_major_version] }}: 'dict object' has no attribute 'v1.28'. 'dict object' has no attribute 'v1.28'

The error appears to be in '/etc/ansible/usr-playbooks/cg-k8-ctrl/roles/download/tasks/download_file.yml': line 10, column 5, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  - name: Download_file | Starting download of file
    ^ here
Looks like you tried to use Kubernetes 1.28 with a Kubespray version that doesn't support it. I'm going to close this; feel free to reopen if you actually still encounter a bug. /close
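(Context for the error above: the defaults resolve the etcd version through a lookup table keyed by kube_major_version, roughly of this shape; the keys and values here are illustrative placeholders, the real table ships with each Kubespray release:)

etcd_supported_versions:
  v1.27: "v3.5.x"   # illustrative placeholder values
  v1.26: "v3.5.x"
etcd_version: "{{ etcd_supported_versions[kube_major_version] }}"

(So "'dict object' has no attribute 'v1.28'" simply means the checked-out commit has no entry for Kubernetes 1.28; checking out a release that lists it, or pinning kube_version to one of the listed majors, makes the lookup succeed.)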
@VannTen: Closing this issue.
I have attempted everything to resolve this issue for over a week, and it's getting frustrating. I've attempted this with the newest version of everything involved (Ansible, Kubespray, etc.) on the standard OS, and alternatively in a venv with the requirements.txt versions of everything. I've tried to eliminate all possible troubleshooting options before posting, and it always comes down to the same Download section. I'm seeing a lot of info I haven't seen before with this inventory, but I have to run to the pharmacy before it closes and want to post immediately; it's actually been an issue for more like 3 weeks, I've just consistently focused on it for the last week.
fyi..
• pve-cos-pri: outside server not involved in the k8s cluster; this can be considered the primary server of the network.
• pve-k8s-...: obviously the cluster.
• Primary network subnet: 10.0.0.0/24
• "DHCP" slots for k8s: 10.0.0.82 - 88
• The DNS subdomain is separate for the k8s cluster and is on the mgmt.sub.domain.tld portion.
• All machines have good dhcp/dns/resolve and curl ifconfig.me properly.
Can't think of much else outside of the process it could be / to run through.. now onto the other stuff and I'll be back later.
Thanks guys,
Environment: local bare-metal Proxmox, dual-socket Xeon Gold 6148, 80 cores, with 256 GB RAM.
7 Node K8s Cluster, all the same.
Version of Ansible (ansible --version): 2.14.11
Version of Python (python --version): Python 3.11.2
Kubespray version (commit) (git rev-parse --short HEAD): 22f58a5
Network plugin used: Calico
Full inventory with variables:
https://gist.github.com/dgarner-cg/c5ea336fdc78b369145cf52cd075dfee
Command used to invoke ansible:
ansible-playbook \
  -i inventory/k8-mg/hosts.yaml \
  --private-key=~/.ssh/id_rsa \
  -u root \
  --become \
  cluster.yml
Output of ansible run:
https://gist.github.com/dgarner-cg/3f57fe502a970ead3529ac7fd836b043
Anything else we need to know: I would look into why this is throwing as it may be another issue, but I've got to run out before the pharmacy closes real quick..
https://gist.github.com/dgarner-cg/d055057c89634705e8366b14208c5223