Closed: dgarner-cg closed this issue 9 months ago.
This is insanely frustrating.
I've added the following to /roles/.../download_files.yml
- name: Download_file | Download item
  block:
    - name: Download file
      get_url:
        url: "{{ valid_mirror_urls | random }}"
        dest: "{{ file_path_cached if download_force_cache else download.dest }}"
        owner: "{{ omit if download_localhost else (download.owner | default(omit)) }}"
        mode: "{{ omit if download_localhost else (download.mode | default(omit)) }}"
        checksum: "{{ 'sha256:' + download.sha256 if download.sha256 else omit }}"
        validate_certs: "{{ download_validate_certs }}"
        url_username: "{{ download.username | default(omit) }}"
        url_password: "{{ download.password | default(omit) }}"
        force_basic_auth: "{{ download.force_basic_auth | default(omit) }}"
        timeout: "{{ download.timeout | default(omit) }}"
      delegate_to: "{{ download_delegate if download_force_cache else inventory_hostname }}"
      run_once: "{{ download_force_cache }}"
      register: get_url_result
      become: "{{ not download_localhost }}"
      environment: "{{ proxy_env }}"
      no_log: "{{ not (unsafe_show_logs | bool) }}"

    - name: Handle Download Errors
      fail:
        msg: "Download failed: {{ get_url_result.msg }}"
      when: get_url_result.failed

  rescue:
    - name: Retry on failure
      debug:
        msg: "Retrying download..."
      register: retry_debug_result
      until: "'OK' in get_url_result.msg or 'file already exists' in get_url_result.msg"
      retries: "{{ download_retries }}"
      delay: "{{ retry_stagger | default(5) }}"
      when: retry_debug_result is not defined or retry_debug_result.failed

  always:
    - name: Print Results
      debug:
        var: get_url_result
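(Side note for anyone copying this snippet: the until/retries in the rescue: section sit on a debug task, so they only re-print the message and never re-run get_url. A minimal sketch of attaching the retry to the download task itself, reusing the same variables as above:)

- name: Download file (retry the download task itself)
  get_url:
    url: "{{ valid_mirror_urls | random }}"
    dest: "{{ file_path_cached if download_force_cache else download.dest }}"
    checksum: "{{ 'sha256:' + download.sha256 if download.sha256 else omit }}"
    validate_certs: "{{ download_validate_certs }}"
    timeout: "{{ download.timeout | default(omit) }}"
  register: get_url_result
  # until/retries on this task re-run get_url until it succeeds,
  # unlike a retry loop placed on a separate debug task in rescue:
  until: get_url_result is succeeded
  retries: "{{ download_retries }}"
  delay: "{{ retry_stagger | default(5) }}"
  environment: "{{ proxy_env }}"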
And below is further output ...
https://gist.github.com/dgarner-cg/064541f36bbac6b3ea49590f759989b0
Bro, same ish on all Ubuntu systems.. what the f.
Hi. I have the same issue on Ubuntu 22.04 LTS Kubespray Release 2.23.1
I am .. making progress, I have literally been working on this for a week.
I experience very similar problems; DNS issues come up all the time. Currently I'm trying to add a new node and a cp node using cluster.yml and scale.yml, and it always results in the servers not being able to download anything, because kubespray has already updated their /etc/systemd/resolved.conf to resolve via coredns while they don't have access to coredns yet :(
I'm very happy that I'm only running a test cluster.
Thanks for your feedback. I want to say that all the nodes are reachable via valid DNS, but I will check.. I know my outside-of-cluster installer controller and 2 k8s-controller nodes all have valid DNS from here to Google, but I also had no idea about the CoreDNS issue either ..
I am looking, as I have time, at ensuring Cilium is used across all files and at using a local repo, but work has picked up going into the holidays; I just got off a 7-week straight, 24/7 on-call stretch. :D
I will take a look at this again in a few moments and hope to knock it out.
I finally managed to add the new node. When I saw in the ansible output that it had just updated the /etc/systemd/resolved.conf file, I quickly opened it on the new node and changed the line:
DNS=10.233.0...
to
DNS=1.1.1.1
and ran:
systemctl restart systemd-resolved.service
This way the node managed to finish all the downloads executed by ansible, and by the end resolved.conf had already been changed back to use the coredns service as a resolver.
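(If that race shows up again, the same manual workaround can be expressed as a couple of throwaway tasks against the new node. This is just a sketch, with 1.1.1.1 as a stand-in upstream resolver; Kubespray rewrites resolved.conf back to CoreDNS later in the run, as described above, so nothing needs reverting by hand:)

- name: Temporarily point systemd-resolved at an upstream resolver (stand-in: 1.1.1.1)
  ansible.builtin.lineinfile:
    path: /etc/systemd/resolved.conf
    regexp: '^#?DNS='
    line: 'DNS=1.1.1.1'
  become: true

- name: Restart systemd-resolved so the temporary DNS setting takes effect
  ansible.builtin.systemd:
    name: systemd-resolved
    state: restarted
  become: true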
Btw, I also had to set enable_nodelocaldns to false yesterday, because I had a similar resolving problem while rolling out some changes using kubespray. At one point the nodes couldn't resolve anything, probably because the nodelocaldns iptables rules weren't ready yet.
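(For reference, that toggle lives in the cluster group vars; the path below matches the sample inventory layout and may differ in your own inventory:)

# inventory/<your-cluster>/group_vars/k8s_cluster/k8s-cluster.yml
enable_nodelocaldns: false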
So DNS feels generally very fragile with Kubespray.
No idea if it's new but GitHub now gives a 401 Forbidden for me when validating mirrors in Kubespray
See:
curl -vJL -X HEAD https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.24.0/crictl-v1.24.0-linux-amd64.tar.gz
I think I'm suffering from the same. Have been at it for hours and hours.
Some logs from when I removed no_log: true:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: TypeError: HTTPSConnection.__init__() got an unexpected keyword argument 'cert_file'
failed: [fcos-test20-w2] (item=https://github.com/opencontainers/runc/releases/download/v1.1.9/runc.amd64) => {"ansible_loop_var": "mirror", "attempts": 4, "changed": false, "elapsed": 0, "mirror": "https://github.com/opencontainers/runc/releases/download/v1.1.9/runc.amd64", "msg": "Status code was -1 and not [200]: An unknown error occurred: HTTPSConnection.__init__() got an unexpected keyword argument 'cert_file'", "redirected": false, "status": -1, "url": "https://github.com/opencontainers/runc/releases/download/v1.1.9/runc.amd64"}
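(That cert_file TypeError is usually a sign that the Python interpreter Ansible is using for that host is Python 3.12, which removed the cert_file argument from HTTPSConnection that older ansible-core still passes; upgrading ansible-core, or pointing ansible_python_interpreter at an older Python, is the usual fix. A quick throwaway play to confirm which interpreter is in use, with the host name taken from the log above:)

- name: Check which Python interpreter Ansible uses on the failing host
  hosts: fcos-test20-w2
  gather_facts: true
  tasks:
    - name: Show the discovered interpreter and its version
      ansible.builtin.debug:
        msg: "{{ ansible_python.executable }} is {{ ansible_python_version }}"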
As a follow-on: it does seem pretty well isolated to GitHub (from what I can see). It could also be DNS, but I downloaded the files specified (runc, crictl) that come from GitHub, served them up real quick with python -m http.server on my ansible host, and edited roles/download/defaults/main/main.yml to point to my host, and it seemed to work without issue. The other files (i.e. crio, etc.) that are on storage.googleapis.com and elsewhere seem to resolve fine for me.
I'll keep plugging away when I have time.
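(For anyone wanting to reproduce that local-mirror workaround without editing the role defaults in place, the same variables can be overridden from group vars. The variable names below assume the usual *_download_url entries from the download role defaults; the host and port are stand-ins for an ansible host running python -m http.server 8000:)

# group_vars override pointing the GitHub-hosted artifacts at a local mirror
runc_download_url: "http://192.0.2.10:8000/runc.{{ image_arch }}"
crictl_download_url: "http://192.0.2.10:8000/crictl-{{ crictl_version }}-linux-{{ image_arch }}.tar.gz"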
Similar issue here- #10571
I get a 200 again today. Still modified the download role to check with GET instead of HEAD in order to deploy.
diff --git a/roles/download/tasks/download_file.yml b/roles/download/tasks/download_file.yml
index 376a15e8a..88f83c8cb 100644
--- a/roles/download/tasks/download_file.yml
+++ b/roles/download/tasks/download_file.yml
@@ -55,7 +55,7 @@
 - name: download_file | Validate mirrors
   uri:
     url: "{{ mirror }}"
-    method: HEAD
+    method: GET
     validate_certs: "{{ download_validate_certs }}"
     url_username: "{{ download.username | default(omit) }}"
     url_password: "{{ download.password | default(omit) }}"
Just in case it's a random bug in GitHub's cache system or something.
It looks like I am having a similar issue; this is my output when running with the block above and outputting get_url_result:
ok: [workernode-3] => {
    "get_url_result": {
        "attempts": 4,
        "changed": false,
        "checksum_dest": null,
        "checksum_src": "d11d2f438da1892c8b1bdfc638ddb6764dbd0e2c",
        "dest": "/tmp/releases/runc-v1.1.9.arm64",
        "elapsed": 0,
        "failed": true,
        "msg": "Destination /tmp/releases does not exist",
        "src": "/home/mb/.ansible/tmp/ansible-tmp-1702614795.2398012-24550-25178834028643/tmpr_hbccf9",
        "url": "https://github.com/opencontainers/runc/releases/download/v1.1.9/runc.arm64"
    }
}
Please note: "Destination /tmp/releases does not exist" is not the issue, as it fails with the same msg even after adding an explicit file task beforehand to create the directory.
Edit: There is no checksum issue.
I will try v2.22.1 and other versions and investigate the difference if I get it to work.
Never mind me, TIL --check has major limitations. In my defense, this is my first Ansible playbook outside of tutorials.
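(For future readers hitting the same "Destination /tmp/releases does not exist": under --check the earlier tasks that create /tmp/releases only report what they would do, so nothing is actually on disk when the download task validates its dest. A task can be forced to run for real even during a dry run; this is illustrative only, not Kubespray's own code:)

- name: Ensure the download directory exists even under --check (illustrative)
  ansible.builtin.file:
    path: /tmp/releases
    state: directory
    mode: "0755"
  check_mode: false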
Kubespray version (commit) (git rev-parse --short HEAD): 22f58a5
I can't find this commit in the repository. From your gist:
{{ etcd_supported_versions[kube_major_version] }}: 'dict object' has no attribute 'v1.28'. 'dict object' has no attribute 'v1.28'. {{ etcd_supported_versions[kube_major_version] }}: 'dict object' has no attribute 'v1.28'. 'dict object' has no attribute 'v1.28'

The error appears to be in '/etc/ansible/usr-playbooks/cg-k8-ctrl/roles/download/tasks/download_file.yml': line 10, column 5, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  - name: Download_file | Starting download of file
    ^ here
Looks like you tried to use Kubernetes 1.28 with a Kubespray version that doesn't support it. I'm going to close this; feel free to reopen if you actually still encounter a bug. /close
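(Context for the error above: the defaults resolve the etcd version through a lookup table keyed by kube_major_version, roughly of this shape; the keys and values here are illustrative placeholders, the real table ships with each Kubespray release:)

etcd_supported_versions:
  v1.27: "v3.5.x"   # illustrative placeholder values
  v1.26: "v3.5.x"
etcd_version: "{{ etcd_supported_versions[kube_major_version] }}"

(So "'dict object' has no attribute 'v1.28'" simply means the checked-out commit has no entry for Kubernetes 1.28; checking out a release that lists it, or pinning kube_version to one of the listed majors, makes the lookup succeed.)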
@VannTen: Closing this issue.
I have attempted everything to resolve this issue for over a week, and it's getting frustrating. I've attempted this with the newest version of everything involved (Ansible, Kubespray, etc.) on the standard OS, and alternatively in a venv with the requirements.txt versions of everything. I've tried to eliminate all possible troubleshooting options before posting, and it always comes down to the same Download section. I'm seeing a lot of info I haven't seen before with this inventory, but I have to run to the pharmacy before it closes and want to post immediately; it's actually been an issue for more like 3 weeks, I've just consistently focused on it for the last week.
fyi..
• pve-cos-pri: outside server not involved in the k8s cluster; this can be considered the primary server of the network.
• pve-k8s-...: obviously the cluster.
• Primary network subnet: 10.0.0.0/24
• "DHCP" slots for k8s: 10.0.0.82 - 88
• The DNS subdomain is separate for the k8s cluster and is on the mgmt.sub.domain.tld portion.
• All machines have good dhcp/dns/resolve and curl ifconfig.me properly.
Can't think of much else outside of the process it could be / to run through.. now onto the other stuff and I'll be back later.
Thanks guys,
Environment: local bare-metal Proxmox, dual-socket Xeon Gold 6148, 80 cores, with 256 GB RAM.
7 Node K8s Cluster, all the same.
Version of Ansible (ansible --version): 2.14.11
Version of Python (python --version): Python 3.11.2
Kubespray version (commit) (git rev-parse --short HEAD): 22f58a5
Network plugin used: Calico
Full inventory with variables:
https://gist.github.com/dgarner-cg/c5ea336fdc78b369145cf52cd075dfee
Command used to invoke ansible:
ansible-playbook \
  -i inventory/k8-mg/hosts.yaml \
  --private-key=~/.ssh/id_rsa \
  -u root \
  --become \
  cluster.yml
Output of ansible run:
https://gist.github.com/dgarner-cg/3f57fe502a970ead3529ac7fd836b043
Anything else we need to know: I would look into why this is throwing as it may be another issue, but I've got to run out before the pharmacy closes real quick..
https://gist.github.com/dgarner-cg/d055057c89634705e8366b14208c5223