kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

Galaxy install missing files #11706

Closed bjetal closed 6 days ago

bjetal commented 1 week ago

What happened?

Using Kubespray v2.25.0 via a Galaxy install from GitHub (as part of a larger Ansible process).

While installing Calico with the API server enabled, I got an error containing: Could not find or access 'openssl.conf'

Investigating showed that even though the file existed in the Git repository in the files directory of the network_plugins/calico role, it did not exist in the installed collection.

What did you expect to happen?

Successful install of Calico including API Server and the rest of Kubernetes

How can we reproduce it (as minimally and precisely as possible)?

Set up a one-node inventory with calico_apiserver_enabled set to true, then run the following:


```
ansible-galaxy install git+https://github.com/kubernetes-sigs/kubespray,v2.26.0
ansible-playbook kubernetes_sigs.kubespray.cluster.yml -i <inventory> -b
```

### OS

Ansible execution:
Darwin 23.6.0 arm64

Inventory node:
Linux 5.14.0-284.30.1.el9_2.x86_64 x86_64
NAME="Red Hat Enterprise Linux"
VERSION="9.2 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.2"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.2 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.2
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.2"

### Version of Ansible

ansible [core 2.16.12]
  config file = /Users/robert.mitchell/workspaces/vocera-new/kubernetes-deploy/ansible.cfg
  configured module search path = ['/Users/robert.mitchell/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /Users/robert.mitchell/.pyenv/versions/3.13.0/lib/python3.13/site-packages/ansible
  ansible collection location = /Users/robert.mitchell/.ansible/collections:/usr/share/ansible/collections
  executable location = /Users/robert.mitchell/.pyenv/versions/3.13.0/bin/ansible
  python version = 3.13.0 (main, Oct 15 2024, 19:26:31) [Clang 16.0.0 (clang-1600.0.26.3)] (/Users/robert.mitchell/.pyenv/versions/3.13.0/bin/python3.13)
  jinja version = 3.1.4
  libyaml = True

### Version of Python

Python 3.13.0

### Version of Kubespray (commit)

tag v2.26.0

### Network plugin used

calico

### Full inventory with variables

rayo-4 | SUCCESS => {
    "hostvars[inventory_hostname]": {
        "ansible_check_mode": false,
        "ansible_config_file": null,
        "ansible_diff_mode": false,
        "ansible_facts": {},
        "ansible_forks": 5,
        "ansible_host": "rayo-4.vcraeng.com",
        "ansible_inventory_sources": [
            "/Users/robert.mitchell/t2/inventory-simple2"
        ],
        "ansible_playbook_python": "/Users/robert.mitchell/.pyenv/versions/3.13.0/bin/python3.13",
        "ansible_user": "tpx-admin",
        "ansible_verbosity": 0,
        "ansible_version": {
            "full": "2.16.12",
            "major": 2,
            "minor": 16,
            "revision": 12,
            "string": "2.16.12"
        },
        "calico_apiserver_enabled": true,
        "group_names": [
            "etcd",
            "k8s_cluster",
            "kube_control_plane",
            "kube_node"
        ],
        "groups": {
            "all": [
                "rayo-4"
            ],
            "etcd": [
                "rayo-4"
            ],
            "k8s_cluster": [
                "rayo-4"
            ],
            "kube_control_plane": [
                "rayo-4"
            ],
            "kube_node": [
                "rayo-4"
            ],
            "ungrouped": []
        },
        "inventory_dir": "/Users/robert.mitchell/t2/inventory-simple2",
        "inventory_file": "/Users/robert.mitchell/t2/inventory-simple2/hosts.yml",
        "inventory_hostname": "rayo-4",
        "inventory_hostname_short": "rayo-4",
        "omit": "__omit_place_holder__a5795f6b9707da5dff1b0af78c75ed9e0c82f94f",
        "playbook_dir": "/Users/robert.mitchell/t2"
    }
}

### Command used to invoke ansible

DISPLAY_SKIPPED_HOSTS=false ansible-playbook kubernetes_sigs.kubespray.cluster.yml -i inventory-simple2 -b

### Output of ansible run

https://gist.github.com/bjetal/b834b61993bda7a085352c0c67038363

### Anything else we need to know

This appears to be caused by the switch to the `manifest` key instead of `excludes` in `galaxy.yml` (revision 870049523fa6d8c1bb2dc069a8f889881f1c2d4a). Running `ansible-galaxy collection build -vvv` shows that, by default, only a limited set of file extensions is included from roles. Files without an extension and files with, for example, the `.conf` extension are not included.

One possible fix: add an additional entry to the manifest, `recursive-include roles **`, to ensure all files in the roles directory are included.
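For illustration, the suggested directive might sit in `galaxy.yml` roughly as follows. This is a sketch, assuming the `manifest.directives` list format that ansible-galaxy documents for collection builds. The other keys in Kubespray's actual file are not shown in this report and are omitted here.

```yaml
# Sketch only: the rest of galaxy.yml (namespace, name, version, ...) is omitted.
manifest:
  directives:
    # Include every file under roles/, regardless of extension.
    - recursive-include roles **
```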

This is not the only missing file due to the manifest configuration. Based on a quick comparison, besides a few files that appear to be non-critical (e.g. OWNERS files), there are several other missing `.conf` files as well as missing shell scripts.

VannTen commented 1 week ago

Ouch, this really emphasizes that our CI does not fully cover collection usage :/

VannTen commented 1 week ago

870049523fa6d8c1bb2dc069a8f889881f1c2d4a is only in 2.26 though, so I'm not sure how you have this on 2.25?

Could you check whether the linked PR fixes your issue?

bjetal commented 1 week ago

The 2.25 was a silly mistake on my part. The rest of the details do reference 2.26 and that is what I was using.

bjetal commented 1 week ago

Referencing version 2.25 was a mental error on my part. I was actually using 2.26 and my steps to reproduce use that version.

The PR should fix the immediate issue I had, but it does not appear to fix the problem entirely. This is a list I put together of the files that are missing (not including tests, OWNERS files, one README.md and one .gitkeep file). Note that several of the files do not have an extension:

roles/network_plugin/macvlan/files/ifup-local
roles/network_plugin/macvlan/files/ifup-macvlan
roles/network_plugin/macvlan/files/ifdown-local
roles/network_plugin/macvlan/files/ifdown-macvlan
roles/network_plugin/calico/files/openssl.conf
roles/bootstrap-os/files/bootstrap.sh
roles/kubernetes/preinstall/files/dhclient_nodnsupdate
roles/kubernetes/tokens/files/kube-gen-token.sh
roles/container-engine/docker/files/cleanup-docker-orphans.sh
roles/container-engine/youki/molecule/default/files/10-mynet.conf
roles/container-engine/cri-dockerd/molecule/default/files/10-mynet.conf
roles/container-engine/gvisor/molecule/default/files/10-mynet.conf
roles/container-engine/cri-o/files/mounts.conf
roles/container-engine/cri-o/molecule/default/files/10-mynet.conf
roles/container-engine/kata-containers/molecule/default/files/10-mynet.conf

I got this list by running a find command on the roles directory that was designed to return only files not matching the extensions that the Galaxy manifest process includes by default.
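The exact command is not shown in the thread. A minimal sketch of such a search follows, assuming the default role includes cover only the `.yml`, `.yaml`, `.json` and `.j2` extensions. That set is an assumption and should be checked against the `ansible-galaxy collection build -vvv` output. The function name `list_unpackaged_role_files` is hypothetical.

```shell
# List files under a roles directory whose names do NOT match the
# extensions assumed to be packaged by default (.yml, .yaml, .json, .j2).
# The extension set is an assumption; adjust it to match the build output.
list_unpackaged_role_files() {
    roles_dir="${1:-roles}"
    find "$roles_dir" -type f \
        ! -name '*.yml' \
        ! -name '*.yaml' \
        ! -name '*.json' \
        ! -name '*.j2' \
        | sort
}
```

Run from the repository root as `list_unpackaged_role_files roles`. The output will also include tests, OWNERS files and similar entries, which would need to be filtered out by hand to arrive at a list like the one above.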

VannTen commented 1 week ago

Hum, maybe a better way to fix this is actually to do something like

Since, regardless of the extension, anything we put in files/ should be needed.