vectordotdev / helm-charts

Helm charts for Vector.
https://vector.dev
Mozilla Public License 2.0

Could not create subdirectory "k8s_logs" inside of data dir "/vector-data-dir": Permission denied (os error 13) #266

Open arve0 opened 1 year ago

arve0 commented 1 year ago

Hi! I get this error message on startup:

2022-11-07T14:01:32.136164Z ERROR vector::topology: Configuration error. error=Source "k8s_logs": Could not create subdirectory "k8s_logs" inside of data dir "/vector-data-dir": Permission denied (os error 13)

I use the following setup:

role: Agent

customConfig:
  data_dir: "/vector-data-dir"
  sources:
    k8s_logs:
      type: kubernetes_logs
  sinks:
    opensearch:
      type: elasticsearch
      endpoint: https://opensearch:9200
      inputs:
        - k8s_logs
      mode: bulk
      compression: none
      auth:
        strategy: basic
        user: xxxxx
        password: xxxxx
      tls:
        verify_certificate: false
        verify_hostname: false

I've tried adding an init container:

        - name: data-dir-permissions
          image: registry.access.redhat.com/ubi9
          command: ["bash", "-c", "set -x; id; ls -ld /vector-data-dir; chgrp -R 3000 /vector-data-dir; chmod g+rwx /vector-data-dir; ls -ld /vector-data-dir"]
          securityContext:
            privileged: true
          volumeMounts:
          - name: data
            mountPath: /vector-data-dir

and using uid/gid/fsGroup 3000 in vector:

      containers:
        - name: vector
          image: "timberio/vector:0.24.1-distroless-libc"
          securityContext:
            runAsUser: 3000
            runAsGroup: 3000
            fsGroup: 3000
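
(Note: fsGroup is a pod-level securityContext field, not a container-level one, and Kubernetes does not apply fsGroup ownership changes to hostPath volumes anyway; a rough sketch of where the fields would normally sit in a plain pod spec, values as placeholders:)

      securityContext:              # pod-level securityContext: fsGroup belongs here,
        fsGroup: 3000               # though it is not applied to hostPath volumes
      containers:
        - name: vector
          image: "timberio/vector:0.24.1-distroless-libc"
          securityContext:          # container-level securityContext: fsGroup is not valid here
            runAsUser: 3000
            runAsGroup: 3000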

But it still fails. Debugging the container:

❯ oc debug vector-fffwp --image=ubi9
Starting pod/vector-fffwp-debug ...
Pod IP: 10.128.2.53
If you don't see a command prompt, try pressing enter.
sh-5.1$ id 
uid=3000(3000) gid=3000 groups=3000
sh-5.1$ ls -ld /vector-data-dir/
drwxrwxr-x. 2 root 3000 6 Nov  7 13:43 /vector-data-dir/
sh-5.1$ mkdir -p /vector-data-dir/a
mkdir: cannot create directory '/vector-data-dir/a': Permission denied

Any ideas?

arve0 commented 1 year ago

Viewed from the host, the uid/gid seem correct:

❯ oc debug node/domstoltestocpin101
Starting pod/domstoltestocpin101-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.242.158.20
If you don't see a command prompt, try pressing enter.
sh-4.4# ls -ld /host/var/lib/vector
drwxrwxr-x. 2 3000 3000 6 Nov  7 13:43 /host/var/lib/vector
swartz-k commented 1 year ago

In my case, the error message is as below.

2023-01-11T06:56:35.919540Z ERROR vector::topology: Configuration error. error=Source "task_log": Could not create subdirectory "task_log" inside of data dir "/var/lib/vector/": Read-only file system (os error 30)

This is because of a volumeMount error in the PodSpec. Check whether your volumeMount has readOnly set, or post your pod YAML.

Source Code from here
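
For illustration, the data dir mount has to stay writable; a minimal sketch (volume name and path are placeholders):

          volumeMounts:
            - name: data
              mountPath: /var/lib/vector
              # readOnly: true here causes "Read-only file system (os error 30)" when the
              # source tries to create its subdirectory; omit it or set it to false
              readOnly: false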

spencergilbert commented 1 year ago

Trying to reproduce this today with the following config (updated for latest Helm and Vector versions):

role: Agent

service:
  enabled: false
serviceHeadless:
  enabled: false

customConfig:
  data_dir: "/vector-data-dir"
  sources:
    k8s_logs:
      type: kubernetes_logs
  sinks:
    opensearch:
      type: elasticsearch
      endpoint: https://opensearch:9200
      inputs:
        - k8s_logs
      mode: bulk
      bulk:
        index: "vector-%Y.%m.%d"
      compression: none
      auth:
        strategy: basic
        user: xxxxx
        password: xxxxx
      tls:
        verify_certificate: false
        verify_hostname: false

I don't see any error when running locally on colima:

❯ kubectl logs pod/vector-6zh9r
2023-03-09T14:23:54.444835Z  INFO vector::app: Internal log rate limit configured. internal_log_rate_secs=10
2023-03-09T14:23:54.448176Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-03-09T14:23:54.448602Z  INFO vector::app: Loading configs. paths=["/etc/vector"]
2023-03-09T14:23:54.499656Z  INFO source{component_kind="source" component_id=k8s_logs component_type=kubernetes_logs component_name=k8s_logs}: vector::sources::kubernetes_logs: Obtained Kubernetes Node name to collect logs for (self). self_node_name="colima"
2023-03-09T14:23:54.587269Z  INFO source{component_kind="source" component_id=k8s_logs component_type=kubernetes_logs component_name=k8s_logs}: vector::sources::kubernetes_logs: Excluding matching files. exclude_paths=["**/*.gz", "**/*.tmp"]
2023-03-09T14:23:54.589787Z  WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `endpoint`. Please use `endpoints` option instead.
2023-03-09T14:23:54.594123Z  WARN vector_core::tls::settings: The `verify_certificate` option is DISABLED, this may lead to security vulnerabilities.
2023-03-09T14:23:54.594898Z  WARN vector_core::tls::settings: The `verify_hostname` option is DISABLED, this may lead to security vulnerabilities.

I suspect this is due to restrictions imposed by OpenShift. Could you confirm you're still seeing this issue after upgrading to the latest version?

arve0 commented 1 year ago

I suspect this is due to restrictions imposed by OpenShift.

I can confirm that. After adding a SecurityContextConstraint with the correct permissions, it works.

Would you like me to contribute back the SecurityContextConstraint under a flag, say openshift: true?
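
A rough sketch of how that flag could look, purely hypothetical and not an existing chart option, with the SCC rendered only when it is set:

# values.yaml (hypothetical)
openshift: true

# templates/scc.yaml (hypothetical, trimmed to the gating pattern)
{{- if .Values.openshift }}
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: vector-scc
allowHostDirVolumePlugin: true
allowPrivilegedContainer: true
{{- end }}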

spencergilbert commented 1 year ago

I suspect this is due to restrictions imposed by OpenShift.

I can confirm that. After adding a SecurityContextConstraint with the correct permissions, it works.

Would you like me to contribute back the SecurityContextConstraint under a flag, say openshift: true?

That'd be great - I don't have much experience with OpenShift, but if that's a normal/expected resource to create in OpenShift clusters, that seems good.

Honken77 commented 9 months ago

I suspect this is due to restrictions imposed by OpenShift.

I can confirm that. After adding a SecurityContextConstraint with the correct permissions, it works.

Would you like me to contribute back the SecurityContextConstraint under a flag, say openshift: true?

What was the fix? I tried a custom privileged SCC and, for troubleshooting, set runAsUser to 0, but I still get the permission errors.

Edit: I had to set privileged: true in the container security context for it to work.

arve0 commented 9 months ago

Edit: I had to set privileged: true in the container security context for it to work.

Correct. I set it in the chart values:

securityContext:
  privileged: true

Then I added an SCC, Role, and RoleBinding on the side:

# vector needs privileged access to write to /var/lib/vector on the node.
# Only the initContainer uses privileged access; the vector container runs as uid/gid 3000.
---
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: privileged-and-hostpath
  annotations:
    kubernetes.io/description: |
      Copied from restricted. Additionally has allowHostDirVolumePlugin=true, volumes: hostPath,
      and allowPrivilegedContainer=true.
allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
allowedCapabilities: null
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups: []
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- hostPath
- persistentVolumeClaim
- projected
- secret
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: use-privileged-and-hostpath
rules:
  - apiGroups:
      - security.openshift.io
    resources:
      - securitycontextconstraints
    verbs:
      - use
    resourceNames:
      - privileged-and-hostpath
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: vector-can-use-privileged-and-hostpath
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: use-privileged-and-hostpath
subjects:
  - kind: ServiceAccount
    name: vector

I tried using SecurityContextConstraints.allowedCapabilities without allowPrivilegedContainer, but never got that working. I found that openshift-logging also uses allowPrivilegedContainer, so I settled for that.
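
For completeness, applying the extra manifests looks roughly like this; the file names and the vector namespace are placeholders, and the SCC itself is cluster-scoped, so it needs cluster-admin:

❯ oc apply -f scc-privileged-and-hostpath.yaml
❯ oc apply -n vector -f role-and-rolebinding.yaml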

jonasbartho commented 5 months ago

Hi,

Try to avoid setting privileged: true, because it essentially gives the vector pod root access to the underlying host.

Configure your SCC like this instead and remove privileged: true:

allowPrivilegeEscalation: false
allowPrivilegedContainer: false
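
Presumably the SCC also has to allow the CHOWN capability that the patch below adds, since capabilities not listed in allowedCapabilities are rejected at admission; a sketch of the relevant fields:

allowPrivilegeEscalation: false
allowPrivilegedContainer: false
allowedCapabilities:
  - CHOWN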

Then add this patch to your DaemonSet:

      - op: add
        path: "/spec/template/spec/containers/0/securityContext"
        value:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - CHOWN
            drop:
            - KILL
            - DAC_OVERRIDE
            - FOWNER
            - NET_BIND_SERVICE
            - FSETID
            - SETGID
            - SETUID
            - SETPCAP
          privileged: false
          seLinuxOptions:
            type: container_logwriter_t
          seccompProfile:
            type: RuntimeDefault
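
The snippet above is written as a JSON6902-style patch; one way to apply it with kustomize might look like this, assuming the rendered chart output is in vector.yaml and the DaemonSet is named vector:

# kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - vector.yaml                       # rendered chart manifests
patches:
  - path: securitycontext-patch.yaml  # the op/path/value patch shown above
    target:
      kind: DaemonSet
      name: vector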

I would also suggest applying this MachineConfig to the nodes where Vector is running (in my case, on all worker nodes):

variant: openshift
version: 4.14.0
metadata:
  name: 50-selinux-file-contexts-local
  labels:
    machineconfiguration.openshift.io/role: worker
storage:
  files:
    - path: /etc/selinux/targeted/contexts/files/file_contexts.local
      mode: 0644
      overwrite: true
      contents:
        inline: |
          /var/lib/vector(/.*)?    system_u:object_r:container_file_t:s0
systemd:
  units:
    - name: set-SELinux-context-local.service
      enabled: true
      contents: |-
        [Unit]
        Description=Set local SELinux file context for vector

        [Service]
        ExecStart=/bin/bash -c '/usr/bin/mkdir -p /var/lib/vector;restorecon -Rv /var/lib/vector'
        RemainAfterExit=yes
        Type=oneshot

        [Install]
        WantedBy=multi-user.target
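
For reference, the snippet above is a Butane config rather than a raw MachineConfig; one way to render and apply it, assuming the butane CLI is available and the file is saved as 50-selinux-file-contexts-local.bu:

❯ butane 50-selinux-file-contexts-local.bu -o 50-selinux-file-contexts-local.yaml
❯ oc apply -f 50-selinux-file-contexts-local.yaml

The Machine Config Operator then rolls the change out to the worker pool, typically with a rolling reboot.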