lenadroid / airflow-azure

File share volumes fail to mount #1

Open · hussaynv opened this issue 3 years ago

hussaynv commented 3 years ago

Hi

I'm wondering if you can help. I have been following the guide, but I seem to have an issue with the volumes not mounting when connecting to the file shares I have set up: the scheduler and web pods get stuck while they are being created.

This is the output from describing the scheduler pod:

Name:           scheduler-fb5579585-g56sx
Namespace:      airflow3
Priority:       0
Node:           aks-nodepool1-10959598-vmss000000/10.203.30.4
Start Time:     Fri, 12 Feb 2021 12:17:23 +0000
Labels:         component=scheduler
                pod-template-hash=fb5579585
                release=RELEASE-NAME
                tier=airflow
Annotations:    checksum/airflow-config: 6132d4c762bec566a83667e8a23486fcbc29157811f277b66e6568047f627c14
                checksum/metadata-secret: a3512f27fea8455cdddc51ef650052d74657bcaa16194d24b555417e312d43da
                checksum/pgbouncer-config-secret: da52bd1edfe820f0ddfacdebb20a4cc6407d296ee45bcb500a6407e2261a5ba2
                checksum/result-backend-secret: 4bd4a60ef60435fe29fc8135a43a436c0854074a228246c67a6e7488b138200f
                cluster-autoscaler.kubernetes.io/safe-to-evict: true
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/scheduler-fb5579585
Init Containers:
  run-airflow-migrations:
    Container ID:
    Image:         apache/airflow:1.10.10.1-alpha2-python3.6
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      bash
      -c
      airflow upgradedb || airflow db upgrade
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      AIRFLOW__CORE__FERNET_KEY:        <set to the key 'fernet-key' in secret 'fernet-key'>        Optional: false
      AIRFLOW__CORE__SQL_ALCHEMY_CONN:  <set to the key 'connection' in secret 'airflow-metadata'>  Optional: false
      AIRFLOW_CONN_AIRFLOW_DB:          <set to the key 'connection' in secret 'airflow-metadata'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from scheduler-serviceaccount-token-f2vgh (ro)
Containers:
  scheduler:
    Container ID:
    Image:         apache/airflow:1.10.10.1-alpha2-python3.6
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      scheduler
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Liveness:       exec [python -Wignore -c import os
os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'

from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.net import get_hostname
import sys

job = SchedulerJob.most_recent_job()
sys.exit(0 if job.is_alive() and job.hostname == get_hostname() else 1)
] delay=0s timeout=1s period=30s #success=1 #failure=10
    Environment:
      AIRFLOW__CORE__FERNET_KEY:        <set to the key 'fernet-key' in secret 'fernet-key'>        Optional: false
      AIRFLOW__CORE__SQL_ALCHEMY_CONN:  <set to the key 'connection' in secret 'airflow-metadata'>  Optional: false
      AIRFLOW_CONN_AIRFLOW_DB:          <set to the key 'connection' in secret 'airflow-metadata'>  Optional: false
    Mounts:
      /opt/airflow/airflow.cfg from config (ro,path="airflow.cfg")
      /opt/airflow/dags from dags-pv (rw)
      /opt/airflow/logs from logs-pv (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from scheduler-serviceaccount-token-f2vgh (ro)
  scheduler-gc:
    Container ID:
    Image:         apache/airflow:1.10.10.1-alpha2-python3.6
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      bash
      /clean-logs
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /opt/airflow/logs from logs-pv (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from scheduler-serviceaccount-token-f2vgh (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      airflow-config
    Optional:  false
  dags-pv:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  dags-pvc
    ReadOnly:   false
  logs-pv:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  logs-pvc
    ReadOnly:   false
  scheduler-serviceaccount-token-f2vgh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  scheduler-serviceaccount-token-f2vgh
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason       Age                   From                                        Message
  ----     ------       ----                  ----                                        -------
  Warning  FailedMount  45m (x15 over 5h41m)  kubelet, aks-nodepool1-10959598-vmss000000  MountVolume.MountDevice failed for volume "logs-pv" : rpc error: code = Internal desc = volume(fs-logs) mount "//testairflowpoc.file.core.windows.net/logshare" on "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/logs-pv/globalmount" failed with mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t cifs -o dir_mode=0777,file_mode=0777,uid=0,gid=0,mfsymlinks,cache=strict,nosharesock,actimeo=30,vers=3.0,<masked> //testairflowpoc.file.core.windows.net/logshare /var/lib/kubelet/plugins/kubernetes.io/csi/pv/logs-pv/globalmount
Output: mount error(13): Permission denied
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)
  Warning  FailedMount  31m (x68 over 5h37m)   kubelet, aks-nodepool1-10959598-vmss000000  Unable to attach or mount volumes: unmounted volumes=[logs-pv dags-pv], unattached volumes=[scheduler-serviceaccount-token-f2vgh logs-pv dags-pv config]: timed out waiting for the condition
  Warning  FailedMount  11m (x147 over 5h41m)  kubelet, aks-nodepool1-10959598-vmss000000  MountVolume.MountDevice failed for volume "dags-pv" : rpc error: code = Internal desc = volume(fs-dags) mount "//testairflowpoc.file.core.windows.net/dagshare" on "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/dags-pv/globalmount" failed with mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t cifs -o dir_mode=0777,file_mode=0777,uid=0,gid=0,mfsymlinks,cache=strict,nosharesock,vers=3.0,actimeo=30,<masked> //testairflowpoc.file.core.windows.net/dagshare /var/lib/kubelet/plugins/kubernetes.io/csi/pv/dags-pv/globalmount
Output: mount error(13): Permission denied
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)
  Warning  FailedMount  6m35s (x24 over 5h33m)  kubelet, aks-nodepool1-10959598-vmss000000  Unable to attach or mount volumes: unmounted volumes=[logs-pv dags-pv], unattached volumes=[logs-pv dags-pv config scheduler-serviceaccount-token-f2vgh]: timed out waiting for the condition
  Warning  FailedMount  63s (x153 over 5h41m)   kubelet, aks-nodepool1-10959598-vmss000000  MountVolume.MountDevice failed for volume "logs-pv" : rpc error: code = Internal desc = volume(fs-logs) mount "//testairflowpoc.file.core.windows.net/logshare" on "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/logs-pv/globalmount" failed with mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t cifs -o dir_mode=0777,file_mode=0777,uid=0,gid=0,mfsymlinks,cache=strict,nosharesock,vers=3.0,actimeo=30,<masked> //testairflowpoc.file.core.windows.net/logshare /var/lib/kubelet/plugins/kubernetes.io/csi/pv/logs-pv/globalmount
Output: mount error(13): Permission denied
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)

The volumes don't appear to mount and I'm not sure why. At the moment there are no DAG definitions or any other files in the shares; I'm not sure whether that makes any difference. From what I can see, the PVs and PVCs have been created, as well as the secret with the storage account name and account key, as described in the guide.

I don't know if I need to do anything on the file shares or the storage account to grant access, or if I am missing something. AKS version: 1.18.8.
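
In case it helps with debugging, here are a couple of checks that should narrow this down (a sketch, using the names from the pod description above). First, confirming the PVCs actually bound to the pre-created PVs:

kubectl get pvc -n airflow3
kubectl describe pv dags-pv logs-pv

Second, attempting the same CIFS mount by hand from one of the nodes (or any Linux box with cifs-utils) takes Kubernetes out of the picture; this uses the storage account name as the username and the account key (the $STORAGE_ACCOUNT_KEY variable from the guide) as the password:

sudo mkdir -p /mnt/test
sudo mount -t cifs //testairflowpoc.file.core.windows.net/logshare /mnt/test -o vers=3.0,username=testairflowpoc,password=$STORAGE_ACCOUNT_KEY,dir_mode=0777,file_mode=0777

If the manual mount also fails with mount error(13): Permission denied, the problem is in the credentials or share access rather than anything in the cluster.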

Any help would be great

hussaynv commented 3 years ago

I think I have worked out what was wrong in my case. I believe the secret containing the storage account name and storage account key listed here needs to be created slightly differently, as outlined here in the Azure docs:

kubectl create secret generic azure-secret --from-literal=azurestorageaccountname=$AKS_PERS_STORAGE_ACCOUNT_NAME --from-literal=azurestorageaccountkey=$STORAGE_KEY

rather than

kubectl create secret generic azure-secret --from-literal accountname=$STORAGE_ACCOUNT --from-literal accountkey=$STORAGE_ACCOUNT_KEY --type=Opaque

which is what the guide currently shows.
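
For reference, those variables come from the Azure docs walkthrough. Populating them looks roughly like this (a sketch: the storage account name is taken from the mount errors above, and the resource group variable is whatever group holds your storage account):

AKS_PERS_STORAGE_ACCOUNT_NAME=testairflowpoc
STORAGE_KEY=$(az storage account keys list --resource-group $AKS_PERS_RESOURCE_GROUP --account-name $AKS_PERS_STORAGE_ACCOUNT_NAME --query "[0].value" -o tsv)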

This was the only change I made to get it working in my setup, and now the scheduler and web server pods are running as expected.
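
To double-check that the recreated secret carries the key names the Azure Files CSI driver expects, describing it should list azurestorageaccountname and azurestorageaccountkey under Data, without printing the values:

kubectl describe secret azure-secret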