Failed to install Pixie #668

zainal-abidin-assegaf commented 1 year ago

px deploy fails with: Failed to get auth credentials: open /root/.pixie/auth.json: too many open files

Steps to reproduce:

  1. px deploy

Expected behavior: Pixie installs successfully.

core@manager-01 ~ $ sudo px deploy
Pixie CLI

Running Cluster Checks:
 ✔    Kernel version > 4.14.0 
 ✔    Cluster type is supported 
 ✔    K8s version > 1.16.0 
 ✔    Kubectl > 1.10.0 is present 
 ✔    User can create namespace 
 ✕    Cluster type is in list of known supported types  ERR: Cluster type is not in list of known supported cluster types. Please see: https://docs.px.dev/installing-pixie/requirements/
Some cluster checks failed. Pixie may not work properly on your cluster. Continue with deploy? (y/n) [y] : y
Installing Vizier version: 0.12.9
Generating YAMLs for Pixie
Deploying Pixie to the following cluster: kubernetes-the-hard-way
Is the cluster correct? (y/n) [y] : y
Found 7 nodes
 ✔    Installing OLM CRDs 
 ✔    Deploying OLM 
 ✔    Deploying Pixie OLM Namespace 
 ✔    Installing Vizier CRD 
 ✔    Deploying OLM Catalog 
 ✔    Deploying OLM Subscription 
 ✔    Creating namespace 
 ✔    Deploying Vizier 
 ✔    Waiting for Cloud Connector to come online 
Waiting for Pixie to pass healthcheck
 ✔    Wait for PEMs/Kelvin 
 ⠹    Wait for healthcheck 
Failed to get auth credentials: open /root/.pixie/auth.json: too many open files

Name:               worker-001.bnpb.go.id
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
Annotations:        csi.volume.kubernetes.io/nodeid:
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 21 Feb 2022 12:48:17 +0700
Taints:             <none>
Unschedulable:      false
  HolderIdentity:  worker-001.bnpb.go.id
  AcquireTime:     <unset>
  RenewTime:       Thu, 15 Dec 2022 13:07:14 +0700
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 15 Dec 2022 13:02:35 +0700   Mon, 12 Dec 2022 02:02:43 +0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 15 Dec 2022 13:02:35 +0700   Mon, 12 Dec 2022 02:02:43 +0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 15 Dec 2022 13:02:35 +0700   Mon, 12 Dec 2022 02:02:43 +0700   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 15 Dec 2022 13:02:35 +0700   Mon, 12 Dec 2022 02:02:46 +0700   KubeletReady                 kubelet is posting ready status
  Hostname:    worker-001.bnpb.go.id
  cpu:                8
  ephemeral-storage:  505527372Ki
  hugepages-2Mi:      0
  memory:             16383340Ki
  pods:               110
  cpu:                8
  ephemeral-storage:  465894025264
  hugepages-2Mi:      0
  memory:             16280940Ki
  pods:               110
System Info:
  Machine ID:                      e0635838c9144f4fb5ccc31b8e5e2f1e
  System UUID:                     4220591d-21b6-ae6f-f712-2e8c6f439004
  Boot ID:                         f686d433-406a-4d58-8132-99e091f179d9
  Kernel Version:                  5.15.77-flatcar
  OS Image:                        Flatcar Container Linux by Kinvolk 3374.2.1 (Oklo)
  Operating System:                linux
  Architecture:                    amd64
  Container Runtime Version:       containerd://1.6.8
  Kubelet Version:                 v1.23.4
  Kube-Proxy Version:              v1.23.4
Non-terminated Pods:               (18 in total)
  Namespace                        Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                        ----                                                               ------------  ----------  ---------------  -------------  ---
  default                          dnstools-66b47fd85b-kj8kd                                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         66d
  default                          pg-single-0                                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         192d
  inarisk                          inariskapiv3-6f85d8c9cb-bgtq8                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d18h
  inarisk                          inariskwebv4-66f488ffd4-fwbbn                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         236d
  inarisk                          pushalert-76848556c9-tlhkw                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         181d
  kube-system                      coredns-7cdff66f64-nv88n                                           100m (1%)     0 (0%)      70Mi (0%)        170Mi (1%)     181d
  kube-system                      fluentd-fvj62                                                      100m (1%)     0 (0%)      200Mi (1%)       200Mi (1%)     5d20h
  kube-system                      metrics-server-847dcc659d-7hh56                                    100m (1%)     0 (0%)      200Mi (1%)       0 (0%)         181d
  kubernetes-dashboard             kubernetes-dashboard-887c5ff5-4jsn2                                100m (1%)     2 (25%)     200Mi (1%)       200Mi (1%)     181d
  metallb-system                   speaker-v7fhd                                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         292d
  nfs-subdir-external-provisioner  nfs-subdir-external-provisioner-858494fb9-hqcpb                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         195d
  platform                         signoz-clickhouse-operator-5c5b6cbfd6-dzkv6                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  platform                         signoz-k8s-infra-otel-agent-7pxft                                  100m (1%)     0 (0%)      100Mi (0%)       0 (0%)         24h
  rook-ceph                        csi-cephfsplugin-mdtl7                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         296d
  rook-ceph                        csi-rbdplugin-84zpx                                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         296d
  rook-ceph                        rook-ceph-crashcollector-worker-001.bnpb.go.id-5f9798f8dd-c78sb    0 (0%)        0 (0%)      0 (0%)           0 (0%)         25m
  rook-ceph                        rook-ceph-mds-myfs-b-545bb57f8c-qmxg5                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         25m
  rook-ceph                        rook-ceph-osd-2-95b6dd85c-dfdnt                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         236d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                500m (6%)   2 (25%)
  memory             770Mi (4%)  570Mi (3%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:              <none>

aimichelle commented 1 year ago

Hey @4ss3g4f! We've seen this issue with our CLI before (https://github.com/pixie-io/pixie/issues/312). Would you mind temporarily increasing the file descriptor limit to see if that helps? For example: ulimit -n 10240
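
As a rough sketch of that suggestion: the snippet below checks the current open-file limits and raises the soft limit for the current shell only. The helper name `raise_nofile` is hypothetical and not part of the Pixie CLI. The limit applies to this shell and its children. Because the transcript runs px through sudo, whether the raised limit carries over may depend on the local sudo/PAM configuration (an assumption, not something confirmed in this thread), so the last comment shows one way to set it inside the root shell instead.

```shell
# Hypothetical helper (not part of px): raise the soft open-file limit
# for the current shell, capped at the hard limit.
raise_nofile() {
  nofile_target="$1"
  nofile_cur="$(ulimit -S -n)"
  nofile_hard="$(ulimit -H -n)"
  # Nothing to do if the soft limit is already high enough.
  if [ "$nofile_cur" = "unlimited" ] || [ "$nofile_cur" -ge "$nofile_target" ]; then
    return 0
  fi
  # An unprivileged shell cannot raise the soft limit past the hard limit.
  if [ "$nofile_hard" != "unlimited" ] && [ "$nofile_target" -gt "$nofile_hard" ]; then
    nofile_target="$nofile_hard"
  fi
  ulimit -S -n "$nofile_target"
}

echo "soft limit before: $(ulimit -S -n)"
raise_nofile 10240
echo "soft limit after:  $(ulimit -S -n)"

# Assumption: if sudo resets limits on this host, set the limit in the root shell:
#   sudo sh -c 'ulimit -n 10240 && px deploy'
```

If the "after" value is still below 10240, the hard limit is the bottleneck and would need to be raised by an administrator.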

zainal-abidin-assegaf commented 1 year ago

Hi @aimichelle, I still get an error:

core@localrepo ~ $ sudo px deploy
Pixie CLI

Running Cluster Checks:
 ✔    Kernel version > 4.14.0 
 ✔    Cluster type is supported 
 ✔    K8s version > 1.16.0 
 ✔    Kubectl > 1.10.0 is present 
 ✔    User can create namespace 
 ✕    Cluster type is in list of known supported types  ERR: Cluster type is not in list of known supported cluster types. Please see: https://docs.px.dev/installing-pixie/requirements/
Some cluster checks failed. Pixie may not work properly on your cluster. Continue with deploy? (y/n) [y] : y
Installing Vizier version: 0.12.9
Generating YAMLs for Pixie
Deploying Pixie to the following cluster: kubernetes-the-hard-way
Is the cluster correct? (y/n) [y] : y
Found 1 nodes
 ✔    Installing OLM CRDs 
 ✔    Deploying OLM 
 ✔    Deploying Pixie OLM Namespace 
 ✔    Installing Vizier CRD 
 ✔    Deploying OLM Catalog 
 ✔    Deploying OLM Subscription 
 ✔    Creating namespace 
 ✔    Deploying Vizier 
 ✔    Waiting for Cloud Connector to come online 
Waiting for Pixie to pass healthcheck
 ✔    Wait for PEMs/Kelvin 
 ✔    Wait for PEMs/Kelvin 
 ✕    Wait for healthcheck  ERR: timeout waiting for healthcheck  (it is possible that Pixie stabilized after the healthcheck timeout. To check if Pixie successfully deployed, run `px debug pods`)
Failed Pixie healthcheck error=timeout waiting for healthcheck  (it is possible that Pixie stabilized after the healthcheck timeout. To check if Pixie successfully deployed, run `px debug pods`)
core@localrepo ~ $ sudo px debug pods
Pixie CLI
Cluster ID : 02aeec08-6cf0-4e4d-b5c8-5d506ad15029
Could not fetch Vizier pods error=context deadline exceeded

zainal-abidin-assegaf commented 1 year ago

core@localrepo ~ $ sudo kubectl get all -n px-operator
NAME                                                                  READY   STATUS      RESTARTS   AGE
pod/5ff7d47213a8875e3f1827d728b149498a0fdef08ba74499866d7d51a4qpwsz   0/1     Completed   0          23m
pod/pixie-operator-index-k28cr                                        1/1     Running     0          23m
pod/vizier-operator-88fbf7f87-hvm5j                                   1/1     Running     0          22m

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service/pixie-operator-index   ClusterIP   <none>        50051/TCP   23m

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/vizier-operator   1/1     1            1           22m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/vizier-operator-88fbf7f87   1         1         1       22m

NAME                                                                        COMPLETIONS   DURATION   AGE
job.batch/5ff7d47213a8875e3f1827d728b149498a0fdef08ba74499866d7d51a4b0147   1/1           35s        23m
core@localrepo ~ $ sudo px debug pods
Pixie CLI
Cluster ID : 02aeec08-6cf0-4e4d-b5c8-5d506ad15029
Could not fetch Vizier pods error=context deadline exceeded
core@localrepo ~ $ sudo kubectl logs vizier-operator-88fbf7f87-hvm5j -n px-operator
time="2022-12-16T08:40:22Z" level=info msg="Starting manager"
time="2022-12-16T08:40:23Z" level=info msg="Reconciling Vizier..." req=pl/pixie
time="2022-12-16T08:40:23Z" level=info msg="Creating a new vizier instance"
time="2022-12-16T08:40:23Z" level=info msg="Starting a vizier deploy"
time="2022-12-16T08:40:23Z" level=info msg="Deploying Vizier configs and secrets"
time="2022-12-16T08:40:23Z" level=info msg="Generating certs"
time="2022-12-16T08:40:34Z" level=info msg="Deploying NATS"
time="2022-12-16T08:40:34Z" level=info msg="Deploying Vizier"
time="2022-12-16T08:41:54Z" level=info msg="Vizier deploy is complete"
time="2022-12-16T08:41:54Z" level=info msg="Reconciling Vizier..." req=pl/pixie
time="2022-12-16T08:41:54Z" level=info msg="Updating Vizier..."
time="2022-12-16T08:41:54Z" level=info msg="Checksums matched, no need to reconcile"
time="2022-12-16T08:41:54Z" level=info msg="Reconciling Vizier..." req=pl/pixie
time="2022-12-16T08:41:54Z" level=info msg="Updating Vizier..."
time="2022-12-16T08:41:54Z" level=info msg="Checksums matched, no need to reconcile"
time="2022-12-16T08:42:14Z" level=info msg="Reconciling Vizier..." req=pl/pixie
time="2022-12-16T08:42:14Z" level=info msg="Updating Vizier..."
time="2022-12-16T08:42:14Z" level=info msg="Checksums matched, no need to reconcile"
time="2022-12-16T08:43:54Z" level=info msg="Reconciling Vizier..." req=pl/pixie
time="2022-12-16T08:43:54Z" level=info msg="Updating Vizier..."
time="2022-12-16T08:43:54Z" level=info msg="Checksums matched, no need to reconcile"
time="2022-12-16T08:48:34Z" level=info msg="Reconciling Vizier..." req=pl/pixie
time="2022-12-16T08:48:34Z" level=info msg="Updating Vizier..."
time="2022-12-16T08:48:34Z" level=info msg="Checksums matched, no need to reconcile"
core@localrepo ~ $ 

zainal-abidin-assegaf commented 1 year ago

Does Pixie use liveness and readiness probes for its health check? If so, could we redirect the liveness and readiness probes so that the installation succeeds?

scomri commented 1 year ago

I also can't get px deploy to finish running. It always crashes after waiting for the health checks. I tried setting ulimit to 10240 and deleting auth.json, still without success.

Pixie logs:

[cloudshell-user@ip-10-2-93-38 ~]$ px collect-logs
Pixie CLI
WARN[0001] Failed to log pod: kelvin-76cd6f549c-zd5cn    error="container \"app\" in pod \"kelvin-76cd6f549c-zd5cn\" is waiting to start: PodInitializing"
WARN[0001] Failed to log pod: vizier-pem-djq78           error="container \"pem\" in pod \"vizier-pem-djq78\" is waiting to start: PodInitializing"
WARN[0001] Failed to log pod: vizier-pem-s8mhd           error="container \"pem\" in pod \"vizier-pem-s8mhd\" is waiting to start: PodInitializing"
WARN[0001] Failed to log pod: vizier-pem-xwhwf           error="container \"pem\" in pod \"vizier-pem-xwhwf\" is waiting to start: PodInitializing"
WARN[0002] Failed to log pod: vizier-query-broker-798754d8d9-9j9cr  error="container \"app\" in pod \"vizier-query-broker-798754d8d9-9j9cr\" is waiting to start: PodInitializing"
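
The PodInitializing warnings above mean the main containers have not started yet, so there are no application logs to collect. A minimal sketch of how one might look for the blocker follows. The helper name `show_init_status` is hypothetical, and the default namespace `pl` is an assumption taken from the `req=pl/pixie` entries in the operator log earlier in this thread; adjust it if Vizier was deployed elsewhere.

```shell
# Hypothetical helper: list pods and recent events in the Vizier namespace
# to see what is holding pods in PodInitializing.
show_init_status() {
  ns="${1:-pl}"   # assumption: Vizier runs in the "pl" namespace
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found on PATH" >&2
    return 127
  fi
  # Pod status and node placement at a glance.
  kubectl get pods -n "$ns" -o wide
  # Events, oldest first; these often name the init container or image pull
  # that has not completed.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}

# Usage (not run here): show_init_status pl
```

For a single pod, `kubectl describe pod <pod-name> -n pl` shows the state of each init container.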

jberryman commented 16 hours ago

FWIW this resolved the mystery issue I was having: https://github.com/pixie-io/pixie/issues/2006#issuecomment-2332063599