microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

grafana exits abnormally #5069

Open zerozakiihitoshiki opened 3 years ago

zerozakiihitoshiki commented 3 years ago

Short summary about the issue/question: Dashboard is not accessible on the OpenPAI webportal.

Brief what process you are following:

  1. When I access the webportal, the dashboard cannot be reached.
  2. I run `kubectl get pods` to check grafana and get:
    e78eafb722f00cc8ab251d805c73cee3-taskrole-0   1/1     Running            0          10m
    fluentd-ds-hnkb8                              1/1     Running            0          10d
    frameworkcontroller-sts-0                     1/1     Running            0          10d
    grafana-7d7c5b46fd-j67gz                      0/1     CrashLoopBackOff   14         62m
    hivedscheduler-ds-default-0                   1/1     Running            0          10d
    hivedscheduler-hs-0                           1/1     Running            0          10d
    internal-storage-create-ds-p5644              1/1     Running            14         10d
    job-exporter-7xr7j                            1/1     Running            28         9d
    log-manager-ds-kx4tt                          2/2     Running            0          10d
    node-exporter-fcpwf                           1/1     Running            0          9d
    postgresql-ds-lpdwm                           2/2     Running            0          10d
    prometheus-deployment-c6955f49c-kmkw5         1/1     Running            0          9d
    pylon-ds-2cgcs                                1/1     Running            0          9d
    rest-server-ds-sc5cv                          1/1     Running            0          9d
    watchdog-649bd8998c-dhgm5                     1/1     Running            0          10d
    webportal-ds-7c5f2                            1/1     Running            0          9d
  3. I run `kubectl describe pod grafana-7d7c5b46fd-j67gz` and get:
    Name:           grafana-7d7c5b46fd-j67gz
    Namespace:      default
    Priority:       0
    Node:           openpai-master-01/172.168.3.101
    Start Time:     Mon, 09 Nov 2020 16:09:25 +0800
    Labels:         app=grafana
                pod-template-hash=7d7c5b46fd
    Annotations:    <none>
    Status:         Running
    IP:             172.168.x.y
    Controlled By:  ReplicaSet/grafana-7d7c5b46fd
    Containers:
    grafana:
    Container ID:   docker://ac3ebe69fde4b7993fe0d6ca97d3a1ae40314c918574c703f631e1250f631834
    Image:          openpai/grafana:v1.0.1
    Image ID:       docker-pullable://openpai/grafana@sha256:681fa4e661e77c57c65eb1f3db09419399f7b6c1d2c83b9545403ad1af311f10
    Port:           3000/TCP
    Host Port:      3000/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 09 Nov 2020 17:08:10 +0800
      Finished:     Mon, 09 Nov 2020 17:09:01 +0800
    Ready:          False
    Restart Count:  14
    Environment:
      GRAFANA_URL:                http://172.168.x.y:3000
      GF_AUTH_ANONYMOUS_ENABLED:  true
    Mounts:
      /grafana-configuration from grafana-confg-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-58lnf (ro)
    Conditions:
    Type              Status
    Initialized       True
    Ready             False
    ContainersReady   False
    PodScheduled      True
    Volumes:
    grafana-confg-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      grafana-configuration
    Optional:  false
    default-token-58lnf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-58lnf
    Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
    Type     Reason   Age                    From                        Message
    ----     ------   ----                   ----                        -------
    Normal   Created  60m (x4 over 63m)      kubelet, openpai-master-01  Created container grafana
    Normal   Started  60m (x4 over 63m)      kubelet, openpai-master-01  Started container grafana
    Normal   Pulling  58m (x5 over 63m)      kubelet, openpai-master-01  Pulling image "openpai/grafana:v1.0.1"
    Normal   Pulled   58m (x5 over 63m)      kubelet, openpai-master-01  Successfully pulled image "openpai/grafana:v1.0.1"
    Warning  BackOff  3m19s (x218 over 61m)  kubelet, openpai-master-01  Back-off restarting failed container
  4. The grafana container log:
    
    Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X GET -k http://172.168.x.y:3000/api/datasources
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 172.168.x.y port 3000: Connection refused
    Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X GET -k http://172.168.x.y:3000/api/datasources
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 172.168.x.y port 3000: Connection refused

Restart grafana after installing plugins.

t=2020-11-09T09:14:11+0000 lvl=info msg="Starting Grafana" logger=server version=4.6.3 commit=7a06a47 compiled=2017-12-14T08:36:59+0000 t=2020-11-09T09:14:11+0000 lvl=info msg="Config loaded from" logger=settings file=/usr/share/grafana/conf/defaults.ini t=2020-11-09T09:14:11+0000 lvl=info msg="Config loaded from" logger=settings file=/etc/grafana/grafana.ini t=2020-11-09T09:14:11+0000 lvl=info msg="Config overridden from command line" logger=settings arg="default.paths.data=/var/lib/grafana" t=2020-11-09T09:14:11+0000 lvl=info msg="Config overridden from command line" logger=settings arg="default.paths.logs=/var/log/grafana" t=2020-11-09T09:14:11+0000 lvl=info msg="Config overridden from command line" logger=settings arg="default.paths.plugins=/grafana-plugins" t=2020-11-09T09:14:11+0000 lvl=info msg="Config overridden from Environment variable" logger=settings var="GF_PATHS_DATA=/var/lib/grafana" t=2020-11-09T09:14:11+0000 lvl=info msg="Config overridden from Environment variable" logger=settings var="GF_PATHS_LOGS=/var/log/grafana" t=2020-11-09T09:14:11+0000 lvl=info msg="Config overridden from Environment variable" logger=settings var="GF_AUTH_ANONYMOUS_ENABLED=true" t=2020-11-09T09:14:11+0000 lvl=info msg="Path Home" logger=settings path=/usr/share/grafana t=2020-11-09T09:14:11+0000 lvl=info msg="Path Data" logger=settings path=/var/lib/grafana t=2020-11-09T09:14:11+0000 lvl=info msg="Path Logs" logger=settings path=/var/log/grafana t=2020-11-09T09:14:11+0000 lvl=info msg="Path Plugins" logger=settings path=/grafana-plugins t=2020-11-09T09:14:11+0000 lvl=info msg="App mode production" logger=settings t=2020-11-09T09:14:11+0000 lvl=info msg="Initializing DB" logger=sqlstore dbtype=sqlite3 t=2020-11-09T09:14:11+0000 lvl=info msg="Starting DB migration" logger=migrator t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create migration_log table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator 
id="create user table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index user.login" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index user.email" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop index UQE_user_login - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop index UQE_user_email - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Rename table user to user_v1 - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create user table v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_user_login - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_user_email - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="copy data_source v1 to v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Drop old table user_v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add column help_flags1 to user table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update user table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add last_seen_at column to user" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create temp user table v1-7" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index IDX_temp_user_email - v1-7" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index IDX_temp_user_org_id - v1-7" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index IDX_temp_user_code - v1-7" t=2020-11-09T09:14:11+0000 
lvl=info msg="Executing migration" logger=migrator id="create index IDX_temp_user_status - v1-7" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update temp_user table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create star table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index star.user_id_dashboard_id" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create org table v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_org_name - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create org_user table v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index IDX_org_user_org_id - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_org_user_org_id_user_id - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="copy data account to org" t=2020-11-09T09:14:11+0000 lvl=info msg="Skipping migration condition not fulfilled" logger=migrator id="copy data account to org" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="copy data account_user to org_user" t=2020-11-09T09:14:11+0000 lvl=info msg="Skipping migration condition not fulfilled" logger=migrator id="copy data account_user to org_user" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Drop old table account" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Drop old table account_user" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update org table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update org_user table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing 
migration" logger=migrator id="create dashboard table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index dashboard.account_id" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index dashboard_account_id_slug" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create dashboard_tag table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index dashboard_tag.dasboard_id_term" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop index UQE_dashboard_tag_dashboard_id_term - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Rename table dashboard to dashboard_v1 - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create dashboard v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index IDX_dashboard_org_id - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_dashboard_org_id_slug - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="copy dashboard v1 to v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop table dashboard_v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="alter dashboard.data to mediumtext v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add column updated_by in dashboard - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add column created_by in dashboard - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add column gnetId in dashboard" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add index for gnetId in dashboard" t=2020-11-09T09:14:11+0000 lvl=info 
msg="Executing migration" logger=migrator id="Add column plugin_id in dashboard" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add index for plugin_id in dashboard" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add index for dashboard_id in dashboard_tag" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update dashboard table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update dashboard_tag table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create data_source table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index data_source.account_id" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index data_source.account_id_name" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop index IDX_data_source_account_id - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop index UQE_data_source_account_id_name - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Rename table data_source to data_source_v1 - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create data_source table v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index IDX_data_source_org_id - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_data_source_org_id_name - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="copy data_source v1 to v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Drop old table data_source_v1 #2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add column with_credentials" 
t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add secure json data column" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update data_source table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create api_key table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index api_key.account_id" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index api_key.key" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index api_key.account_id_name" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop index IDX_api_key_account_id - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop index UQE_api_key_key - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop index UQE_api_key_account_id_name - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Rename table api_key to api_key_v1 - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create api_key table v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index IDX_api_key_org_id - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_api_key_key - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_api_key_org_id_name - v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="copy api_key v1 to v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Drop old table api_key_v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update api_key table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing 
migration" logger=migrator id="create dashboard_snapshot table v4" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop table dashboard_snapshot_v4 #1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create dashboard_snapshot table v5 #2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_dashboard_snapshot_key - v5" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_dashboard_snapshot_delete_key - v5" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index IDX_dashboard_snapshot_user_id - v5" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="alter dashboard_snapshot to mediumtext v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update dashboard_snapshot table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create quota table v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_quota_org_id_user_id_target - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update quota table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create plugin_setting table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create index UQE_plugin_setting_org_id_plugin_id - v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add column plugin_version to plugin_settings" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update plugin_setting table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create session table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Drop old table 
playlist table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Drop old table playlist_item table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create playlist table v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create playlist item table v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update playlist table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update playlist_item table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop preferences table v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="drop preferences table v3" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create preferences table v3" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update preferences table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create alert table v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index alert org_id & id " t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index alert state" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index alert dashboard_id" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create alert_notification table v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add column is_default" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index alert_notification org_id & name" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update alert table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" 
logger=migrator id="Update alert_notification table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Drop old annotation table v4" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create annotation table v5" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index annotation 0 v3" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index annotation 1 v3" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index annotation 2 v3" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index annotation 3 v3" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index annotation 4 v3" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update annotation table charset" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add column region_id to annotation table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Drop category_id index" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add column tags to annotation table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Create annotation_tag table v2" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Add unique index annotation_tag.annotation_id_tag_id" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Update alert annotations and set TEXT to empty" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create test_data table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create dashboard_version table v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index 
dashboard_version.dashboard_id" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index dashboard_version.dashboard_id and dashboard_version.version" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="Set dashboard version to 1 where 0" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="save existing dashboard data in dashboard_version table v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="alter dashboard_version.data to mediumtext v1" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="create tag table" t=2020-11-09T09:14:11+0000 lvl=info msg="Executing migration" logger=migrator id="add index tag.key_value" t=2020-11-09T09:14:11+0000 lvl=info msg="Created default admin user: admin" t=2020-11-09T09:14:11+0000 lvl=info msg="Starting plugin search" logger=plugins t=2020-11-09T09:14:11+0000 lvl=warn msg="Plugin dir does not exist" logger=plugins dir=/grafana-plugins t=2020-11-09T09:14:11+0000 lvl=info msg="Plugin dir created" logger=plugins dir=/grafana-plugins t=2020-11-09T09:14:12+0000 lvl=info msg="Initializing Alerting" logger=alerting.engine t=2020-11-09T09:14:12+0000 lvl=info msg="Initializing CleanUpService" logger=cleanup t=2020-11-09T09:14:12+0000 lvl=info msg="Initializing Stream Manager" t=2020-11-09T09:14:12+0000 lvl=info msg="Initializing HTTP Server" logger=http.server address=0.0.0.0:3000 protocol=http subUrl= socket= Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X GET -k http://172.168.x.y:3000/api/datasources % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 2 100 2 0 0 37 0 --:--:-- --:--:-- --:--:-- 37 []Installing datasource /usr/local/grafana/datasources/prom-datasource.json Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X 
POST -k http://172.168.x.y:3000/api/datasources --data @/usr/local/grafana/datasources/prom-datasource.json % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 179 100 49 100 130 935 2481 --:--:-- --:--:-- --:--:-- 2500 {"id":1,"message":"Datasource added","name":"PM"}installed ok Installing dashboard /usr/local/grafana/dashboards/gpu.js Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X POST -k http://172.168.x.y:3000/api/dashboards/import --data @/usr/local/grafana/dashboards/gpu.js % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed t=2020-11-09T09:14:14+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=POST path=/api/dashboards/import status=400 remote_addr=172.168.x.y time_ms=50 size=108 referer= 100 6441 100 108 100 6333 2092 119k --:--:-- --:--:-- --:--:-- 118k [{"classification":"DeserializationError","message":"invalid character '/' looking for beginning of value"}]installed ok Installing dashboard /usr/local/grafana/dashboards/pai-clusterview-dashboard.json Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X POST -k http://172.168.x.y:3000/api/dashboards/import --data @/usr/local/grafana/dashboards/pai-clusterview-dashboard.json % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 7469 100 179 100 7290 3147 125k --:--:-- --:--:-- --:--:-- 127k {"pluginId":"","title":"PAI_ClusterView","imported":true,"importedUri":"db/pai_clusterview","slug":"","importedRevision":1,"revision":1,"description":"","path":"","removed":false}installed ok Installing dashboard /usr/local/grafana/dashboards/pai-jobview-dashboard.json Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X POST -k http://172.168.x.y:3000/api/dashboards/import --data 
@/usr/local/grafana/dashboards/pai-jobview-dashboard.json % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 9249 100 179 100 9070 3126 154k --:--:-- --:--:-- --:--:-- 158k {"pluginId":"","title":"JobLevelMetrics","imported":true,"importedUri":"db/joblevelmetrics","slug":"","importedRevision":1,"revision":1,"description":"","path":"","removed":false}installed ok Installing dashboard /usr/local/grafana/dashboards/pai-nodeview-dashboard.json Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X POST -k http://172.168.x.y:3000/api/dashboards/import --data @/usr/local/grafana/dashboards/pai-nodeview-dashboard.json % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 12009 100 173 100 11836 3080 205k --:--:-- --:--:-- --:--:-- 210k {"pluginId":"","title":"PAI_NodeView","imported":true,"importedUri":"db/pai_nodeview","slug":"","importedRevision":1,"revision":1,"description":"","path":"","removed":false}installed ok Installing dashboard /usr/local/grafana/dashboards/pai-serviceview-dashboard.json Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X POST -k http://172.168.x.y:3000/api/dashboards/import --data @/usr/local/grafana/dashboards/pai-serviceview-dashboard.json % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 7077 100 183 100 6894 3016 110k --:--:-- --:--:-- --:--:-- 112k {"pluginId":"","title":"PaiServiceMetrics","imported":true,"importedUri":"db/paiservicemetrics","slug":"","importedRevision":1,"revision":1,"description":"","path":"","removed":false}installed ok Installing dashboard /usr/local/grafana/dashboards/pai-taskroleview-dashboard.json Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X POST -k http://172.168.x.y:3000/api/dashboards/import --data 
@/usr/local/grafana/dashboards/pai-taskroleview-dashboard.json % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 9972 100 189 100 9783 3306 167k --:--:-- --:--:-- --:--:-- 170k {"pluginId":"","title":"TaskRoleLevelMetrics","imported":true,"importedUri":"db/taskrolelevelmetrics","slug":"","importedRevision":1,"revision":1,"description":"","path":"","removed":false}installed ok Installing dashboard /usr/local/grafana/dashboards/pai-tasks-in-node-dashboard.json Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X POST -k http://172.168.x.y:3000/api/dashboards/import --data @/usr/local/grafana/dashboards/pai-tasks-in-node-dashboard.json % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 11622 100 175 100 11447 2967 189k --:--:-- --:--:-- --:--:-- 192k {"pluginId":"","title":"Tasks in Node","imported":true,"importedUri":"db/tasks-in-node","slug":"","importedRevision":1,"revision":1,"description":"","path":"","removed":false}installed ok Installing dashboard /usr/local/grafana/dashboards/pai-taskview-dashboard.json Running curl -k -u admin:admin -H "Accept: application/json" -H "Content-Type: application/json" -X POST -k http://172.168.x.y:3000/api/dashboards/import --data @/usr/local/grafana/dashboards/pai-taskview-dashboard.json % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 12623 100 181 100 12442 3081 206k --:--:-- --:--:-- --:--:-- 209k {"pluginId":"","title":"TaskLevelMetrics","imported":true,"importedUri":"db/tasklevelmetrics","slug":"","importedRevision":1,"revision":1,"description":"","path":"","removed":false}installed ok

  5. Memory:

              total        used        free      shared  buff/cache   available
    Mem:       125G        9.0G         86G        505M         29G        114G
    Swap:        0B          0B          0B
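The triage above (find the failing pod, then describe it and pull its logs) can be sketched as a small filter. As a hedged illustration, it is fed a sample of the `kubectl get pods` output from step 2 so it runs without a cluster:

```shell
# List pods whose STATUS column is not "Running" (skip the header line).
not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}

sample='NAME                                          READY   STATUS             RESTARTS   AGE
grafana-7d7c5b46fd-j67gz                      0/1     CrashLoopBackOff   14         62m
pylon-ds-2cgcs                                1/1     Running            0          9d'

echo "$sample" | not_running   # prints: grafana-7d7c5b46fd-j67gz
```

Against a live cluster, the same flow would be `kubectl get pods | not_running`, then `kubectl describe pod <name>` and `kubectl logs <name> --previous` for the crashed container's last output.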

**How to reproduce it**:


**OpenPAI Environment**: 
- OpenPAI version: v1.0.1
- Cloud provider or hardware configuration: Physical host
- OS (e.g. from /etc/os-release): Ubuntu 16.04
- Kernel (e.g. `uname -a`): GNU/Linux 4.4.0-190-generic x86_64
- Hardware (e.g. core number, memory size, storage size, GPU type etc.):
- Others:

**Anything else we need to know**:
I tried deleting grafana and recreating it, but the error still exists:

```
./paictl.py service stop -n grafana
./paictl.py service start -n grafana
```
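Restarting the service alone will not surface the crash reason; inspecting the pod usually does. A minimal sketch of the triage steps, with a sample of the `kubectl get pods` output above embedded so the filter is runnable here (on a live cluster you would pipe the real command instead):

```shell
# Sample `kubectl get pods` output, embedded for illustration only.
sample='grafana-7d7c5b46fd-j67gz   0/1   CrashLoopBackOff   14   62m
webportal-ds-7c5f2                 1/1   Running            0    9d'

# Print every pod whose STATUS column is not "Running".
echo "$sample" | awk '$3 != "Running" {print $1, $3}'

# On the cluster itself (not runnable here), the crash reason is usually in:
#   kubectl logs grafana-7d7c5b46fd-j67gz --previous        # logs of the crashed container
#   kubectl describe pod grafana-7d7c5b46fd-j67gz           # see "Last State" and "Events"
```

The `--previous` flag matters for CrashLoopBackOff: plain `kubectl logs` shows the freshly restarted container, which may not have failed yet.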

Binyang2014 commented 3 years ago

@zerozakiihitoshiki The logs look fine. We saw that several pods, such as internal-storage-create-ds-p5644 and job-exporter-7xr7j, have restarted many times. Is there anything special about your machine?

zerozakiihitoshiki commented 3 years ago

> @zerozakiihitoshiki The logs look fine. We saw that several pods, such as internal-storage-create-ds-p5644 and job-exporter-7xr7j, have restarted many times. Is there anything special about your machine?

I guess these pod restarts may be due to the jobs I submit. I submitted a training job with the following configuration:

```yaml
protocolVersion: 2
name: pytorch_hdfs03
type: job
jobRetryCount: 0
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
      minSucceededInstances: -1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 4
      memoryMB: 8192
    commands:
      - pip install hdfs &>> storage_plugin.log
      - touch ~/.hdfscli.cfg
      - 'echo ''[dev.alias]'' >> ~/.hdfscli.cfg'
      - 'echo ''url = http://172.168.x.y:50070'' >> ~/.hdfscli.cfg'
      - echo 'user = xxx' >> ~/.hdfscli.cfg
      - mkdir --parents /user
      - hdfscli download --alias=dev /admin/pytorch_101 /user
      - cd /user/pytorch_101
      - bash init.sh
      - python cifar.py --gpuid 0 --arch ResNet18 --epoch 200
      - hdfscli upload --alias=dev /user/pytorch_101 /admin/pytorch_101
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```

This job restarted 8 times during training and failed in the end. I checked the exit diagnostics, and they show that the exit code is 137. I have tried submitting this job many times, but the result is the same. I wonder whether this error is related to the memory capacity of the host, but the machine seems to have plenty of memory.
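For reference, an exit code above 128 is the shell's encoding of "killed by a signal": 128 plus the signal number. So 137 decodes to signal 9 (SIGKILL), which is what the kernel OOM killer or a cgroup memory limit sends. Note that the `memoryMB: 8192` limit in the job config caps each container well below the host's 125G, so free host memory does not rule out a per-container OOM kill. A small sketch of the arithmetic:

```shell
# Decode a container exit code: values above 128 mean "terminated by signal".
code=137
sig=$((code - 128))
echo "exit $code => killed by signal $sig"   # prints: exit 137 => killed by signal 9

# On the node itself (not runnable here), kernel OOM kills are visible in:
#   dmesg -T | grep -i 'killed process'
```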

zerozakiihitoshiki commented 3 years ago

In addition, I would like to know whether an administrator can customize PAI environment variables for users.

Binyang2014 commented 3 years ago

@zerozakiihitoshiki Can you provide the kubelet logs from the master node? You can get them by running `journalctl -u kubelet`.

To customize the PAI environment: if you know how to build a Docker image, you can provide your own customized runtime plugin. Here are some examples: https://github.com/microsoft/openpai-runtime/tree/master/src/plugins.

You need to write the plugin, rebuild the runtime image, and change the rest-server config to use your own runtime. For further issues about openpai-runtime, you can create an issue in that repo: https://github.com/microsoft/openpai-runtime

zerozakiihitoshiki commented 3 years ago

@Binyang2014

xxx@openpai-master-01:~$ sudo journalctl -u kubelet

```
-- Logs begin at Sun 2020-11-01 20:46:21 CST, end at Tue 2020-11-17 15:45:58 CST. --
Nov 01 20:46:21 openpai-master-01 kubelet[156802]: I1101 20:46:21.251895  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:46:24 openpai-master-01 kubelet[156802]: E1101 20:46:24.752430  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:46:28 openpai-master-01 kubelet[156802]: I1101 20:46:28.034053  156802 kubelet_getters.go:177] status for pod kube-apiserver-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020
Nov 01 20:46:28 openpai-master-01 kubelet[156802]: I1101 20:46:28.035515  156802 kubelet_getters.go:177] status for pod kube-scheduler-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020
Nov 01 20:46:28 openpai-master-01 kubelet[156802]: I1101 20:46:28.036373  156802 kubelet_getters.go:177] status for pod kube-controller-manager-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000
Nov 01 20:46:31 openpai-master-01 kubelet[156802]: I1101 20:46:31.276766  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:46:37 openpai-master-01 kubelet[156802]: E1101 20:46:37.750536  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:46:41 openpai-master-01 kubelet[156802]: I1101 20:46:41.294971  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:46:49 openpai-master-01 kubelet[156802]: E1101 20:46:49.750372  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:46:51 openpai-master-01 kubelet[156802]: I1101 20:46:51.317909  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:47:01 openpai-master-01 kubelet[156802]: I1101 20:47:01.336430  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:47:01 openpai-master-01 kubelet[156802]: E1101 20:47:01.751964  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:47:04 openpai-master-01 kubelet[156802]: I1101 20:47:04.577776  156802 endpoint.go:111] State pushed for device plugin github.com/fuse
Nov 01 20:47:11 openpai-master-01 kubelet[156802]: I1101 20:47:11.358836  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:47:14 openpai-master-01 kubelet[156802]: E1101 20:47:14.753529  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:47:21 openpai-master-01 kubelet[156802]: I1101 20:47:21.377454  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:47:28 openpai-master-01 kubelet[156802]: I1101 20:47:28.037332  156802 kubelet_getters.go:177] status for pod kube-apiserver-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020
Nov 01 20:47:28 openpai-master-01 kubelet[156802]: I1101 20:47:28.037536  156802 kubelet_getters.go:177] status for pod kube-scheduler-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020
Nov 01 20:47:28 openpai-master-01 kubelet[156802]: I1101 20:47:28.037587  156802 kubelet_getters.go:177] status for pod kube-controller-manager-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000
Nov 01 20:47:29 openpai-master-01 kubelet[156802]: E1101 20:47:29.750447  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:47:31 openpai-master-01 kubelet[156802]: I1101 20:47:31.396447  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:47:41 openpai-master-01 kubelet[156802]: I1101 20:47:41.413406  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:47:41 openpai-master-01 kubelet[156802]: E1101 20:47:41.750410  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:47:51 openpai-master-01 kubelet[156802]: I1101 20:47:51.430966  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:47:54 openpai-master-01 kubelet[156802]: E1101 20:47:54.751416  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:48:01 openpai-master-01 kubelet[156802]: I1101 20:48:01.449858  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:48:04 openpai-master-01 kubelet[156802]: I1101 20:48:04.577861  156802 endpoint.go:111] State pushed for device plugin github.com/fuse
Nov 01 20:48:08 openpai-master-01 kubelet[156802]: I1101 20:48:08.752661  156802 provider.go:124] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
Nov 01 20:48:11 openpai-master-01 kubelet[156802]: I1101 20:48:11.468427  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:48:15 openpai-master-01 kubelet[156802]: I1101 20:48:14.969830  156802 kube_docker_client.go:345] Stop pulling image "openpai/grafana:v1.0.1": "Status: Image is up to date for openpai/grafana:v1.0.1"
Nov 01 20:48:15 openpai-master-01 kubelet[156802]: I1101 20:48:15.731066  156802 kubelet.go:1933] SyncLoop (PLEG): "grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47c)", event: &pleg.PodLifecycleEvent{ID:"a2
Nov 01 20:48:21 openpai-master-01 kubelet[156802]: I1101 20:48:21.488178  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:48:28 openpai-master-01 kubelet[156802]: I1101 20:48:28.038078  156802 kubelet_getters.go:177] status for pod kube-apiserver-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020
Nov 01 20:48:28 openpai-master-01 kubelet[156802]: I1101 20:48:28.038256  156802 kubelet_getters.go:177] status for pod kube-scheduler-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020
Nov 01 20:48:28 openpai-master-01 kubelet[156802]: I1101 20:48:28.038305  156802 kubelet_getters.go:177] status for pod kube-controller-manager-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000
Nov 01 20:48:31 openpai-master-01 kubelet[156802]: I1101 20:48:31.507820  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:48:41 openpai-master-01 kubelet[156802]: I1101 20:48:41.524993  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:48:51 openpai-master-01 kubelet[156802]: I1101 20:48:51.543281  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:49:01 openpai-master-01 kubelet[156802]: I1101 20:49:01.564321  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:49:02 openpai-master-01 kubelet[156802]: I1101 20:49:02.321856  156802 kubelet.go:1933] SyncLoop (PLEG): "grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47c)", event: &pleg.PodLifecycleEvent{ID:"a2
Nov 01 20:49:02 openpai-master-01 kubelet[156802]: E1101 20:49:02.322703  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:49:04 openpai-master-01 kubelet[156802]: I1101 20:49:04.577862  156802 endpoint.go:111] State pushed for device plugin github.com/fuse
Nov 01 20:49:11 openpai-master-01 kubelet[156802]: I1101 20:49:11.582139  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:49:14 openpai-master-01 kubelet[156802]: E1101 20:49:14.751763  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:49:21 openpai-master-01 kubelet[156802]: I1101 20:49:21.601439  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:49:28 openpai-master-01 kubelet[156802]: I1101 20:49:28.038560  156802 kubelet_getters.go:177] status for pod kube-apiserver-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020
Nov 01 20:49:28 openpai-master-01 kubelet[156802]: I1101 20:49:28.038747  156802 kubelet_getters.go:177] status for pod kube-scheduler-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020
Nov 01 20:49:28 openpai-master-01 kubelet[156802]: I1101 20:49:28.038798  156802 kubelet_getters.go:177] status for pod kube-controller-manager-openpai-master-01 updated to {Running [{Initialized True 0001-01-01 00:00:00 +0000
Nov 01 20:49:28 openpai-master-01 kubelet[156802]: E1101 20:49:28.750483  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:49:31 openpai-master-01 kubelet[156802]: I1101 20:49:31.620444  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:49:41 openpai-master-01 kubelet[156802]: I1101 20:49:41.639127  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:49:43 openpai-master-01 kubelet[156802]: E1101 20:49:43.752644  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:49:51 openpai-master-01 kubelet[156802]: I1101 20:49:51.656898  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:49:54 openpai-master-01 kubelet[156802]: E1101 20:49:54.751664  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
Nov 01 20:50:01 openpai-master-01 kubelet[156802]: I1101 20:50:01.680987  156802 setters.go:73] Using node IP: "172.168.x.y"
Nov 01 20:50:04 openpai-master-01 kubelet[156802]: I1101 20:50:04.578863  156802 endpoint.go:111] State pushed for device plugin github.com/fuse
Nov 01 20:50:09 openpai-master-01 kubelet[156802]: E1101 20:50:09.751389  156802 pod_workers.go:190] Error syncing pod a29a2bb5-9a4e-456e-9489-d1422a1df47c ("grafana-7d7c5b46fd-4h7sl_default(a29a2bb5-9a4e-456e-9489-d1422a1df47
```