secretflow / kuscia

Kuscia (Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

Centralized Kuscia deployment: job created successfully via KusciaAPI, but execution fails #365

Open Tiger007x opened 1 month ago

Tiger007x commented 1 month ago

Issue Type

Install/Deploy

Search for existing issues similar to yours

Yes

OS Platform and Distribution

Ubuntu22.04

Kuscia Version

0.7.0b0

Deployment

docker

deployment Version

26.1.4

App Running type

secretflow

App Running version

1.3.0b0

Configuration file used to run kuscia.

# kubectl get domain alice -o yaml 
apiVersion: kuscia.secretflow/v1alpha1
kind: Domain
metadata:
  annotations:
    domain/alice: kuscia.secretflow/domain-type=embedded
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kuscia.secretflow/v1alpha1","kind":"Domain","metadata":{"annotations":{"domain/alice":"kuscia.secretflow/domain-type=embedded"},"name":"alice"},"spec":{"authCenter":{"authenticationType":"Token","tokenGenMethod":"UID-RSA-GEN"},"cert":null,"master":null,"role":null}}
  creationTimestamp: "2024-07-02T09:39:52Z"
  generation: 2
  labels:
    kuscia.secretflow/domain-auth: completed
  name: alice
  resourceVersion: "466939"
  uid: 8ff909b6-57d1-4e41-a38c-3d12584a61b8
spec:
  authCenter:
    authenticationType: Token
    tokenGenMethod: UID-RSA-GEN
  cert: xxxxx
status:
  deployTokenStatuses:
  - lastTransitionTime: "2024-07-02T09:39:52Z"
    state: used
    token: uuLlXxwfUquQGZVWzbTRbGKn23BmViCV
  - lastTransitionTime: "2024-07-02T09:45:07Z"
    state: unused
    token: eAI8YpZLMKiYCdwZGBgcndjaSbmyi9dF
  nodeStatuses:
  - lastHeartbeatTime: "2024-07-05T04:29:17Z"
    lastTransitionTime: "2024-07-02T09:45:11Z"
    name: root-kuscia-lite-alice
    status: Ready
    version: v0.7.0b0
# kubectl get domain bob -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: Domain
metadata:
  annotations:
    domain/bob: kuscia.secretflow/domain-type=embedded
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kuscia.secretflow/v1alpha1","kind":"Domain","metadata":{"annotations":{"domain/bob":"kuscia.secretflow/domain-type=embedded"},"name":"bob"},"spec":{"authCenter":{"authenticationType":"Token","tokenGenMethod":"UID-RSA-GEN"},"cert":null,"master":null,"role":null}}
  creationTimestamp: "2024-07-02T09:51:33Z"
  generation: 2
  labels:
    kuscia.secretflow/domain-auth: completed
  name: bob
  resourceVersion: "467469"
  uid: efb44777-528d-4b87-9942-2bfb37f71cc1
spec:
  authCenter:
    authenticationType: Token
    tokenGenMethod: UID-RSA-GEN
  cert: xxxxxxx
status:
  deployTokenStatuses:
  - lastTransitionTime: "2024-07-02T09:51:33Z"
    state: used
    token: ha5U9TsdAwYBsPt2mcwyKsazqYP42ls6
  - lastTransitionTime: "2024-07-02T09:52:21Z"
    state: unused
    token: kC88yQZ6Drixp9F4tSFVzWa85nfQMYTf
  nodeStatuses:
  - lastHeartbeatTime: "2024-07-05T04:33:53Z"
    lastTransitionTime: "2024-07-02T09:52:23Z"
    name: root-kuscia-lite-bob
    status: Ready
    version: v0.7.0b0
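
Both Domain resources above report a node with `status: Ready`. A minimal sketch for checking this programmatically, assuming the resource has been exported with `kubectl get domain alice -o json` and loaded into a dict (the helper name `ready_nodes` and the trimmed sample are illustrative, not part of Kuscia):

```python
def ready_nodes(domain):
    """Return the names of nodes whose status is "Ready" in a Domain resource dict."""
    statuses = domain.get("status", {}).get("nodeStatuses", [])
    return [node.get("name") for node in statuses if node.get("status") == "Ready"]


# Trimmed copy of the alice Domain shown above (only the fields used here).
alice = {
    "metadata": {"name": "alice"},
    "status": {
        "nodeStatuses": [
            {"name": "root-kuscia-lite-alice", "status": "Ready", "version": "v0.7.0b0"}
        ]
    },
}

print(ready_nodes(alice))
```

An empty list would mean the domain has no Ready node, which would point to a deployment problem rather than a task problem.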

What happened and what you expected to happen.

Calling /api/v1/job/create creates the job successfully, but querying the job status returns an error:
{"status":{"code":0, "message":"success", "details":[]}, "data":{"job_id":"job-alice-bob-003", "initiator":"alice", "max_parallelism":2, "tasks":[{"app_image":"secretflow-image", "parties":[{"domain_id":"alice", "role":"partner"}, {"domain_id":"bob", "role":"partner"}], "alias":"job-psi-3", "task_id":"job-psi-3", "dependencies":[], "task_input_config":"{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"bob\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"bob\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"preprocessing\",\"name\":\"psi\",\"version\":\"0.0.1\",\"attr_paths\":[\"input/receiver_input/key\",\"input/sender_input/key\",\"protocol\",\"precheck_input\",\"bucket_size\",\"curve_type\"],\"attrs\":[{\"ss\":[\"id1\"]},{\"ss\":[\"id2\"]},{\"s\":\"ECDH_PSI_2PC\"},{\"b\":true},{\"i64\":\"1048576\"},{\"s\":\"CURVE_FOURQ\"}]},\"sf_input_ids\":[\"alice-table\",\"bob-table\"],\"sf_output_ids\":[\"psi-output\"],\"sf_output_uris\":[\"psi-output.csv\"]}", "priority":100}], "status":{"state":"Failed", "err_msg":"", "create_time":"2024-07-05T02:39:04Z", "start_time":"2024-07-05T02:39:04Z", "end_time":"2024-07-05T02:39:19Z", "tasks":[{"task_id":"job-psi-3", "state":"Failed", "err_msg":"The remaining no-failed party task counts 1 are less than the threshold 2 that meets the 
conditions for task success. pending party[], running party[bob-partner], successful party[], failed party[alice-partner]", "create_time":"2024-07-05T02:39:04Z", "start_time":"2024-07-05T02:39:04Z", "end_time":"2024-07-05T02:39:19Z", "parties":[{"domain_id":"alice", "state":"Failed", "err_msg":"container[secretflow] terminated state reason \"Error\", message: \"rtner-0-global.alice.svc` (via SIGTERM)\\n2024-07-05 02:39:12,591\\tVINFO scripts.py:1023 -- Send termination request to `/usr/local/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/raylet --store_socket_name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/plasma_store --object_manager_port=20489 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=20494 --node_ip_address=job-psi-3-partner-0-global.alice.svc --maximum_startup_concurrency=4 --static_resource_list=node:job-psi-3-partner-0-global.alice.svc,1.0,CPU,32,memory,4522824500,object_store_memory,2261412249 \\\"--python_worker_command=/usr/local/bin/python /usr/local/lib/python3.8/site-packages/ray/_private/workers/setup_worker.py /usr/local/lib/python3.8/site-packages/ray/_private/workers/default_worker.py --node-ip-address=job-psi-3-partner-0-global.alice.svc --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/plasma_store --raylet-name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=34981 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=job-psi-3-partner-0-global.alice.svc:20493 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000\\\" --java_worker_command= --cpp_worker_command= --native_library_path=/usr/local/lib/python3.8/site-packages/ray/cpp/lib --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32 
--log_dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/logs --resource_dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/runtime_resources --metrics-agent-port=34981 --metrics_export_port=53736 --object_store_memory=2261412249 --plasma_directory=/tmp --ray-debugger-external=0 --gcs-address=job-psi-3-partner-0-global.alice.svc:20493 --session-name=session_2024-07-05_02-39-09_843365_32 \\\"--agent_command=/usr/local/bin/python -u /usr/local/lib/python3.8/site-packages/ray/dashboard/agent.py --node-ip-address=job-psi-3-partner-0-global.alice.svc --metrics-export-port=53736 --dashboard-agent-port=34981 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/plasma_store --raylet-name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32 --runtime-env-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/runtime_resources --log-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2024-07-05_02-39-09_843365_32 --gcs-address=job-psi-3-partner-0-global.alice.svc:20493 --minimal\\\" --node-name=job-psi-3-partner-0-global.alice.svc` (via SIGTERM)\\n2024-07-05 02:39:12,592\\tVINFO scripts.py:1023 -- Send termination request to `/usr/local/bin/python -u /usr/local/lib/python3.8/site-packages/ray/_private/log_monitor.py --logs-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/logs --gcs-address=job-psi-3-partner-0-global.alice.svc:20493 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5` (via SIGTERM)\\n2024-07-05 02:39:12,594\\tVINFO scripts.py:1023 -- Send termination request to `/usr/local/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/raylet 
--store_socket_name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/plasma_store --object_manager_port=20489 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=20494 --node_ip_address=job-psi-3-partner-0-global.alice.svc --maximum_startup_concurrency=4 --static_resource_list=node:job-psi-3-partner-0-global.alice.svc,1.0,CPU,32,memory,4522824500,object_store_memory,2261412249 \\\"--python_worker_command=/usr/local/bin/python /usr/local/lib/python3.8/site-packages/ray/_private/workers/setup_worker.py /usr/local/lib/python3.8/site-packages/ray/_private/workers/default_worker.py --node-ip-address=job-psi-3-partner-0-global.alice.svc --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/plasma_store --raylet-name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=34981 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=job-psi-3-partner-0-global.alice.svc:20493 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000\\\" --java_worker_command= --cpp_worker_command= --native_library_path=/usr/local/lib/python3.8/site-packages/ray/cpp/lib --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32 --log_dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/logs --resource_dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/runtime_resources --metrics-agent-port=34981 --metrics_export_port=53736 --object_store_memory=2261412249 --plasma_directory=/tmp --ray-debugger-external=0 --gcs-address=job-psi-3-partner-0-global.alice.svc:20493 --session-name=session_2024-07-05_02-39-09_843365_32 \\\"--agent_command=/usr/local/bin/python -u /usr/local/lib/python3.8/site-packages/ray/dashboard/agent.py --node-ip-address=job-psi-3-partner-0-global.alice.svc --metrics-export-port=53736 --dashboard-agent-port=34981 --listen-port=52365 
--node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/plasma_store --raylet-name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32 --runtime-env-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/runtime_resources --log-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2024-07-05_02-39-09_843365_32 --gcs-address=job-psi-3-partner-0-global.alice.svc:20493 --minimal\\\" --node-name=job-psi-3-partner-0-global.alice.svc` (via SIGTERM)\\n2024-07-05 02:39:12,594\\tVINFO scripts.py:1023 -- Send termination request to `/usr/local/bin/python -u /usr/local/lib/python3.8/site-packages/ray/dashboard/agent.py --node-ip-address=job-psi-3-partner-0-global.alice.svc --metrics-export-port=53736 --dashboard-agent-port=34981 --listen-port=52365 --node-manager-port=20494 --object-store-name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/plasma_store --raylet-name=/tmp/ray/session_2024-07-05_02-39-09_843365_32/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32 --runtime-env-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/runtime_resources --log-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2024-07-05_02-39-09_843365_32 --gcs-address=job-psi-3-partner-0-global.alice.svc:20493 --minimal --agent-id 1059961393` (via SIGTERM)\\n2024-07-05 02:39:12,595\\tVINFO scripts.py:1023 -- Send termination request to `/usr/local/bin/python /usr/local/lib/python3.8/site-packages/ray/dashboard/dashboard.py --host=localhost --port=8265 --port-retries=0 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32/logs --session-dir=/tmp/ray/session_2024-07-05_02-39-09_843365_32 
--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=job-psi-3-partner-0-global.alice.svc:20493 --minimal --modules-to-load=UsageStatsHead` (via SIGTERM)\\n2024-07-05 02:39:12,769\\tINFO scripts.py:1051 -- 1/7 stopped.\\r2024-07-05 02:39:12,769\\tINFO scripts.py:1051 -- 2/7 stopped.\\r2024-07-05 02:39:12,769\\tINFO scripts.py:1051 -- 3/7 stopped.\\r2024-07-05 02:39:12,769\\tINFO scripts.py:1051 -- 4/7 stopped.\\r2024-07-05 02:39:12,769\\tINFO scripts.py:1051 -- 5/7 stopped.\\r2024-07-05 02:39:12,769\\tINFO scripts.py:1051 -- 6/7 stopped.\\r2024-07-05 02:39:18,210\\tINFO scripts.py:1051 -- 7/7 stopped.\\r2024-07-05 02:39:18,210\\tSUCC scripts.py:1063 -- Stopped all 7 Ray processes.\\n\"", "endpoints":[{"port_name":"fed", "scope":"Cluster", "endpoint":"job-psi-3-partner-0-fed.alice.svc"}, {"port_name":"global", "scope":"Domain", "endpoint":"job-psi-3-partner-0-global.alice.svc:20493"}, {"port_name":"spu", "scope":"Cluster", "endpoint":"job-psi-3-partner-0-spu.alice.svc"}]}, {"domain_id":"bob", "state":"Failed", "err_msg":"", "endpoints":[{"port_name":"global", "scope":"Domain", "endpoint":"job-psi-3-partner-0-global.bob.svc:23819"}, {"port_name":"spu", "scope":"Cluster", "endpoint":"job-psi-3-partner-0-spu.bob.svc"}, {"port_name":"fed", "scope":"Cluster", "endpoint":"job-psi-3-partner-0-fed.bob.svc"}]}]}], "stage_status_list":[{"domain_id":"alice", "state":"JobCreateStageSucceeded"}, {"domain_id":"bob", "state":"JobCreateStageSucceeded"}], "approve_status_list":[{"domain_id":"alice", "state":"JobAccepted"}, {"domain_id":"bob", "state":"JobAccepted"}]}, "custom_fields":{}}}
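
In the response above, the top-level `status.code` is 0 because the query itself succeeded; the job outcome is in `data.status.state` and the per-party `state` and `err_msg` fields. A minimal sketch for pulling those fields out of such a response, assuming it has already been parsed into a dict (the helper name `summarize_job_status` is hypothetical, and the sample below is trimmed with an abbreviated `err_msg`):

```python
def summarize_job_status(response, tail_chars=200):
    """Return (task_id, domain_id, state, err_tail) for every party of every task."""
    rows = []
    job_status = response.get("data", {}).get("status", {})
    for task in job_status.get("tasks", []):
        for party in task.get("parties", []):
            err_msg = party.get("err_msg", "")
            # Container output can be very long, so keep only its tail.
            rows.append(
                (
                    task.get("task_id"),
                    party.get("domain_id"),
                    party.get("state"),
                    err_msg[-tail_chars:],
                )
            )
    return rows


# Trimmed sample shaped like the response above; err_msg is abbreviated.
sample = {
    "status": {"code": 0, "message": "success"},
    "data": {
        "job_id": "job-alice-bob-003",
        "status": {
            "state": "Failed",
            "tasks": [
                {
                    "task_id": "job-psi-3",
                    "state": "Failed",
                    "parties": [
                        {
                            "domain_id": "alice",
                            "state": "Failed",
                            "err_msg": "Stopped all 7 Ray processes.\n",
                        },
                        {"domain_id": "bob", "state": "Failed", "err_msg": ""},
                    ],
                }
            ],
        },
    },
}

for row in summarize_job_status(sample):
    print(row)
```

Note that the `err_msg` captured here only contains the Ray shutdown output, so the root cause most likely has to be read from the full task container log on the alice node.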

Kuscia log output.

#alice
2024-07-05 10:39:04.354 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-fed] (5.923µs)
2024-07-05 10:39:04.354 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[alice/job-psi-3-partner-0-spu] (9.069µs)
2024-07-05 10:39:04.354 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-spu] (7.534µs)
2024-07-05 10:39:04.359 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-fed] (5.552µs)
2024-07-05 10:39:04.359 INFO controller/endpoints.go:189 Updating endpoint alice/job-psi-3-partner-0-fed/453915
2024-07-05 10:39:04.359 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[alice/job-psi-3-partner-0-fed] (6.859µs)
2024-07-05 10:39:04.359 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-global] (6.055µs)
2024-07-05 10:39:04.363 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-global] (6.415µs)
2024-07-05 10:39:04.363 INFO controller/endpoints.go:189 Updating endpoint alice/job-psi-3-partner-0-global/453917
2024-07-05 10:39:04.363 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[alice/job-psi-3-partner-0-global] (8.504µs)
2024-07-05 10:39:06.565 INFO source/apiserver.go:58 Receive pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" add event from apiserver
2024-07-05 10:39:06.565 INFO source/config.go:100 Pod change merged, source=api, adds=[job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)], updates=[], deletes=[], removes=[], reconciles=[]
2024-07-05 10:39:06.565 INFO framework/pods_controller.go:263 SyncLoop ADD, source=api, pods=[job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)]
2024-07-05 10:39:06.565 INFO framework/pods_controller.go:515 Sync pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" enter
2024-07-05 10:39:06.566 INFO pod/cri_provider.go:756 CRIProvider start syncing pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)"
2024-07-05 10:39:06.566 INFO resource/volume_manager.go:178 Mount (dump) file success, path=/home/kuscia/var/pods/611303bd-7eb4-4b90-a594-a26e9f1341aa/volumes/kubernetes.io~configmap/job-psi-3-configtemplate/task-config.conf, mode=420, size=176
2024-07-05 10:39:06.566 INFO resource/volume_manager.go:103 Mount volumes map[config-template:{HostPath:/home/kuscia/var/pods/611303bd-7eb4-4b90-a594-a26e9f1341aa/volumes/kubernetes.io~configmap/job-psi-3-configtemplate ReadOnly:true Managed:true SELinuxRelabel:true}] for pod "job-psi-3-partner-0" succeed
2024-07-05 10:39:06.566 INFO kuberuntime/kuberuntime_manager.go:325 No sandbox for pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" can be found. Need to start a new one
2024-07-05 10:39:06.566 INFO kuberuntime/kuberuntime_manager.go:547 ComputePodActions got for pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)": {true true  0 nil [0] map[] []}
2024-07-05 10:39:06.567 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Node", Namespace:"alice", Name:"root-kuscia-lite-alice", UID:"root-kuscia-lite-alice", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
2024-07-05 10:39:06.567 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Pod", Namespace:"alice", Name:"job-psi-3-partner-0", UID:"611303bd-7eb4-4b90-a594-a26e9f1341aa", APIVersion:"v1", ResourceVersion:"453956", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' pod: "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
2024-07-05 10:39:06.641 INFO status/status_manager.go:625 Patch status for pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)", patch={"metadata":{"uid":"611303bd-7eb4-4b90-a594-a26e9f1341aa"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2024-07-05T02:39:06Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2024-07-05T02:39:06Z","message":"containers with unready status: [secretflow]","reason":"ContainersNotReady","status":"False","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":"2024-07-05T02:39:06Z","message":"containers with unready status: [secretflow]","reason":"ContainersNotReady","status":"False","type":"ContainersReady"}],"containerStatuses":[{"image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0","imageID":"","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"waiting":{"reason":"ContainerCreating"}}}],"hostIP":"172.18.0.3","qosClass":null,"startTime":"2024-07-05T02:39:06Z"}}
2024-07-05 10:39:06.649 INFO source/apiserver.go:65 Receive pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" update event from apiserver
2024-07-05 10:39:06.650 INFO source/config.go:100 Pod change merged, source=api, adds=[], updates=[], deletes=[], removes=[], reconciles=[job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)]
2024-07-05 10:39:06.650 INFO framework/pods_controller.go:276 SyncLoop RECONCILE, source=api, pods=[job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)]
2024-07-05 10:39:06.709 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Node", Namespace:"alice", Name:"root-kuscia-lite-alice", UID:"root-kuscia-lite-alice", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
2024-07-05 10:39:06.709 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Pod", Namespace:"alice", Name:"job-psi-3-partner-0", UID:"611303bd-7eb4-4b90-a594-a26e9f1341aa", APIVersion:"v1", ResourceVersion:"453956", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' pod: "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
2024-07-05 10:39:06.710 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Pod", Namespace:"alice", Name:"job-psi-3-partner-0", UID:"611303bd-7eb4-4b90-a594-a26e9f1341aa", APIVersion:"v1", ResourceVersion:"453956", FieldPath:"spec.containers{secretflow}"}): type: 'Normal' reason: 'Pulled' Container image "secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0" already present on machine
2024-07-05 10:39:06.865 INFO pod/cri_provider.go:348 Receive pleg event: &{ID:611303bd-7eb4-4b90-a594-a26e9f1341aa Type:ContainerStarted Data:f5e1f8500a827258dce03511e5e8ab3adb89bcf23be11bfb21a9cbac20f3066b}
2024-07-05 10:39:06.885 INFO certissuance/cert_issuance.go:302 Successfully issued certificate(server=true,client=true) for container "secretflow" in pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)"
2024-07-05 10:39:06.885 INFO configrender/config_render.go:245 Render config template for container "secretflow" in pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" succeed, templatePath=/home/kuscia/var/pods/611303bd-7eb4-4b90-a594-a26e9f1341aa/volumes/kubernetes.io~configmap/job-psi-3-configtemplate/task-config.conf, configPath=/home/kuscia/var/pods/611303bd-7eb4-4b90-a594-a26e9f1341aa/volumes/config-render/secretflow/config-template/task-config.conf
2024-07-05 10:39:06.885 INFO pod/cri_provider.go:573 Successfully generated the run options of container "secretflow" in pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)"
2024-07-05 10:39:06.909 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Pod", Namespace:"alice", Name:"job-psi-3-partner-0", UID:"611303bd-7eb4-4b90-a594-a26e9f1341aa", APIVersion:"v1", ResourceVersion:"453956", FieldPath:"spec.containers{secretflow}"}): type: 'Normal' reason: 'Created' Created container secretflow
2024-07-05 10:39:06.944 INFO framework/pods_controller.go:517 Sync pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" exit, isTerminal=false
2024-07-05 10:39:06.945 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Pod", Namespace:"alice", Name:"job-psi-3-partner-0", UID:"611303bd-7eb4-4b90-a594-a26e9f1341aa", APIVersion:"v1", ResourceVersion:"453956", FieldPath:"spec.containers{secretflow}"}): type: 'Normal' reason: 'Started' Started container secretflow
2024-07-05 10:39:06.946 INFO framework/pods_controller.go:515 Sync pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" enter
2024-07-05 10:39:06.946 INFO pod/cri_provider.go:756 CRIProvider start syncing pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)"
2024-07-05 10:39:06.946 INFO resource/volume_manager.go:178 Mount (dump) file success, path=/home/kuscia/var/pods/611303bd-7eb4-4b90-a594-a26e9f1341aa/volumes/kubernetes.io~configmap/job-psi-3-configtemplate/task-config.conf, mode=420, size=176
2024-07-05 10:39:06.946 INFO resource/volume_manager.go:103 Mount volumes map[config-template:{HostPath:/home/kuscia/var/pods/611303bd-7eb4-4b90-a594-a26e9f1341aa/volumes/kubernetes.io~configmap/job-psi-3-configtemplate ReadOnly:true Managed:true SELinuxRelabel:true}] for pod "job-psi-3-partner-0" succeed
2024-07-05 10:39:06.946 INFO kuberuntime/kuberuntime_manager.go:547 ComputePodActions got for pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)": {false false f5e1f8500a827258dce03511e5e8ab3adb89bcf23be11bfb21a9cbac20f3066b 0 nil [] map[] []}
2024-07-05 10:39:06.946 INFO framework/pods_controller.go:517 Sync pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" exit, isTerminal=false
2024-07-05 10:39:06.947 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Node", Namespace:"alice", Name:"root-kuscia-lite-alice", UID:"root-kuscia-lite-alice", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
2024-07-05 10:39:06.947 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Pod", Namespace:"alice", Name:"job-psi-3-partner-0", UID:"611303bd-7eb4-4b90-a594-a26e9f1341aa", APIVersion:"v1", ResourceVersion:"453956", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' pod: "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
2024-07-05 10:39:06.965 INFO status/status_manager.go:625 Patch status for pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)", patch={"metadata":{"uid":"611303bd-7eb4-4b90-a594-a26e9f1341aa"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"message":null,"reason":null,"status":"True","type":"Ready"},{"message":null,"reason":null,"status":"True","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://1050bbbee6ee9906b5952122cdfc5b25a545a5ef98f1b28a41488eca640365a2","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0","imageID":"sha256:b7b67c366e17624c5c8b44d1e76f49e135a6f3cbc822df1b4cba1b0fe5a04fef","lastState":{},"name":"secretflow","ready":true,"restartCount":0,"started":true,"state":{"running":{"startedAt":"2024-07-05T02:39:06Z"}}}],"phase":"Running","podIP":"10.88.0.5","podIPs":[{"ip":"10.88.0.5"}]}}
2024-07-05 10:39:06.966 INFO source/apiserver.go:65 Receive pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" update event from apiserver
2024-07-05 10:39:06.966 INFO source/config.go:100 Pod change merged, source=api, adds=[], updates=[], deletes=[], removes=[], reconciles=[job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)]
2024-07-05 10:39:06.966 INFO framework/pods_controller.go:276 SyncLoop RECONCILE, source=api, pods=[job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)]
2024-07-05 10:39:06.970 INFO controller/endpoints.go:189 Updating endpoint alice/job-psi-3-partner-0-spu/453979
2024-07-05 10:39:06.970 INFO xds/xds.go:434 Add cluster:service-job-psi-3-partner-0-spu
2024-07-05 10:39:06.971 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-spu] (8.006µs)
2024-07-05 10:39:06.971 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-global] (4.116µs)
2024-07-05 10:39:06.970 INFO controller/endpoints.go:189 Updating endpoint alice/job-psi-3-partner-0-global/453982
2024-07-05 10:39:06.971 INFO xds/xds.go:434 Add cluster:service-job-psi-3-partner-0-global
2024-07-05 10:39:06.975 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-fed] (7.631µs)
2024-07-05 10:39:06.975 INFO controller/endpoints.go:189 Updating endpoint alice/job-psi-3-partner-0-fed/453985
2024-07-05 10:39:06.976 INFO xds/xds.go:434 Add cluster:service-job-psi-3-partner-0-fed
2024-07-05 10:39:06.979 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-spu] (5.536µs)
2024-07-05 10:39:06.979 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[alice/job-psi-3-partner-0-spu] (8.390869ms)
2024-07-05 10:39:06.982 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[alice/job-psi-3-partner-0-global] (10.406168ms)
2024-07-05 10:39:06.982 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-global] (19.75µs)
2024-07-05 10:39:06.988 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[alice/job-psi-3-partner-0-fed] (5.745µs)
2024-07-05 10:39:06.989 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[alice/job-psi-3-partner-0-fed] (13.106722ms)
2024-07-05 10:39:07.867 INFO pod/cri_provider.go:348 Receive pleg event: &{ID:611303bd-7eb4-4b90-a594-a26e9f1341aa Type:ContainerStarted Data:1050bbbee6ee9906b5952122cdfc5b25a545a5ef98f1b28a41488eca640365a2}
2024-07-05 10:39:07.868 INFO framework/pods_controller.go:515 Sync pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" enter
2024-07-05 10:39:07.868 INFO status/status_manager.go:490 Ignoring same status for pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)"
2024-07-05 10:39:07.868 INFO pod/cri_provider.go:756 CRIProvider start syncing pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)"
2024-07-05 10:39:07.868 INFO resource/volume_manager.go:178 Mount (dump) file success, path=/home/kuscia/var/pods/611303bd-7eb4-4b90-a594-a26e9f1341aa/volumes/kubernetes.io~configmap/job-psi-3-configtemplate/task-config.conf, mode=420, size=176
2024-07-05 10:39:07.869 INFO resource/volume_manager.go:103 Mount volumes map[config-template:{HostPath:/home/kuscia/var/pods/611303bd-7eb4-4b90-a594-a26e9f1341aa/volumes/kubernetes.io~configmap/job-psi-3-configtemplate ReadOnly:true Managed:true SELinuxRelabel:true}] for pod "job-psi-3-partner-0" succeed
2024-07-05 10:39:07.869 INFO kuberuntime/kuberuntime_manager.go:547 ComputePodActions got for pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)": {false false f5e1f8500a827258dce03511e5e8ab3adb89bcf23be11bfb21a9cbac20f3066b 0 nil [] map[] []}
2024-07-05 10:39:07.869 INFO framework/pods_controller.go:517 Sync pod "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" exit, isTerminal=false
2024-07-05 10:39:07.869 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Node", Namespace:"alice", Name:"root-kuscia-lite-alice", UID:"root-kuscia-lite-alice", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
2024-07-05 10:39:07.869 INFO record/event.go:285 Event(v1.ObjectReference{Kind:"Pod", Namespace:"alice", Name:"job-psi-3-partner-0", UID:"611303bd-7eb4-4b90-a594-a26e9f1341aa", APIVersion:"v1", ResourceVersion:"453956", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' pod: "job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
2024-07-05 10:39:18.883 INFO pleg/generic.go:303 Generic (PLEG): container finished, podID=611303bd-7eb4-4b90-a594-a26e9f1341aa, containerID=1050bbbee6ee9906b5952122cdfc5b25a545a5ef98f1b28a41488eca640365a2, exitCode=1
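
The last alice log line shows the task container exiting with `exitCode=1`. A minimal sketch for finding such lines in a longer Kuscia log, assuming the PLEG message format seen above (the regular expression and the helper name `finished_containers` are illustrative and derived only from this log excerpt):

```python
import re

# Pattern derived from the "container finished" line in the log excerpt above.
PLEG_FINISHED = re.compile(
    r"container finished, podID=([0-9a-f-]+), containerID=([0-9a-f]+), exitCode=(\d+)"
)


def finished_containers(log_lines):
    """Return (pod_id, container_id, exit_code) for each "container finished" line."""
    results = []
    for line in log_lines:
        match = PLEG_FINISHED.search(line)
        if match:
            pod_id, container_id, exit_code = match.groups()
            results.append((pod_id, container_id, int(exit_code)))
    return results


# Two lines copied from the alice log above.
sample_lines = [
    '2024-07-05 10:39:07.869 INFO framework/pods_controller.go:517 Sync pod '
    '"job-psi-3-partner-0_alice(611303bd-7eb4-4b90-a594-a26e9f1341aa)" exit, isTerminal=false',
    "2024-07-05 10:39:18.883 INFO pleg/generic.go:303 Generic (PLEG): container finished, "
    "podID=611303bd-7eb4-4b90-a594-a26e9f1341aa, "
    "containerID=1050bbbee6ee9906b5952122cdfc5b25a545a5ef98f1b28a41488eca640365a2, exitCode=1",
]

for pod_id, container_id, exit_code in finished_containers(sample_lines):
    print(pod_id, container_id[:12], exit_code)
```

A non-zero exit code identifies which container to inspect; the Kuscia log itself does not include the container's own error output.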

#bob  
2024-07-05 10:39:29.299 INFO framework/pods_controller.go:661 Sync terminating pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)" exit
2024-07-05 10:39:29.307 INFO framework/pods_controller.go:669 Sync terminated pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)" enter
2024-07-05 10:39:29.344 INFO pod/cri_provider.go:811 CRIProvider start deleting pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)"
2024-07-05 10:39:29.344 INFO framework/pods_controller.go:685 Sync terminated pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)" exit
2024-07-05 10:39:29.449 INFO source/apiserver.go:65 Receive pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)" update event from apiserver
2024-07-05 10:39:29.450 INFO source/config.go:100 Pod change merged, source=api, adds=[], updates=[], deletes=[], removes=[], reconciles=[job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)]
2024-07-05 10:39:29.450 INFO framework/pods_controller.go:276 SyncLoop RECONCILE, source=api, pods=[job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)]
2024-07-05 10:39:29.454 INFO status/status_manager.go:625 Patch status for pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)", patch={"metadata":{"uid":"a1aa6f55-1f52-45fd-96bc-525b36d74dde"},"status":{"$setElementOrder/conditions":[{"type":"Initialized"},{"type":"Ready"},{"type":"ContainersReady"},{"type":"PodScheduled"}],"conditions":[{"lastTransitionTime":"2024-07-05T02:39:29Z","reason":"PodFailed","status":"False","type":"Ready"},{"lastTransitionTime":"2024-07-05T02:39:29Z","reason":"PodFailed","status":"False","type":"ContainersReady"}],"containerStatuses":[{"containerID":"containerd://d9872dba311405bb3ba83355c6aff2419af1f0be3db3aba1bb685ee0d3971a5a","image":"secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.3.0b0","imageID":"sha256:b7b67c366e17624c5c8b44d1e76f49e135a6f3cbc822df1b4cba1b0fe5a04fef","lastState":{},"name":"secretflow","ready":false,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://d9872dba311405bb3ba83355c6aff2419af1f0be3db3aba1bb685ee0d3971a5a","exitCode":15,"finishedAt":"2024-07-05T02:39:28Z","message":"WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.\n2024-07-05 02:39:15,099|bob|INFO|secretflow|entry.py:start_ray:56| ray_conf: RayConfig(ray_node_ip_address='job-psi-3-partner-0-global.bob.svc', ray_node_manager_port=23820, ray_object_manager_port=23821, ray_client_server_port=23822, ray_worker_ports=[], ray_gcs_port=23819)\n2024-07-05 02:39:15,099|bob|INFO|secretflow|entry.py:start_ray:60| Trying to start ray head node at job-psi-3-partner-0-global.bob.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=2 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=job-psi-3-partner-0-global.bob.svc --port=23819 --node-manager-port=23820 --object-manager-port=23821 
--ray-client-server-port=23822\n","reason":"Error","startedAt":"2024-07-05T02:39:07Z"}}}],"phase":"Failed","podIP":null,"podIPs":null}}
2024-07-05 10:39:29.454 INFO framework/pod.go:751 Pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)" is terminated and all resources are reclaimed
2024-07-05 10:39:29.457 INFO controller/endpoints.go:189 Updating endpoint bob/job-psi-3-partner-0-spu/454081
2024-07-05 10:39:29.501 INFO status/status_manager.go:653 Pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)" fully terminated and removed from etcd
2024-07-05 10:39:29.503 INFO controller/endpoints.go:189 Updating endpoint bob/job-psi-3-partner-0-global/454083
2024-07-05 10:39:29.503 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[bob/job-psi-3-partner-0-spu] (9.762µs)
2024-07-05 10:39:29.507 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[bob/job-psi-3-partner-0-global] (23.75µs)
2024-07-05 10:39:29.507 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[bob/job-psi-3-partner-0-fed] (4.107µs)
2024-07-05 10:39:29.507 INFO source/apiserver.go:65 Receive pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)" update event from apiserver
2024-07-05 10:39:29.507 INFO source/apiserver.go:72 Receive pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)" delete event from apiserver
2024-07-05 10:39:29.507 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[bob/job-psi-3-partner-0-spu] (3.322µs)
2024-07-05 10:39:29.507 INFO source/config.go:100 Pod change merged, source=api, adds=[], updates=[], deletes=[job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)], removes=[], reconciles=[]
2024-07-05 10:39:29.507 INFO source/config.go:100 Pod change merged, source=api, adds=[], updates=[], deletes=[], removes=[job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)], reconciles=[]
2024-07-05 10:39:29.507 INFO framework/pods_controller.go:279 SyncLoop DELETE, source=api, pods=[job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)]
2024-07-05 10:39:29.507 INFO framework/pods_controller.go:273 SyncLoop REMOVE, source=api, pods=[job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)]
2024-07-05 10:39:29.507 INFO framework/pods_controller.go:347 Pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)" has been deleted and must be killed
2024-07-05 10:39:29.507 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[bob/job-psi-3-partner-0-fed] (5.778µs)
2024-07-05 10:39:29.507 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[bob/job-psi-3-partner-0-global] (2.723µs)
2024-07-05 10:39:29.507 INFO queue/queue.go:124 Finish processing item: queue id[kuscia-coredns-controller], key[bob/job-psi-3-partner-0-spu] (2.722µs)
2024-07-05 10:39:29.507 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[bob/job-psi-3-partner-0-global] (419.249µs)
2024-07-05 10:39:29.583 INFO controller/endpoints.go:189 Updating endpoint bob/job-psi-3-partner-0-fed/454084
2024-07-05 10:39:29.583 INFO controller/endpoints.go:189 Updating endpoint bob/job-psi-3-partner-0-fed/454093
2024-07-05 10:39:29.583 INFO controller/endpoints.go:189 Updating endpoint bob/job-psi-3-partner-0-global/454094
2024-07-05 10:39:29.583 INFO controller/endpoints.go:189 Updating endpoint bob/job-psi-3-partner-0-spu/454095
2024-07-05 10:39:29.583 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[bob/job-psi-3-partner-0-fed] (266.327µs)
2024-07-05 10:39:29.583 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[bob/job-psi-3-partner-0-global] (106.138µs)
2024-07-05 10:39:29.583 INFO status/status_manager.go:604 Pod "job-psi-3-partner-0_bob(a1aa6f55-1f52-45fd-96bc-525b36d74dde)" does not exist on the server
2024-07-05 10:39:29.503 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[bob/job-psi-3-partner-0-spu] (33.477µs)
2024-07-05 10:39:29.591 INFO queue/queue.go:124 Finish processing item: queue id[endpoints-queue], key[bob/job-psi-3-partner-0-spu] (7.978222ms)
zimu-yuxi commented 1 month ago

Please provide the task execution logs. You can find them under /home/kuscia/var/stdout, in the folder for job-psi-3.
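As a rough aid for collecting those logs, here is a minimal Python sketch that lists every file under directories whose name contains the job name. Only the root path `/home/kuscia/var/stdout` and the job name `job-psi-3` come from this thread. The helper name `find_task_logs` is hypothetical, and the assumption that the job's folder name contains the string `job-psi-3` is a guess about the layout, so adjust the match if your directory names differ.

```python
from pathlib import Path


def find_task_logs(stdout_root: str, job_name: str) -> list[Path]:
    """Return all files under top-level directories whose name contains job_name.

    Hypothetical helper: the directory naming under stdout_root is an
    assumption, not something confirmed by the kuscia documentation.
    """
    root = Path(stdout_root)
    if not root.is_dir():
        # Nothing to search, e.g. when run outside the kuscia container.
        return []
    logs: list[Path] = []
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and job_name in entry.name:
            # Collect files at any depth below the matching directory.
            logs.extend(sorted(p for p in entry.rglob("*") if p.is_file()))
    return logs


if __name__ == "__main__":
    # Root path and job name as mentioned in this issue.
    for path in find_task_logs("/home/kuscia/var/stdout", "job-psi-3"):
        print(path)
```

Run it inside the node container that executed the task (for example the alice or bob lite node), then attach the contents of the files it prints.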