secretflow / kuscia

Kuscia (Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0

PSI execution error #381

Closed: ruhengChen closed this issue 4 months ago

ruhengChen commented 4 months ago

Issue Type

Install/Deploy

Search for existing issues similar to yours

Yes

OS Platform and Distribution

Linux ecs-46f7 4.19.90-17.5.ky10.aarch64 #1 SMP Fri Aug 7 13:35:33 CST 2020 aarch64 aarch64 aarch64 GNU/Linux

Kuscia Version

0.9.0b0

Deployment

docker

deployment Version

24.0.8

App Running type

secretflow

App Running version

1.7.0b0

Configuration file used to run kuscia.

# alice
mode: autonomy
domainID: alice
domainKeyData: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFcEFJQkFBS0NBUUVBMVFvelMwbEFYM1JOM2xxV0QzZHJBcmN6V3c1dW5RdGhWcXJMMjAzTHU5Mml3RXVSCkRjZjJabWJGSXJwMGtxc0U1bWpMcXJRZSs1TXBSZnNwbmg3dU9mU2RUaWR5R09kbE9TNWhyS2NIZGxkWTc2Mk4KNnJmSHFCd2hEY05mcmZPU21TRG5Ld2VKU0o4bmRkOWZiTVo0WlZzVTVaUENNdDI0ekU3MjVYZU9CZjdpVm5SZApRQzllMHJyRlk4eUVneG8zckpuWnRVbTBDMllrcWNCSXRITFpIUjlDNmpsTmxxZ0tFT2VnN3dFbytWVFA4T2QxCjdHMWJyRFdoQ1kwUXRGZUxQSnFlM2NIVCtXZ1R6RnVOYXNPREwxcWJ6STZwZysxcS9LVEVoNXB4UDRoUEhhSHEKMGErTE03V2Q4RFVza2xQWUFHZTRmeHFEUXJ3djVIK040YWpoU3dJREFRQUJBb0lCQVFDY3RKOGNmdXBpREh3RwoxaDFSalNiaTNYMWlPbEIxSmx6WVVsUVhvYmIrSHI3THNnb2wxL1BRU1VJekZISVJQTWtpN3V6NVNQc05WS2RrCmVhYVlUK291S1ZmWW1EMWVRajk3K3prUHRlRlFWRm53RzNxcnI3bW1WK0tjYkIwaUtHdXFSY0NsTHlyMWgxU1EKYU5tWmVyZ3UzZnVXRDhVbWcxK2VzV3I5U1o3bm1zZlhPMTNBb3J3VE1KZkllcy9rK090NERJTVBpMXk2aktDNAp1bTZGVm9Wa04vNjQ4dUxaUHZjbHlCRTVoQnlQT3ZxOGszZTB4YnRQQldud2hhWWVhTlU1VnJIWXhkSTQvdkMyCkt1VXROcXJUOHkveGZ1Y3lNNmZzVDZtZk9PbFNjSzRTTENUS0JQTWdDSzViR0IrVFhKb3MrY3lpcURZYlVjcmEKS2FQVnQrbVpBb0dCQU5tVCtnZElRVnZyN2FFb3g0eUFkZTVhV3lpMCttd2JoOVpjeE92OXIrZ0grQVF1WDZ4WQpDcmE2TnJlUHRlWThXUjJMTlVkQTdPdWtXNElKNmVYQUlFeGE0dHFNUlhzL0hHQ0FrU2pBN2VkQjQzZTVSM093CjFlTmFjQ2RSU1dsYSt1eGF3SWZCR2hXc0p5WXFZblVIQmEwcnozVVVsdGN2TDh4aWxXRXFjck05QW9HQkFQcXAKRVB6UHhtUkk0RXZndW5DdzU1RTJ6UGttQURwOGN4WTRmUFRsYkkyVkd5b0VvYlR2V2g2ejlseUZRd1M4c3VtSgpyd2w2elg4eTFnMnlrYnJ3MHB3TlpZWHgzaXVLenhlM083eThnYmhRV3k2YmloTldqQkVMNlhDakU1K1lFa21KCm1IeE43UThwRUUrb3NBSitOdFNXQVpPYlFuWVpLdGlFalNtWFd3OG5Bb0dCQU5aZ3JBRVMyM3N6cWU2Wm5JdysKWW5SWVNPdUI3aUFvdkIybFNFM3hwcW1yZUY2K1Jud3NQMW90Zmc0OUpnL1ZORVVjellFSVlxZ0hTRGFIaUJOYwp5ellRY3VhOVFSU2ZhVmxZTkM2QWNhZmpUcm0vTDd5NDV2WGRQUll3VEhIbk44YzczK21paFE0SGtvZFRTYkZmCkd1TzJmL0V2T1RTS2hNRVAvWGxBZHNWTkFvR0FJYnZuSVY4RklESCtuYmVjMzlXdkZJZi9oZmhyUjNQQU9WbFkKMTh5WWsyVmh4b0hoRVd5MUFEVFFEeHVRTFc4SDFRNUdsRXdHby91L2V4QnhOR3Q4ckt2UTRmbnZJSVVKNGZNegpBdStMdGJaNnp3YjN5aXAzcDBPbkl4V1Bhb2NZenZUSnBORUgrbVpZWDZBZ0wvVzBnMSs4enBTeW1ScEIvZW00CjVjUU02VGtDZ1lBV2pOMllRaERrL3FveU5EU1FveDl2MlZUNFdBNnY1V3ZlMk93aFloY1UrRXgvZ1B0TjQwYkoKaGNFZlFZZjJLRGsxZk5YZm0ydFoyVVBWOU1LTlBwK3lKVE9VclZVR0t1RjR2TFkrL0VRcEdEU2ZKR2JZdGFIcQpzZEFjeXJyRWtKS2VybFRNMVh3SDVvWDFsOWpEVko1ZC9lR3B2WGtPc2psWmVxQjhYWVBHR3c9PQotLS0tLUVORCBSU0EgUFJJVkFURSBLRVktLS0tLQo=
logLevel: INFO
protocol: mtls
runtime: runc
runk:
  namespace: ""
  dnsServers: []
  kubeconfigFile: ""
capacity:
  cpu: ""
  memory: ""
  pods: ""
  storage: ""
reservedResources:
  cpu: ""
  memory: ""
image:
  pullPolicy: ""
  defaultRegistry: ""
  registries: []
datastoreEndpoint: ""

#bob
mode: autonomy
domainID: bob
domainKeyData: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFb3dJQkFBS0NBUUVBM0xHUlpEQVcxSlJtd2VwZjBKWWtQSitSWWhRTm9LSzNESmNHR0dWTFUxZHNkM3U4ClNsbTIyby94R0Q3RGJlVThwWkpkdGNlTkdVWjlSamlvSTVkL2tGcVlqS25kTnBWbWRHelBtRFhoMmJhcXBkQVUKaldFOXMzWTg0N1BqYXZZWDBhMnZyTk5WQUNVVUY4bHhKazAvQzRCQ2t6d3pvUTJ5c29iZkFzTGRiL0l4cUwxSApCL1B4YlNoNGthYjhPTEFkY0tnUElBWVkxd3NKZ1JRVkhsWVIvZTZHelZxYTIxRmRrNTRDNXpFb2FRSmxINVpVCnBZN280aTB4MXlwTlkyRXdvUC9UU3RjTXlBeHA3cXRMaUlZZzh3UWlFdy9OdWFBY3luZ0RKZUgvTnZybkVVWHAKdnFwVTJKRGRnM2J3Q0ZJSHJXNjZteG1QbWVSb3pHa3V1eFFxa1FJREFRQUJBb0lCQUFMaDd0dzRKSlF5UEFWYwpZbFAvSWdvSXE0VjBiWmtqaHZDTEtIRTVJWHE5TVpWOThEK29YRk1PZmorcTBqS2xROTJGdytPVDc2dmMxOVlLCjkyYy9tMUx2Vy82NldVRlZRamxURW9NU2NSaSs0Z3U0WkF4VXNOR2ZRYnhYcFNqSWZoY25CWnhrUmoveVBBanoKZ0o3WGMzTmJBWU9hemJIVTAvaXcra2kyOHQxN2JpRzgxRGVCSHNkaW13NFFDVWt3ek44WElHd09lVEZ5QjlwSwowV1ZNRjlLM05sVEM0clhGL3BRdlpGRmlYeFBDeUx3bXU3blA0dWVtSHJTUUc5aVZvbWtSYWI5YWZhbUJaL2syCkRBeXRQdTZNeExKdWhHa3g1Q1d6eUhRS3U1bm1UZ29nV2paNFZNYTBSNjhnaXZFSUtiMjJjdDhDT0J5Tnh2NU4KRGJrMDVnRUNnWUVBM3RVSzhYSUxsYXBPUzJDaGFuV2pFcmxRT0VhOFBEQUgzT2lJVHBiT0ZhNTJBYXY3dzJJbwpHRVdFV2l1cTRncmp5ZXVaNWhJYS9YcmZNNmFKbVlMeEg1elpxZjV0QmVSMUx4NUZQNDNJRUQxRnIxSVhWLzkxCk5JcnI0aXZBVjk5SWk0NVdIZEkyeUk5WUZoQnBNYW8vbDBVUHovOSt1UmxLNDUwVTlmSFFuNmtDZ1lFQS9Zc0oKRTdFRmxRZlNST3EyejVuQkQ0aXdnaWlTZGN4VUlidWltUWw2ZmR6ckRGRGJhVGhhTmt4UWlXWEpJS1ZXcnFsZwo5R1RxSXBJUVI0NUtjeEk4N1lRenJLaDBmVkhzNnhQN005WjNtbGRVTk1tVVVlenJHcHpnYXpLWll2ZHMvaEw0CkVhaXpMTitpaURtUXRyZ1BONTlYa0hQTldCRElUMVJtS0k4UEpLa0NnWUVBeDg2R1VudXRzWlVWUVhlekpXKzQKT3RqWitxeEtxemx5UTM1cWd2V3NjenFOYS9CWC94bHIxRis1VHRWckUrY3AyK3diZ25abnB6VGZJVVJLaTlFaQovdks1SmpvU2JqOHRhSU9mR2w2NnJ2MFNHQ1BtOUt3RzM0ZFYvZWEzUU5QaEMrb2tnL2J6MHFEZUhtSzJ3S2JsCkFISVh2SzFmWndBcjY2NzFsWmN3TjRrQ2dZQVFuN0licVdxdFI5TUFrOGNpdTNrT0ZLOUdDWFQ0NWtuSjRHeWIKemlSSzVsWSsrM28zWHV1RFRlT2w3cGVPWFdqZWtOcDdpN1pTUi9OclRhZ1IvV3NqUTV6RHdGUEs5N2twL0tobQowTFFNMlpiNjB4QzNnbW96MTM5Ylovam9wVUp2TWowem96VUVSekYzN3haTzlLaUN4QjdRcU5jWTVCak9Jc0dECi9VVkg2UUtCZ0hSdUtUeGhFZ1d4REh0bjhjQXJzTk1IVzQ3Tk9sbFlqVzZOS2g2b2hSOHBERmNYVTVXTmc0UXgKN2l1bUh3MExwZ2VmWnNXRm9CMmRMc2Z4UW9LMjRnZXI0WWdnUWVWNkhNVndrZ2tjMitTUlQxRDFnOUFBVmNhagpuS0xIb3h4MU5oY0UvWWQxeW1OT0UxdHd2Z1hSc2NUM3NSb0F3QWkydjlTYzhOU1pBRXFLCi0tLS0tRU5EIFJTQSBQUklWQVRFIEtFWS0tLS0tCg==
logLevel: INFO
protocol: mtls
runtime: runc
runk:
  namespace: ""
  dnsServers: []
  kubeconfigFile: ""
capacity:
  cpu: ""
  memory: ""
  pods: ""
  storage: ""
reservedResources:
  cpu: ""
  memory: ""
image:
  pullPolicy: ""
  defaultRegistry: ""
  registries: []
datastoreEndpoint: ""

What happened and what you expected to happen.


An error is reported when running PSI:
[root@root-kuscia-autonomy-alice-ecs-46f7 kuscia]# kubectl get kt owke-sntepnzb-node-35 -n cross-domain -o yaml
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  annotations:
    kuscia.secretflow/initiator: alice
    kuscia.secretflow/interconn-bfia-parties: ""
    kuscia.secretflow/interconn-kuscia-parties: bob
    kuscia.secretflow/interconn-self-parties: alice
    kuscia.secretflow/job-id: owke
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: owke-sntepnzb-node-35
  creationTimestamp: "2024-07-16T10:04:29Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-uid: 39b4cf80-b7ec-45c4-ad83-875a5f3783e2
  name: owke-sntepnzb-node-35
  namespace: cross-domain
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: owke
    uid: 39b4cf80-b7ec-45c4-ad83-875a5f3783e2
  resourceVersion: "128652"
  uid: 25e2b0fc-4001-4dee-957d-d609e6c07e19
spec:
  initiator: alice
  parties:
  - appImageRef: secretflow-image
    domainID: bob
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: alice
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: |-
    {
      "sf_datasource_config": {
        "bob": {
          "id": "default-data-source"
        },
        "alice": {
          "id": "default-data-source"
        }
      },
      "sf_cluster_desc": {
        "parties": ["bob", "alice"],
        "devices": [{
          "name": "spu",
          "type": "spu",
          "parties": ["bob", "alice"],
          "config": "{\"runtime_config\":{\"protocol\":\"SEMI2K\",\"field\":\"FM128\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"
        }, {
          "name": "heu",
          "type": "heu",
          "parties": ["bob", "alice"],
          "config": "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"
        }],
        "ray_fed_config": {
          "cross_silo_comm_backend": "brpc_link"
        }
      },
      "sf_node_eval_param": {
        "domain": "data_prep",
        "name": "psi",
        "version": "0.0.5",
        "attr_paths": ["input/receiver_input/key", "input/sender_input/key", "protocol", "sort_result", "allow_duplicate_keys", "allow_duplicate_keys/no/skip_duplicates_check", "fill_value_int", "ecdh_curve"],
        "attrs": [{
          "is_na": false,
          "ss": ["id2"]
        }, {
          "is_na": false,
          "ss": ["id1"]
        }, {
          "is_na": false,
          "s": "PROTOCOL_RR22"
        }, {
          "b": true,
          "is_na": false
        }, {
          "is_na": false,
          "s": "no"
        }, {
          "is_na": true
        }, {
          "is_na": true
        }, {
          "is_na": false,
          "s": "CURVE_FOURQ"
        }],
        "inputs": [{
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "bob3_83915696.csv",
            "party": "alice",
            "format": "csv"
          }]
        }, {
          "type": "sf.table.individual",
          "meta": {
            "@type": "type.googleapis.com/secretflow.spec.v1.IndividualTable",
            "line_count": "-1"
          },
          "data_refs": [{
            "uri": "alice2_1240545269_2102488231.csv",
            "party": "bob",
            "format": "csv"
          }]
        }],
        "checkpoint_uri": "ckowke-sntepnzb-node-35-output-0"
      },
      "sf_output_uris": ["owke-sntepnzb-node-35-output-0"],
      "sf_input_ids": ["ufsrndbw", "napbzfhk"],
      "sf_output_ids": ["owke-sntepnzb-node-35-output-0"]
    }
status:
  allocatedPorts:
  - domainID: alice
    namedPort:
      owke-sntepnzb-node-35-0/client-server: 23856
      owke-sntepnzb-node-35-0/fed: 23858
      owke-sntepnzb-node-35-0/global: 23859
      owke-sntepnzb-node-35-0/node-manager: 23860
      owke-sntepnzb-node-35-0/object-manager: 23861
      owke-sntepnzb-node-35-0/spu: 23857
  completionTime: "2024-07-16T10:05:14Z"
  conditions:
  - lastTransitionTime: "2024-07-16T10:04:29Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2024-07-16T10:04:31Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2024-07-16T10:05:14Z"
    status: "False"
    type: Success
  lastReconcileTime: "2024-07-16T10:05:14Z"
  message: The remaining no-failed party task counts 1 are less than the threshold
    2 that meets the conditions for task success. pending party[], running party[alice],
    successful party[], failed party[bob]
  partyTaskStatus:
  - domainID: bob
    phase: Failed
  - domainID: alice
    phase: Failed
  phase: Failed
  podStatuses:
    alice/owke-sntepnzb-node-35-0:
      createTime: "2024-07-16T10:04:29Z"
      namespace: alice
      nodeName: root-kuscia-autonomy-alice-ecs-46f7
      podName: owke-sntepnzb-node-35-0
      podPhase: Failed
      readyTime: "2024-07-16T10:04:31Z"
      reason: OOMKilled
      startTime: "2024-07-16T10:04:30Z"
      terminationLog: 'container[secretflow] terminated state reason "OOMKilled",
        message: "WARNING:root:Since the GPL-licensed package `unidecode` is not installed,
        using Python''s `unicodedata` package which yields worse results.\n2024-07-16
        10:04:42,133|alice|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address=''owke-sntepnzb-node-35-0-global.alice.svc'',
        ray_node_manager_port=23860, ray_object_manager_port=23861, ray_client_server_port=23856,
        ray_worker_ports=[], ray_gcs_port=23859)\n2024-07-16 10:04:42,133|alice|INFO|secretflow|entry.py:start_ray:67|
        Trying to start ray head node at owke-sntepnzb-node-35-0-global.alice.svc,
        start command: ray start --head --include-dashboard=false --disable-usage-stats
        --num-cpus=32 --node-ip-address=owke-sntepnzb-node-35-0-global.alice.svc --port=23859
        --node-manager-port=23860 --object-manager-port=23861 --ray-client-server-port=23856\n2024-07-16
        10:04:48,669|alice|CRITICAL|secretflow|entry.py:start_ray:75| Failed to start
        ray head node, start command: ray start --head --include-dashboard=false --disable-usage-stats
        --num-cpus=32 --node-ip-address=owke-sntepnzb-node-35-0-global.alice.svc --port=23859
        --node-manager-port=23860 --object-manager-port=23861 --ray-client-server-port=23856,
        stderr: b\"2024-07-16 10:04:46,672\\tWARNING services.py:1996 -- WARNING:
        The object store is using /tmp instead of /dev/shm because /dev/shm has only
        67108864 bytes available. This will harm performance! You may be able to free
        up space by deleting files in /dev/shm. If you are inside a Docker container,
        you can increase /dev/shm size by passing ''--shm-size=4.93gb'' to ''docker
        run'' (or add it to the run_options list in a Ray cluster config). Make sure
        to set this to more than 30% of available RAM.\\n\"\n2024-07-16 10:04:48,690|alice|CRITICAL|secretflow|entry.py:start_ray:76|
        This process will exit now!\n"'
  serviceStatuses:
    alice/owke-sntepnzb-node-35-0-fed:
      createTime: "2024-07-16T10:04:29Z"
      namespace: alice
      portName: fed
      portNumber: 23858
      readyTime: "2024-07-16T10:04:31Z"
      scope: Cluster
      serviceName: owke-sntepnzb-node-35-0-fed
    alice/owke-sntepnzb-node-35-0-global:
      createTime: "2024-07-16T10:04:29Z"
      namespace: alice
      portName: global
      portNumber: 23859
      readyTime: "2024-07-16T10:04:31Z"
      scope: Domain
      serviceName: owke-sntepnzb-node-35-0-global
    alice/owke-sntepnzb-node-35-0-spu:
      createTime: "2024-07-16T10:04:29Z"
      namespace: alice
      portName: spu
      portNumber: 23857
      readyTime: "2024-07-16T10:04:31Z"
      scope: Cluster
      serviceName: owke-sntepnzb-node-35-0-spu
  startTime: "2024-07-16T10:04:29Z"

[root@ecs-46f7 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.5G     0  7.5G   0% /dev
tmpfs           7.7G     0  7.7G   0% /dev/shm
tmpfs           7.7G   18M  7.7G   1% /run
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/vda2        39G   28G  9.0G  76% /
tmpfs           7.7G   64K  7.7G   1% /tmp
/dev/vda1      1022M  5.8M 1017M   1% /boot/efi
overlay          39G   28G  9.0G  76% /var/lib/docker/overlay2/0c16f58d56413cdd16e5de8a70df48b9b5297eec6bba71f6f28020a63069917c/merged
overlay          39G   28G  9.0G  76% /var/lib/docker/overlay2/71ea8c68e821b82958497bbf9fcd789cf0b6833a8a273e074957551938f579a9/merged
overlay          39G   28G  9.0G  76% /var/lib/docker/overlay2/ac61fe3b656c4bbfae107c21fe109ebe03d66224db3b8d75f31dc0e184c38290/merged
overlay          39G   28G  9.0G  76% /var/lib/docker/overlay2/939b60b3c97410e6bc150475b870cc360303938dc1a68e72aa233505bfc01173/merged
tmpfs           1.6G     0  1.6G   0% /run/user/0
[root@ecs-46f7 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          15760        3264        4515        6217        7980        4169
Swap:             0           0           0
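
Note: the termination log above also shows Ray warning that /dev/shm inside the task environment has only 67108864 bytes (64 MiB) available, even though the host /dev/shm above is 7.7G; 64 MiB happens to be Docker's default --shm-size. A minimal diagnostic sketch, assuming the container name root-kuscia-autonomy-alice from the docker ps output below, to check whether that limit comes from the kuscia container itself:

# compare the host /dev/shm (7.7G above) with what the kuscia container sees
docker exec -it root-kuscia-autonomy-alice df -h /dev/shm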

[root@ecs-46f7 ~]# docker ps
CONTAINER ID   IMAGE                                                                          COMMAND                  CREATED        STATUS        PORTS                                                                                                                                                                            NAMES
fc1e00da810e   secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretpad:0.8.1b0   "/bin/sh -c 'java ${…"   39 hours ago   Up 16 hours   80/tcp, 9001/tcp, 0.0.0.0:8081->8080/tcp, :::8081->8080/tcp                                                                                                                      root-kuscia-autonomy-secretpad-bob
d33a6d3d1637   secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.9.0b0      "tini -- bin/kuscia …"   39 hours ago   Up 16 hours   0.0.0.0:13082->80/tcp, :::13082->80/tcp, 0.0.0.0:10081->1080/tcp, :::10081->1080/tcp, 0.0.0.0:40805->8082/tcp, :::40805->8082/tcp, 0.0.0.0:40804->8083/tcp, :::40804->8083/tcp   root-kuscia-autonomy-bob
80b6d9f65f48   secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretpad:0.8.1b0   "/bin/sh -c 'java ${…"   39 hours ago   Up 15 hours   80/tcp, 9001/tcp, 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp                                                                                                                      root-kuscia-autonomy-secretpad-alice
27965ded9beb   secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.9.0b0      "tini -- bin/kuscia …"   39 hours ago   Up 15 hours   0.0.0.0:13081->80/tcp, :::13081->80/tcp, 0.0.0.0:10080->1080/tcp, :::10080->1080/tcp, 0.0.0.0:40802->8082/tcp, :::40802->8082/tcp, 0.0.0.0:40803->8083/tcp, :::40803->8083/tcp   root-kuscia-autonomy-alice

This machine currently has two nodes deployed on it. Could the task be failing because of insufficient memory?

lanyy9527 commented 4 months ago

Yes. Based on the logs you provided, the alice node failed with OOMKilled, i.e. it ran out of memory. Kuscia needs at least 6 GB of memory; if you are running in a Docker environment, you can use docker update --memory to adjust the memory resources.
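
For reference, a minimal sketch of that adjustment, using the container names from the docker ps output above (the 6g value mirrors the minimum mentioned here; pick a size your host can actually spare):

# raise the memory limit of both kuscia autonomy containers to 6 GB
# --memory-swap must be >= --memory when both are set
docker update --memory=6g --memory-swap=6g root-kuscia-autonomy-alice root-kuscia-autonomy-bob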

ruhengChen commented 4 months ago
(screenshot attached)

Hello, we have already adjusted the Docker memory for the kuscia containers.

lanyy9527 commented 4 months ago

How large is the dataset for your PSI task?

ruhengChen commented 4 months ago

At first I used about a million rows and also suspected the data volume was too large, but later I tried with just a few dozen rows and it still failed.

lanyy9527 commented 4 months ago

free -m shows your system has about 15 GB of memory in total, and I can see that two autonomy-secretpad containers are also running, which already take up more than 50% of system memory, so overall system memory still looks insufficient. What is your scenario here? If you only want to try out Kuscia PSI on a single machine, you can keep just the two autonomy nodes and try again.
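
A minimal sketch of that suggestion, using the secretpad container names from the docker ps output above:

# stop the two secretpad containers to free host memory, keeping only the two kuscia autonomy nodes
docker stop root-kuscia-autonomy-secretpad-alice root-kuscia-autonomy-secretpad-bob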

ruhengChen commented 4 months ago

I stopped both secretpad containers and then ran the example job, but it still failed:
docker exec -it ${USER}-kuscia-autonomy-alice scripts/user/create_example_job.sh

[root@root-kuscia-autonomy-alice-ecs-46f7 kuscia]# kubectl get kj -n cross-domain
NAME                             STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
wilf                             21h         20h              20h                 Failed
xcwj                             20h         20h              20h                 Failed
pyqd                             20h         20h              20h                 Failed
eltr                             18h         18h              18h                 Failed
xzkt                             16h         16h              16h                 Failed
owke                             16h         16h              16h                 Failed
secretflow-task-20240717103430   66s         26s              26s                 Failed

[root@root-kuscia-autonomy-alice-ecs-46f7 kuscia]# kubectl get pods secretflow-task-20240717103430-single-psi-0 -o yaml -n alice

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kuscia.secretflow/config-template-volumes: config-template
    kuscia.secretflow/initiator: alice
    kuscia.secretflow/task-id: secretflow-task-20240717103430-single-psi
    kuscia.secretflow/task-resource: secretflow-task-20240717103430-single-psi-f2238d0630f8
    kuscia.secretflow/task-resource-group: secretflow-task-20240717103430-single-psi
  creationTimestamp: "2024-07-17T02:34:31Z"
  labels:
    kuscia.secretflow/communication-role-client: "true"
    kuscia.secretflow/communication-role-server: "true"
    kuscia.secretflow/controller: kusciatask
    kuscia.secretflow/pod-identity: fa6189fc-6c7a-4826-8c41-797cf89c9417-0
    kuscia.secretflow/pod-role: ""
    kuscia.secretflow/task-resource-uid: 10a8fade-9da0-4d32-b1bf-fcab22828c3d
    kuscia.secretflow/task-uid: fa6189fc-6c7a-4826-8c41-797cf89c9417
  name: secretflow-task-20240717103430-single-psi-0
  namespace: alice
  resourceVersion: "220100"
  uid: ee29d7e2-55ec-45e5-882b-241b5d8a1913
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - -c
    - python -m secretflow.kuscia.entry ./kuscia/task-config.conf
    command:
    - sh
    env:
    - name: KUSCIA_DOMAIN_ID
      value: alice
    - name: TASK_ID
      value: secretflow-task-20240717103430-single-psi
    - name: TASK_CLUSTER_DEFINE
      value: '{"parties":[{"name":"alice", "role":"", "services":[{"portName":"spu",
        "endpoints":["secretflow-task-20240717103430-single-psi-0-spu.alice.svc"]},
        {"portName":"fed", "endpoints":["secretflow-task-20240717103430-single-psi-0-fed.alice.svc"]},
        {"portName":"global", "endpoints":["secretflow-task-20240717103430-single-psi-0-global.alice.svc:27493"]}]},
        {"name":"bob", "role":"", "services":[{"portName":"spu", "endpoints":["secretflow-task-20240717103430-single-psi-0-spu.bob.svc"]},
        {"portName":"fed", "endpoints":["secretflow-task-20240717103430-single-psi-0-fed.bob.svc"]},
        {"portName":"global", "endpoints":["secretflow-task-20240717103430-single-psi-0-global.bob.svc:20002"]}]}],
        "selfPartyIdx":0, "selfEndpointIdx":0}'
    - name: ALLOCATED_PORTS
      value: '{"ports":[{"name":"client-server", "port":27490, "scope":"Local", "protocol":"GRPC"},
        {"name":"spu", "port":27491, "scope":"Cluster", "protocol":"GRPC"}, {"name":"fed",
        "port":27492, "scope":"Cluster", "protocol":"GRPC"}, {"name":"global", "port":27493,
        "scope":"Domain", "protocol":"GRPC"}, {"name":"node-manager", "port":27494,
        "scope":"Local", "protocol":"GRPC"}, {"name":"object-manager", "port":27495,
        "scope":"Local", "protocol":"GRPC"}]}'
    - name: TASK_INPUT_CONFIG
      value: '{"sf_datasource_config":{"alice":{"id":"default-data-source"},"bob":{"id":"default-data-source"}},"sf_cluster_desc":{"parties":["alice","bob"],"devices":[{"name":"spu","type":"spu","parties":["alice","bob"],"config":"{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"},{"name":"heu","type":"heu","parties":["alice","bob"],"config":"{\"mode\":
        \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"}],"ray_fed_config":{"cross_silo_comm_backend":"brpc_link"}},"sf_node_eval_param":{"domain":"data_prep","name":"psi","version":"0.0.5","attr_paths":["protocol","sort_result","allow_duplicate_keys","allow_duplicate_keys/yes/join_type","allow_duplicate_keys/yes/join_type/left_join/left_side","input/receiver_input/key","input/sender_input/key"],"attrs":[{"s":"PROTOCOL_ECDH"},{"b":true},{"s":"yes"},{"s":"left_join"},{"ss":["alice"]},{"ss":["id1"]},{"ss":["id2"]}]},"sf_input_ids":["alice-table","bob-table"],"sf_output_ids":["psi-output"],"sf_output_uris":["psi-output.csv"]}'
    - name: KUSCIA_PORT_CLIENT_SERVER_NUMBER
      value: "27490"
    - name: KUSCIA_PORT_SPU_NUMBER
      value: "27491"
    - name: KUSCIA_PORT_FED_NUMBER
      value: "27492"
    - name: KUSCIA_PORT_GLOBAL_NUMBER
      value: "27493"
    - name: KUSCIA_PORT_NODE_MANAGER_NUMBER
      value: "27494"
    - name: KUSCIA_PORT_OBJECT_MANAGER_NUMBER
      value: "27495"
    image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0
    imagePullPolicy: IfNotPresent
    name: secretflow
    ports:
    - containerPort: 27491
      name: spu
      protocol: TCP
    - containerPort: 27492
      name: fed
      protocol: TCP
    - containerPort: 27493
      name: global
      protocol: TCP
    - containerPort: 27494
      name: node-manager
      protocol: TCP
    - containerPort: 27495
      name: object-manager
      protocol: TCP
    - containerPort: 27490
      name: client-server
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /root/kuscia/task-config.conf
      name: config-template
      subPath: task-config.conf
    workingDir: /root
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: root-kuscia-autonomy-alice-ecs-46f7
  nodeSelector:
    kuscia.secretflow/namespace: alice
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: kuscia-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: kuscia.secretflow/agent
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: secretflow-task-20240717103430-single-psi-configtemplate
    name: config-template
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-07-17T02:34:32Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-07-17T02:34:56Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-07-17T02:34:56Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-07-17T02:34:32Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://098d23cc708c34c559264cad6def2d069277b86bc34c91737dd112dc4b6b81ec
    image: secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/secretflow-lite-anolis8:1.7.0b0
    imageID: sha256:96f7618d2c8e4c923e41451baa72cadbb9bfd1f365f4695e0beb31589b566d19
    lastState: {}
    name: secretflow
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://098d23cc708c34c559264cad6def2d069277b86bc34c91737dd112dc4b6b81ec
        exitCode: 137
        finishedAt: "2024-07-17T02:34:55Z"
        message: |
          WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
          2024-07-17 02:34:44,535|alice|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='secretflow-task-20240717103430-single-psi-0-global.alice.svc', ray_node_manager_port=27494, ray_object_manager_port=27495, ray_client_server_port=27490, ray_worker_ports=[], ray_gcs_port=27493)
          2024-07-17 02:34:44,535|alice|INFO|secretflow|entry.py:start_ray:67| Trying to start ray head node at secretflow-task-20240717103430-single-psi-0-global.alice.svc, start command: ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=secretflow-task-20240717103430-single-psi-0-global.alice.svc --port=27493 --node-manager-port=27494 --object-manager-port=27495 --ray-client-server-port=27490
        reason: OOMKilled
        startedAt: "2024-07-17T02:34:33Z"
  hostIP: 172.18.0.6
  phase: Failed
  startTime: "2024-07-17T02:34:32Z"
ruhengChen commented 4 months ago
[root@ecs-46f7 ~]# docker ps
CONTAINER ID   IMAGE                                                                       COMMAND                  CREATED        STATUS        PORTS                                                                                                                                                                            NAMES
d33a6d3d1637   secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.9.0b0   "tini -- bin/kuscia …"   40 hours ago   Up 17 hours   0.0.0.0:13082->80/tcp, :::13082->80/tcp, 0.0.0.0:10081->1080/tcp, :::10081->1080/tcp, 0.0.0.0:40805->8082/tcp, :::40805->8082/tcp, 0.0.0.0:40804->8083/tcp, :::40804->8083/tcp   root-kuscia-autonomy-bob
27965ded9beb   secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia:0.9.0b0   "tini -- bin/kuscia …"   41 hours ago   Up 17 hours   0.0.0.0:13081->80/tcp, :::13081->80/tcp, 0.0.0.0:10080->1080/tcp, :::10080->1080/tcp, 0.0.0.0:40802->8082/tcp, :::40802->8082/tcp, 0.0.0.0:40803->8083/tcp, :::40803->8083/tcp   root-kuscia-autonomy-alice
[root@ecs-46f7 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          15760        1779       12094          16        1886       11802
Swap:             0           0           0
lanyy9527 commented 4 months ago

I'll find a machine and try it on my side. If you have resources available, you could deploy one node on each of two machines for testing. In general, when deploying multiple nodes on a single machine, we recommend more than 16 GB of system memory.

lanyy9527 commented 4 months ago
CONTAINER ID   NAME                         CPU %     MEM USAGE / LIMIT   MEM %     NET I/O           BLOCK I/O         PIDS
e7d0f30ad94b   root-kuscia-autonomy-bob     3.73%     943.6MiB / 6GiB     15.36%    1.99MB / 1.92MB   217MB / 1.39GB    107
12b6b609da9d   root-kuscia-autonomy-alice   2.87%     975.2MiB / 6GiB     15.87%    1.92MB / 1.99MB   34.7MB / 1.92GB   116
^C
[root@iZbp143l1lire20uffx9t5Z data]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.8G     0  7.8G   0% /dev
tmpfs           7.8G     0  7.8G   0% /dev/shm
tmpfs           7.8G   13M  7.8G   1% /run
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/nvme0n1p2   40G   28G  9.5G  75% /
/dev/nvme0n1p1  191M  9.8M  182M   6% /boot/efi
tmpfs           1.6G     0  1.6G   0% /run/user/0
overlay          40G   28G  9.5G  75% /var/lib/docker/overlay2/7d786a32340c885cdbedcfe427cb205bdcbb8c265615b2cc8deb7b29f125022d/merged
overlay          40G   28G  9.5G  75% /var/lib/docker/overlay2/cdea1e7cca9eaafbaf643a03f956c5a654eecab3dba6caba5862fc2ae99f2c07/merged
[root@iZbp143l1lire20uffx9t5Z data]# free -m
              total        used        free      shared  buff/cache   available
Mem:          15906        1923       11298          12        2684       11632
Swap:             0           0           0
[root@iZbp143l1lire20uffx9t5Z data]# docker exec -it ${USER}-kuscia-autonomy-alice kubectl get kj -n cross-domain
NAME                             STARTTIME   COMPLETIONTIME   LASTRECONCILETIME   PHASE
secretflow-task-20240717131307   11m         11m              11m                 Succeeded
secretflow-task-20240717132151   3m          2m30s            2m30s               Succeeded

Hello, the test runs normally in this environment. We suggest checking whether any other processes on your system are consuming memory, and you could also try again following the official documentation.
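
Two quick ways to check that, assuming the same Docker-based setup (docker stats is what produced the output at the top of this comment):

# one-shot view of per-container memory usage and limits
docker stats --no-stream
# top memory consumers on the host, sorted by resident set size
ps aux --sort=-rss | head -n 10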

ruhengChen commented 4 months ago

Could it be because we are on an ARM system?

lanyy9527 commented 4 months ago

ARM is currently supported (the test results I posted above were also run on an ARM environment: Linux iZbp143l1lire20uffx9t5Z 4.18.0-348.20.1.el7.aarch64 #1 SMP Wed Apr 13 20:57:50 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux). Try upgrading your Docker version and running again; I am using 26.1.4 here.
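
If upgrading is not immediately possible, restarting the Docker daemon is a quick first step (the next comment reports that this alone resolved the issue). A minimal sketch for a systemd-based host; note that restarting the daemon also restarts the kuscia containers unless live-restore is enabled:

# check the installed Docker version
docker --version
# restart the Docker daemon
sudo systemctl restart docker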

ruhengChen commented 4 months ago

I restarted Docker on my side and it works now. Thank you very much!