secretflow / kuscia

Kuscia (Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0
72 stars 49 forks

In a K8s Kuscia P2P cluster running in runp mode, the split task fails when a job is created through the API #373

Open wangzeyu135798 opened 2 months ago

wangzeyu135798 commented 2 months ago

Issue Type

Running

Search for existing issues similar to yours

Yes

OS Platform and Distribution

centos7

Kuscia Version

kuscia0.8

Deployment

k8s

deployment Version

1.19

App Running type

secretflow

App Running version

1.7

Configuration file used to run kuscia.

1

What happened and what you expected to happen.

# Example: run this inside the container
export CTR_CERTS_ROOT=/home/kuscia/var/certs
curl -k -X POST 'http://localhost:8082/api/v1/job/create' \
 --header 'Content-Type: application/json' \
 --cert ${CTR_CERTS_ROOT}/kusciaapi-server.crt \
 --key ${CTR_CERTS_ROOT}/kusciaapi-server.key \
 --cacert ${CTR_CERTS_ROOT}/ca.crt \
 -d '{
  "job_id": "job-best-effort-linear",
  "initiator": "alice",
  "max_parallelism": 2,
  "tasks": [
    {
      "task_id": "job-psi-1",
      "app_image": "secretflow-image",
      "parties": [
        {
          "domain_id": "alice",
          "role": "partner"
        },
        {
          "domain_id": "bob",
          "role": "partner"
        }
      ],
      "alias": "job-psi-1",
      "dependencies": [],
      "task_input_config": "{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"bob\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"bob\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"data_prep\",\"name\":\"psi\",\"version\":\"0.0.5\",\"attr_paths\":[\"protocol\",\"sort_result\",\"allow_duplicate_keys\",\"allow_duplicate_keys/yes/join_type\",\"allow_duplicate_keys/yes/join_type/left_join/left_side\",\"input/receiver_input/key\",\"input/sender_input/key\"],\"attrs\":[{\"s\":\"PROTOCOL_ECDH\"},{\"b\":true},{\"s\":\"yes\"},{\"s\":\"left_join\"},{\"ss\":[\"alice\"]},{\"ss\":[\"id1\"]},{\"ss\":[\"id2\"]}]},\"sf_input_ids\":[\"alice-table\",\"bob-table\"],\"sf_output_ids\":[\"psi-output-1\"],\"sf_output_uris\":[\"psi-output-1.csv\"]}",
      "priority": 100
    },
    {
      "task_id": "job-split-1",
      "app_image": "secretflow-image",
      "parties": [
        {
          "domain_id": "alice",
          "role": "partner"
        },
        {
          "domain_id": "bob",
          "role": "partner"
        }
      ],
      "alias": "job-split-1",
      "dependencies": [
        "job-psi-1"
      ],
      "task_input_config": "{\"sf_datasource_config\":{\"alice\":{\"id\":\"default-data-source\"},\"bob\":{\"id\":\"default-data-source\"}},\"sf_cluster_desc\":{\"parties\":[\"alice\",\"bob\"],\"devices\":[{\"name\":\"spu\",\"type\":\"spu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"runtime_config\\\":{\\\"protocol\\\":\\\"REF2K\\\",\\\"field\\\":\\\"FM64\\\"},\\\"link_desc\\\":{\\\"connect_retry_times\\\":60,\\\"connect_retry_interval_ms\\\":1000,\\\"brpc_channel_protocol\\\":\\\"http\\\",\\\"brpc_channel_connection_type\\\":\\\"pooled\\\",\\\"recv_timeout_ms\\\":1200000,\\\"http_timeout_ms\\\":1200000}}\"},{\"name\":\"heu\",\"type\":\"heu\",\"parties\":[\"alice\",\"bob\"],\"config\":\"{\\\"mode\\\": \\\"PHEU\\\", \\\"schema\\\": \\\"paillier\\\", \\\"key_size\\\": 2048}\"}],\"ray_fed_config\":{\"cross_silo_comm_backend\":\"brpc_link\"}},\"sf_node_eval_param\":{\"domain\":\"data_prep\",\"name\":\"train_test_split\",\"version\":\"0.0.1\",\"attr_paths\":[\"train_size\",\"test_size\",\"random_state\",\"shuffle\"],\"attrs\":[{\"f\":0.75},{\"f\":0.25},{\"i64\":1234},{\"b\":true}]},\"sf_output_uris\":[\"train-dataset-1.csv\",\"test-dataset-1.csv\"],\"sf_output_ids\":[\"train-dataset-1\",\"test-dataset-1\"],\"sf_input_ids\":[\"psi-output-1\"]}",
      "priority": 100
    }
  ]
}'
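
The hand-escaped `task_input_config` strings in the request above are easy to get wrong, because the SPU and HEU `config` values are JSON strings nested inside another JSON string. A minimal sketch, assuming the same field values as in this report, builds the split task from plain dictionaries and lets `json.dumps` do the escaping. The helper name `build_split_task` is hypothetical and is not part of the Kuscia API.

```python
import json


def build_split_task(parties, input_id, output_ids, output_uris):
    """Build the train_test_split task entry (hypothetical helper; values copied from this report)."""
    spu_config = {
        "runtime_config": {"protocol": "REF2K", "field": "FM64"},
        "link_desc": {
            "connect_retry_times": 60,
            "connect_retry_interval_ms": 1000,
            "brpc_channel_protocol": "http",
            "brpc_channel_connection_type": "pooled",
            "recv_timeout_ms": 1200000,
            "http_timeout_ms": 1200000,
        },
    }
    heu_config = {"mode": "PHEU", "schema": "paillier", "key_size": 2048}
    input_config = {
        "sf_datasource_config": {p: {"id": "default-data-source"} for p in parties},
        "sf_cluster_desc": {
            "parties": parties,
            "devices": [
                # Device configs are themselves JSON strings, so serialize them first.
                {"name": "spu", "type": "spu", "parties": parties,
                 "config": json.dumps(spu_config)},
                {"name": "heu", "type": "heu", "parties": parties,
                 "config": json.dumps(heu_config)},
            ],
            "ray_fed_config": {"cross_silo_comm_backend": "brpc_link"},
        },
        "sf_node_eval_param": {
            "domain": "data_prep",
            "name": "train_test_split",
            "version": "0.0.1",
            "attr_paths": ["train_size", "test_size", "random_state", "shuffle"],
            "attrs": [{"f": 0.75}, {"f": 0.25}, {"i64": 1234}, {"b": True}],
        },
        "sf_output_uris": output_uris,
        "sf_output_ids": output_ids,
        "sf_input_ids": [input_id],
    }
    return {
        "task_id": "job-split-1",
        "app_image": "secretflow-image",
        "parties": [{"domain_id": p, "role": "partner"} for p in parties],
        "alias": "job-split-1",
        "dependencies": ["job-psi-1"],
        # The whole input config is serialized once more into a single string field.
        "task_input_config": json.dumps(input_config),
        "priority": 100,
    }


if __name__ == "__main__":
    task = build_split_task(
        ["alice", "bob"],
        "psi-output-1",
        ["train-dataset-1", "test-dataset-1"],
        ["train-dataset-1.csv", "test-dataset-1.csv"],
    )
    print(json.dumps(task, indent=2))
```

The printed object can be placed in the `tasks` array of the request body. Building it this way only removes escaping mistakes; it does not change what the task does.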

In K8s runp mode, calling the split API directly fails.

Kuscia log output.

apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  annotations:
    kuscia.secretflow/initiator: alice
    kuscia.secretflow/interconn-bfia-parties: ""
    kuscia.secretflow/interconn-kuscia-parties: bob
    kuscia.secretflow/interconn-self-parties: bob
    kuscia.secretflow/job-id: job-best-effort-linear
    kuscia.secretflow/party-master-domain: bob
    kuscia.secretflow/self-cluster-as-initiator: "false"
    kuscia.secretflow/task-alias: job-split-1
  creationTimestamp: "2024-07-10T07:26:37Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-uid: c74e1f88-5b96-4d44-9480-626a251be466
  name: job-split-1
  namespace: cross-domain
  ownerReferences:
  - apiVersion: kuscia.secretflow/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: KusciaJob
    name: job-best-effort-linear
    uid: c74e1f88-5b96-4d44-9480-626a251be466
  resourceVersion: "267961"
  uid: 7a0a7ddd-524c-4061-8f5b-d67a97ad7f3d
spec:
  initiator: alice
  parties:
  - appImageRef: secretflow-image
    domainID: alice
    role: partner
    template:
      spec: {}
  - appImageRef: secretflow-image
    domainID: bob
    role: partner
    template:
      spec: {}
  scheduleConfig: {}
  taskInputConfig: '{"sf_datasource_config":{"alice":{"id":"default-data-source"},"bob":{"id":"default-data-source"}},"sf_cluster_desc":{"parties":["alice","bob"],"devices":[{"name":"spu","type":"spu","parties":["alice","bob"],"config":"{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}"},{"name":"heu","type":"heu","parties":["alice","bob"],"config":"{\"mode\":
    \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}"}],"ray_fed_config":{"cross_silo_comm_backend":"brpc_link"}},"sf_node_eval_param":{"domain":"data_prep","name":"train_test_split","version":"0.0.1","attr_paths":["train_size","test_size","random_state","shuffle"],"attrs":[{"f":0.75},{"f":0.25},{"i64":1234},{"b":true}]},"sf_output_uris":["train-dataset-1.csv","test-dataset-1.csv"],"sf_output_ids":["train-dataset-1","test-dataset-1"],"sf_input_ids":["psi-output-1"]}'
status:
  allocatedPorts:
  - domainID: alice
    namedPort:
      job-split-1-partner-0/client-server: 31317
      job-split-1-partner-0/fed: 31313
      job-split-1-partner-0/global: 31314
      job-split-1-partner-0/node-manager: 31315
      job-split-1-partner-0/object-manager: 31316
      job-split-1-partner-0/spu: 31318
    role: partner
  - domainID: bob
    namedPort:
      job-split-1-partner-0/client-server: 21793
      job-split-1-partner-0/fed: 21795
      job-split-1-partner-0/global: 21796
      job-split-1-partner-0/node-manager: 21797
      job-split-1-partner-0/object-manager: 21792
      job-split-1-partner-0/spu: 21794
    role: partner
  completionTime: "2024-07-10T07:26:56Z"
  conditions:
  - lastTransitionTime: "2024-07-10T07:26:37Z"
    status: "True"
    type: ResourceCreated
  - lastTransitionTime: "2024-07-10T07:26:40Z"
    status: "True"
    type: Running
  - lastTransitionTime: "2024-07-10T07:26:56Z"
    status: "False"
    type: Success
  lastReconcileTime: "2024-07-10T07:26:56Z"
  message: The remaining no-failed party task counts 1 are less than the threshold
    2 that meets the conditions for task success. pending party[], running party[alice-partner],
    successful party[], failed party[bob-partner]
  partyTaskStatus:
  - domainID: alice
    phase: Failed
    role: partner
  - domainID: bob
    phase: Failed
    role: partner
  phase: Failed
  podStatuses:
    bob/job-split-1-partner-0:
      createTime: "2024-07-10T07:26:37Z"
      namespace: bob
      nodeName: kuscia-autonomy-bob-69b9749d87-dpcql
      podName: job-split-1-partner-0
      podPhase: Failed
      readyTime: "2024-07-10T07:26:40Z"
      reason: Error
      startTime: "2024-07-10T07:26:40Z"
      terminationLog: 'container[secretflow] terminated state reason "Error", message:
        "... Ignore 12450 characters at the beginning ...\n)\x1b[0m 2024-07-10 15:26:48.319
        INFO link.py:38 [bob] -- [Anonymous_job] brpc options: {''proxy_max_restarts'':
        3, ''timeout_in_ms'': 300000, ''recv_timeout_ms'': 604800000, ''connect_retry_times'':
        3600, ''connect_retry_interval_ms'': 1000, ''brpc_channel_protocol'': ''http'',
        ''brpc_channel_connection_type'': ''pooled'', ''exit_on_sending_failure'':
        True}\n\x1b[36m(SenderReceiverProxyActor pid=22634)\x1b[0m I0710 15:26:48.328158
        22634 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl]
        is serving on port=21795.\n\x1b[36m(SenderReceiverProxyActor pid=22634)\x1b[0m
        W0710 15:26:48.328190 22634 external/com_github_brpc_brpc/src/brpc/server.cpp:1187]
        Builtin services are disabled according to ServerOptions.has_builtin_services\n\x1b[36m(SenderReceiverProxyActor
        pid=22634)\x1b[0m I0710 15:26:49.969624 22690 external/com_github_brpc_brpc/src/brpc/span.cpp:506]
        Opened ./rpc_data/rpcz/20240710.152649.22634/id.db and ./rpc_data/rpcz/20240710.152649.22634/time.db\n2024-07-10
        15:26:52.351 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create
        receiver proxy actor.\n2024-07-10 15:26:52.351 INFO barriers.py:520 [bob]
        -- [Anonymous_job] Try ping [''alice''] at 0 attemp, up to 3600 attemps.\n\x1b[36m(_run
        pid=22355)\x1b[0m WARNING:root:Since the GPL-licensed package `unidecode`
        is not installed, using Python''s `unicodedata` package which yields worse
        results.\n\x1b[33m(raylet)\x1b[0m [2024-07-10 15:26:47,745 I 22634 22634]
        logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL
        to -1\n2024-07-10 15:26:54.246 ERROR component.py:1129 [bob] -- [Anonymous_job]
        eval on domain: \"data_prep\"\nname: \"train_test_split\"\nversion: \"0.0.1\"\nattr_paths:
        \"train_size\"\nattr_paths: \"test_size\"\nattr_paths: \"random_state\"\nattr_paths:
        \"shuffle\"\nattrs {\n  f: 0.75\n}\nattrs {\n  f: 0.25\n}\nattrs {\n  i64:
        1234\n}\nattrs {\n  b: true\n}\ninputs {\n  name: \"psi-output-1.csv\"\n  type:
        \"sf.table.vertical_table\"\n  system_info {\n  }\n  meta {\n    type_url:
        \"type.googleapis.com/secretflow.spec.v1.VerticalTable\"\n    value: \"\\n\\335\\003\\n\\003id1\\022\\003age\\022\\teducation\\022\\007default\\022\\007balance\\022\\007housing\\022\\004loan\\022\\003day\\022\\010duration\\022\\010campaign\\022\\005pdays\\022\\010previous\\022\\017job_blue-collar\\022\\020job_entrepreneur\\022\\rjob_housemaid\\022\\016job_management\\022\\013job_retired\\022\\021job_self-employed\\022\\014job_services\\022\\013job_student\\022\\016job_technician\\022\\016job_unemployed\\022\\020marital_divorced\\022\\017marital_married\\022\\016marital_single\\\"\\003str*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float\\n\\227\\003\\n\\003id2\\022\\020contact_cellular\\022\\021contact_telephone\\022\\017contact_unknown\\022\\tmonth_apr\\022\\tmonth_aug\\022\\tmonth_dec\\022\\tmonth_feb\\022\\tmonth_jan\\022\\tmonth_jul\\022\\tmonth_jun\\022\\tmonth_mar\\022\\tmonth_may\\022\\tmonth_nov\\022\\tmonth_oct\\022\\tmonth_sep\\022\\020poutcome_failure\\022\\016poutcome_other\\022\\020poutcome_success\\022\\020poutcome_unknown\\022\\001y\\\"\\003str*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\005float*\\003int\\020\\244M\"\n  }\n  data_refs
        {\n    uri: \"psi-output-1.csv\"\n    party: \"alice\"\n    format: \"csv\"\n  }\n  data_refs
        {\n    uri: \"psi-output-1.csv\"\n    party: \"bob\"\n    format: \"csv\"\n  }\n}\noutput_uris:
        \"train-dataset-1.csv\"\noutput_uris: \"test-dataset-1.csv\"\n failed, error
        <\x1b[36mray::_run()\x1b[39m (pid=22355, ip=job-split-1-partner-0-global.bob.svc)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py\",
        line 156, in _run\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 382, in <lambda>\n    lambda uri=parties_path_format[p].uri: ctx.comp_storage.get_file_meta(\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py\",
        line 49, in get_file_meta\n    return impl.get_file_meta(remote_fn)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py\",
        line 208, in get_file_meta\n    assert os.path.exists(full_remote_fn)\nAssertionError>\n2024-07-10
        15:26:54.246 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly...\n2024-07-10
        15:26:54.246 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending.\n2024-07-10
        15:26:54.247 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message
        polling thread[DataSendingQueueThread] to exit.\n2024-07-10 15:26:54.248 INFO
        message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread]
        to exit.\n2024-07-10 15:26:54.248 INFO api.py:384 [bob] -- [Anonymous_job]
        Shutdowned rayfed.\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.10/runpy.py\",
        line 196, in _run_module_as_main\n    return _run_code(code, main_globals,
        None,\n  File \"/usr/local/lib/python3.10/runpy.py\", line 86, in _run_code\n    exec(code,
        run_globals)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\",
        line 547, in <module>\n    main()\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1157, in __call__\n    return self.main(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1078, in main\n    rv = self.invoke(ctx)\n  File \"/usr/local/lib/python3.10/site-packages/click/core.py\",
        line 1434, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File
        \"/usr/local/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n    return
        __callback(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py\",
        line 527, in main\n    res = comp_eval(sf_node_eval_param, storage_config,
        sf_cluster_config)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py\",
        line 166, in comp_eval\n    res = comp.eval(\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\",
        line 1131, in eval\n    raise e from None\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/component.py\",
        line 1126, in eval\n    ret = self.__eval_callback(ctx=ctx, **kwargs)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/train_test_split.py\",
        line 103, in train_test_split_eval_fn\n    input_df = load_table(\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 390, in load_table\n    file_metas = reveal(file_metas)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py\",
        line 162, in reveal\n    all_object = sfd.get(all_object_refs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py\",
        line 156, in get\n    return fed.get(object_refs)\n  File \"/usr/local/lib/python3.10/site-packages/fed/api.py\",
        line 621, in get\n    values = ray.get(ray_refs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py\",
        line 22, in auto_init_wrapper\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py\",
        line 103, in wrapper\n    return func(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/ray/_private/worker.py\",
        line 2624, in get\n    raise value.as_instanceof_cause()\nray.exceptions.RayTaskError(AssertionError):
        \x1b[36mray::_run()\x1b[39m (pid=22355, ip=job-split-1-partner-0-global.bob.svc)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py\",
        line 156, in _run\n    return fn(*args, **kwargs)\n  File \"/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py\",
        line 382, in <lambda>\n    lambda uri=parties_path_format[p].uri: ctx.comp_storage.get_file_meta(\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py\",
        line 49, in get_file_meta\n    return impl.get_file_meta(remote_fn)\n  File
        \"/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py\",
        line 208, in get_file_meta\n    assert os.path.exists(full_remote_fn)\nAssertionError\n"'
  serviceStatuses:
    bob/job-split-1-partner-0-fed:
      createTime: "2024-07-10T07:26:37Z"
      namespace: bob
      portName: fed
      portNumber: 21795
      readyTime: "2024-07-10T07:26:40Z"
      scope: Cluster
      serviceName: job-split-1-partner-0-fed
    bob/job-split-1-partner-0-global:
      createTime: "2024-07-10T07:26:37Z"
      namespace: bob
      portName: global
      portNumber: 21796
      readyTime: "2024-07-10T07:26:40Z"
      scope: Domain
      serviceName: job-split-1-partner-0-global
    bob/job-split-1-partner-0-spu:
      createTime: "2024-07-10T07:26:37Z"
      namespace: bob
      portName: spu
      portNumber: 21794
      readyTime: "2024-07-10T07:26:40Z"
      scope: Cluster
      serviceName: job-split-1-partner-0-spu
  startTime: "2024-07-10T07:26:37Z"
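
The `terminationLog` field above is hard to read because it keeps the ANSI colour codes and newlines as literal `\x1b[36m` and `\n` text. A minimal sketch, using only the Python standard library, strips the colour codes, splits the message back into lines, and picks out the lines that name an exception. The function names are hypothetical helpers, and the sketch assumes the message has already been extracted from the YAML as a string.

```python
import re

# Colour codes, either as a real ESC byte or as the literal text "\x1b".
ANSI_RE = re.compile(r"(?:\x1b|\\x1b)\[[0-9;]*m")
# Real newlines, or a literal "\n" that is not preceded by another backslash.
NEWLINE_RE = re.compile(r"\n|(?<!\\)\\n")
# A line starting with a dotted name that ends in Error or Exception.
EXC_RE = re.compile(r"^[\w.]*(?:Error|Exception)\b")


def clean_termination_log(raw):
    """Return the non-empty lines of a termination message without colour codes."""
    text = ANSI_RE.sub("", raw)
    return [line for line in NEWLINE_RE.split(text) if line.strip()]


def exception_lines(raw):
    """Return only the lines that begin with an exception class name."""
    return [line for line in clean_termination_log(raw) if EXC_RE.match(line)]


if __name__ == "__main__":
    # Shortened sample in the same escaped form as the terminationLog above.
    sample = (
        r"\x1b[36m(_run pid=22355)\x1b[0m WARNING:root:note"
        r"\n    assert os.path.exists(full_remote_fn)\nAssertionError\n"
    )
    for line in exception_lines(sample):
        print(line)
```

Applied to the full message in this report, the matching lines would be the `AssertionError` entries raised from `storage_impl.py`, which is where the task on bob stops.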

wangzeyu135798 commented 2 months ago

[image attachment: ad1a304f082e2facfa83626043d1795]

zimu-yuxi commented 2 months ago

Could you share the error output from the other party?

wangzeyu135798 commented 2 months ago

The other party reports no error. Its log is as follows:

[root@kuscia-autonomy-alice-56db7f7ffc-9gzsl logs]# kubectl get kt job-split-1 -oyaml -n cross-domain
apiVersion: kuscia.secretflow/v1alpha1
kind: KusciaTask
metadata:
  annotations:
    kuscia.secretflow/initiator: alice
    kuscia.secretflow/interconn-bfia-parties: ""
    kuscia.secretflow/interconn-kuscia-parties: bob
    kuscia.secretflow/interconn-self-parties: alice
    kuscia.secretflow/job-id: job-best-effort-linear
    kuscia.secretflow/self-cluster-as-initiator: "true"
    kuscia.secretflow/task-alias: job-split-1
  creationTimestamp: "2024-07-10T07:26:37Z"
  generation: 1
  labels:
    kuscia.secretflow/controller: kuscia-job
    kuscia.secretflow/job-uid: bdf0116a-e7f5-4673-981a-8281857c059a
  name: job-split-1
  namespace: cross-domain
  ownerReferences:

wangzeyu135798 commented 2 months ago

WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
2024-07-10 15:26:43,016|bob|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='job-split-1-partner-0-global.bob.svc', ray_node_manager_port=21797, ray_object_manager_port=21792, ray_client_server_port=21793, ray_worker_ports=[], ray_gcs_port=21796)
2024-07-10 15:26:43,016|bob|INFO|secretflow|entry.py:start_ray:63| Trying to start ray head node at job-split-1-partner-0-global.bob.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=8 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=job-split-1-partner-0-global.bob.svc --port=21796 --node-manager-port=21797 --object-manager-port=21792 --ray-client-server-port=21793
2024-07-10 15:26:46,577|bob|INFO|secretflow|entry.py:start_ray:80| 2024-07-10 15:26:43,625 INFO usage_lib.py:423 -- Usage stats collection is disabled.
2024-07-10 15:26:43,626 INFO scripts.py:744 -- Local node IP: job-split-1-partner-0-global.bob.svc
2024-07-10 15:26:46,415 SUCC scripts.py:781 -- --------------------
2024-07-10 15:26:46,415 SUCC scripts.py:782 -- Ray runtime started.
2024-07-10 15:26:46,415 SUCC scripts.py:783 -- --------------------
2024-07-10 15:26:46,415 INFO scripts.py:785 -- Next steps
2024-07-10 15:26:46,415 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2024-07-10 15:26:46,415 INFO scripts.py:791 -- ray start --address='job-split-1-partner-0-global.bob.svc:21796'
2024-07-10 15:26:46,416 INFO scripts.py:800 -- To connect to this Ray cluster:
2024-07-10 15:26:46,416 INFO scripts.py:802 -- import ray
2024-07-10 15:26:46,416 INFO scripts.py:803 -- ray.init(_node_ip_address='job-split-1-partner-0-global.bob.svc')
2024-07-10 15:26:46,416 INFO scripts.py:834 -- To terminate the Ray runtime, run
2024-07-10 15:26:46,416 INFO scripts.py:835 -- ray stop
2024-07-10 15:26:46,416 INFO scripts.py:838 -- To view the status of the cluster, use
2024-07-10 15:26:46,416 INFO scripts.py:839 -- ray status

2024-07-10 15:26:46,577|bob|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at job-split-1-partner-0-global.bob.svc.
2024-07-10 15:26:46,578|bob|INFO|secretflow|entry.py:main:510| datasource.access_directly True sf_node_eval_param { "domain": "data_prep", "name": "train_test_split", "version": "0.0.1", "attrPaths": [ "train_size", "test_size", "random_state", "shuffle" ], "attrs": [ { "f": 0.75 }, { "f": 0.25 }, { "i64": "1234" }, { "b": true } ] }
2024-07-10 15:26:46,585|bob|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id psi-output-1 to ........... name: "psi-output-1.csv" type: "sf.table.vertical_table" system_info { } meta { type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable" value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single\"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y\"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\003int\020\244M" } data_refs { uri: "psi-output-1.csv" party: "alice" format: "csv" } data_refs { uri: "psi-output-1.csv" party: "bob" format: "csv" }

.... 2024-07-10 15:26:46,585|bob|WARNING|secretflow|entry.py:comp_eval:159|

Secretflow 1.6.0b0 Build time (May 21 2024, 06:18:47) with commit id: ba76e1fe43cf3daa0c91423a660f318810c88030

2024-07-10 15:26:46,585|bob|WARNING|secretflow|entry.py:comp_eval:160|

param

domain: "data_prep" name: "train_test_split" version: "0.0.1" attr_paths: "train_size" attr_paths: "test_size" attr_paths: "random_state" attr_paths: "shuffle" attrs { f: 0.75 } attrs { f: 0.25 } attrs { i64: 1234 } attrs { b: true } inputs { name: "psi-output-1.csv" type: "sf.table.vertical_table" system_info { } meta { type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable" value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single\"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y\"\003str*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\005float*\003int\020\244M" } data_refs { uri: "psi-output-1.csv" party: "alice" format: "csv" } data_refs { uri: "psi-output-1.csv" party: "bob" format: "csv" } } output_uris: "train-dataset-1.csv" output_uris: "test-dataset-1.csv"

--

2024-07-10 15:26:46,585|bob|WARNING|secretflow|entry.py:comp_eval:161|

storage_config

type: "local_fs" local_fs { wd: "/home/kuscia/var/storage/data" }
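
The traceback in this report stops at `assert os.path.exists(full_remote_fn)`, and the `storage_config` above points bob at the `local_fs` directory `/home/kuscia/var/storage/data`. A minimal sketch of a check that can be run inside bob's container to see whether the PSI output is actually there. It assumes that `full_remote_fn` is the working directory joined with the input URI; the log does not show that path, so treat this as a guess. `check_local_fs_inputs` is a hypothetical helper name.

```python
import os


def check_local_fs_inputs(wd, uris):
    """Map each input URI to whether a file exists under the local_fs working directory."""
    # Assumption: the storage layer resolves an input as os.path.join(wd, uri).
    return {uri: os.path.exists(os.path.join(wd, uri)) for uri in uris}


if __name__ == "__main__":
    wd = "/home/kuscia/var/storage/data"  # from storage_config in this log
    for uri, found in check_local_fs_inputs(wd, ["psi-output-1.csv"]).items():
        status = "found" if found else "MISSING"
        print(f"{os.path.join(wd, uri)}: {status}")
```

If the file is reported missing on bob while the `job-psi-1` task succeeded, it may be worth checking where that task wrote its output and whether the directory is shared or persisted across the pods involved.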

--

2024-07-10 15:26:46,585|bob|WARNING|secretflow|entry.py:comp_eval:162|

cluster_config

desc { parties: "alice" parties: "bob" devices { name: "spu" type: "spu" parties: "alice" parties: "bob" config: "{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}" } devices { name: "heu" type: "heu" parties: "alice" parties: "bob" config: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}" } ray_fed_config { cross_silo_comm_backend: "brpc_link" } } public_config { ray_fed_config { parties: "alice" parties: "bob" addresses: "job-split-1-partner-0-fed.alice.svc:80" addresses: "0.0.0.0:21795" } spu_configs { name: "spu" parties: "alice" parties: "bob" addresses: "http://job-split-1-partner-0-spu.alice.svc:80" addresses: "0.0.0.0:21794" } } private_config { self_party: "bob" ray_head_addr: "job-split-1-partner-0-global.bob.svc:21796" }

--

2024-07-10 15:26:46,586|bob|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment.
2024-07-10 15:26:46,586 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: job-split-1-partner-0-global.bob.svc:21796...
2024-07-10 15:26:46,593|bob|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140181644377488 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/node_ip_address.json.lock
2024-07-10 15:26:46,593|bob|DEBUG|secretflow|_api.py:acquire:297| Lock 140181644377488 acquired on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/node_ip_address.json.lock
2024-07-10 15:26:46,593|bob|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140181644377488 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/node_ip_address.json.lock
2024-07-10 15:26:46,593|bob|DEBUG|secretflow|_api.py:release:330| Lock 140181644377488 released on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/node_ip_address.json.lock
2024-07-10 15:26:46,595|bob|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140181644377536 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:acquire:297| Lock 140181644377536 acquired on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140181644377536 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:330| Lock 140181644377536 released on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140181644377440 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:acquire:297| Lock 140181644377440 acquired on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140181644377440 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:330| Lock 140181644377440 released on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140181644377632 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:acquire:297| Lock 140181644377632 acquired on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140181644377632 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,596|bob|DEBUG|secretflow|_api.py:release:330| Lock 140181644377632 released on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,597|bob|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140181644377488 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,597|bob|DEBUG|secretflow|_api.py:acquire:297| Lock 140181644377488 acquired on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,597|bob|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140181644377488 on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,597|bob|DEBUG|secretflow|_api.py:release:330| Lock 140181644377488 released on /tmp/ray/session_2024-07-10_15-26-43_626562_22147/ports_by_node.json.lock
2024-07-10 15:26:46,597 INFO worker.py:1724 -- Connected to Ray cluster.
2024-07-10 15:26:47.306 INFO api.py:233 [bob] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'alice': 'http://job-split-1-partner-0-fed.alice.svc:80', 'bob': '0.0.0.0:21795'}, 'CURRENT_PARTY_NAME': 'bob', 'TLS_CONFIG': {}}
(raylet) [2024-07-10 15:26:47,275 I 22355 22355] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(SenderReceiverProxyActor pid=22634) 2024-07-10 15:26:48.319 INFO link.py:38 [bob] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=22634) I0710 15:26:48.328158 22634 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=21795.
(SenderReceiverProxyActor pid=22634) W0710 15:26:48.328190 22634 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services
(SenderReceiverProxyActor pid=22634) I0710 15:26:49.969624 22690 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240710.152649.22634/id.db and ./rpc_data/rpcz/20240710.152649.22634/time.db
2024-07-10 15:26:52.351 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-07-10 15:26:52.351 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 0 attemp, up to 3600 attemps.
(_run pid=22355) WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
(raylet) [2024-07-10 15:26:47,745 I 22634 22634] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 2024-07-10 15:26:54.246 ERROR component.py:1129 [bob] -- [Anonymous_job] eval on domain: "data_prep" name: "train_test_split" version: "0.0.1" attr_paths: "train_size" attr_paths: "test_size" attr_paths: "random_state" attr_paths: "shuffle" attrs { f: 0.75 } attrs { f: 0.25 } attrs { i64: 1234 } attrs { b: true } inputs { name: "psi-output-1.csv" type: "sf.table.vertical_table" system_info { } meta { type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable" value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single\"\003str\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y\"\003str\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\003int\020\244M" } data_refs { uri: "psi-output-1.csv" party: "alice" format: "csv" } data_refs { uri: "psi-output-1.csv" party: "bob" format: 
"csv" } } output_uris: "train-dataset-1.csv" output_uris: "test-dataset-1.csv" failed, error <ray::_run() (pid=22355, ip=job-split-1-partner-0-global.bob.svc) File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run return fn(*args, kwargs) File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 382, in lambda uri=parties_path_format[p].uri: ctx.comp_storage.get_file_meta( File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 49, in get_file_meta return impl.get_file_meta(remote_fn) File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 208, in get_file_meta assert os.path.exists(full_remote_fn) AssertionError> 2024-07-10 15:26:54.246 INFO api.py:342 [bob] -- [Anonymous_job] Shutdowning rayfed intendedly... 2024-07-10 15:26:54.246 INFO api.py:356 [bob] -- [Anonymous_job] No wait for data sending. 2024-07-10 15:26:54.247 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit. 2024-07-10 15:26:54.248 INFO message_queue.py:72 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit. 2024-07-10 15:26:54.248 INFO api.py:384 [bob] -- [Anonymous_job] Shutdowned rayfed. 
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 527, in main
    res = comp_eval(sf_node_eval_param, storage_config, sf_cluster_config)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/entry.py", line 166, in comp_eval
    res = comp.eval(
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1131, in eval
    raise e from None
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/component.py", line 1126, in eval
    ret = self.__eval_callback(ctx=ctx, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/preprocessing/data_prep/train_test_split.py", line 103, in train_test_split_eval_fn
    input_df = load_table(
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 390, in load_table
    file_metas = reveal(file_metas)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/driver.py", line 162, in reveal
    all_object = sfd.get(all_object_refs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/distributed/primitive.py", line 156, in get
    return fed.get(object_refs)
  File "/usr/local/lib/python3.10/site-packages/fed/api.py", line 621, in get
    values = ray.get(ray_refs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::_run() (pid=22355, ip=job-split-1-partner-0-global.bob.svc)
  File "/usr/local/lib/python3.10/site-packages/secretflow/device/device/pyu.py", line 156, in _run
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/data_utils.py", line 382, in <lambda>
    uri=parties_path_format[p].uri: ctx.comp_storage.get_file_meta(
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/storage.py", line 49, in get_file_meta
    return impl.get_file_meta(remote_fn)
  File "/usr/local/lib/python3.10/site-packages/secretflow/component/storage/impl/storage_impl.py", line 208, in get_file_meta
    assert os.path.exists(full_remote_fn)
AssertionError

wangzeyu135798 commented 2 months ago

WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results. 2024-07-10 15:26:43,898|alice|INFO|secretflow|entry.py:start_ray:59| ray_conf: RayConfig(ray_node_ip_address='job-split-1-partner-0-global.alice.svc', ray_node_manager_port=25098, ray_object_manager_port=25099, ray_client_server_port=25100, ray_worker_ports=[], ray_gcs_port=25097) 2024-07-10 15:26:43,898|alice|INFO|secretflow|entry.py:start_ray:63| Trying to start ray head node at job-split-1-partner-0-global.alice.svc, start command: RAY_BACKEND_LOG_LEVEL=debug RAY_grpc_enable_http_proxy=true OMP_NUM_THREADS=4 ray start --head --include-dashboard=false --disable-usage-stats --num-cpus=32 --node-ip-address=job-split-1-partner-0-global.alice.svc --port=25097 --node-manager-port=25098 --object-manager-port=25099 --ray-client-server-port=25100 2024-07-10 15:26:47,658|alice|INFO|secretflow|entry.py:start_ray:80| 2024-07-10 15:26:44,592 INFO usage_lib.py:423 -- Usage stats collection is disabled. 2024-07-10 15:26:44,592 INFO scripts.py:744 -- Local node IP: job-split-1-partner-0-global.alice.svc 2024-07-10 15:26:47,518 SUCC scripts.py:781 -- -------------------- 2024-07-10 15:26:47,518 SUCC scripts.py:782 -- Ray runtime started. 
2024-07-10 15:26:47,518 SUCC scripts.py:783 -- -------------------- 2024-07-10 15:26:47,518 INFO scripts.py:785 -- Next steps 2024-07-10 15:26:47,518 INFO scripts.py:788 -- To add another node to this Ray cluster, run 2024-07-10 15:26:47,518 INFO scripts.py:791 -- ray start --address='job-split-1-partner-0-global.alice.svc:25097' 2024-07-10 15:26:47,518 INFO scripts.py:800 -- To connect to this Ray cluster: 2024-07-10 15:26:47,518 INFO scripts.py:802 -- import ray 2024-07-10 15:26:47,519 INFO scripts.py:803 -- ray.init(_node_ip_address='job-split-1-partner-0-global.alice.svc') 2024-07-10 15:26:47,519 INFO scripts.py:834 -- To terminate the Ray runtime, run 2024-07-10 15:26:47,519 INFO scripts.py:835 -- ray stop 2024-07-10 15:26:47,519 INFO scripts.py:838 -- To view the status of the cluster, use 2024-07-10 15:26:47,519 INFO scripts.py:839 -- ray status

2024-07-10 15:26:47,658|alice|INFO|secretflow|entry.py:start_ray:81| Succeeded to start ray head node at job-split-1-partner-0-global.alice.svc. 2024-07-10 15:26:47,659|alice|INFO|secretflow|entry.py:main:510| datasource.access_directly True sf_node_eval_param { "domain": "data_prep", "name": "train_test_split", "version": "0.0.1", "attrPaths": [ "train_size", "test_size", "random_state", "shuffle" ], "attrs": [ { "f": 0.75 }, { "f": 0.25 }, { "i64": "1234" }, { "b": true } ] } 2024-07-10 15:26:47,667|alice|INFO|secretflow|entry.py:domaindata_id_to_dist_data:160| domaindata_id psi-output-1 to ........... name: "psi-output-1.csv" type: "sf.table.vertical_table" system_info { } meta { type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable" value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single\"\003str\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y\"\003str\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005flo
at\003int\020\244M" } data_refs { uri: "psi-output-1.csv" party: "alice" format: "csv" } data_refs { uri: "psi-output-1.csv" party: "bob" format: "csv" }

.... 2024-07-10 15:26:47,667|alice|WARNING|secretflow|entry.py:comp_eval:159|

Secretflow 1.6.0b0 Build time (May 21 2024, 06:18:47) with commit id: ba76e1fe43cf3daa0c91423a660f318810c88030

2024-07-10 15:26:47,667|alice|WARNING|secretflow|entry.py:comp_eval:160|

param

domain: "data_prep" name: "train_test_split" version: "0.0.1" attr_paths: "train_size" attr_paths: "test_size" attr_paths: "random_state" attr_paths: "shuffle" attrs { f: 0.75 } attrs { f: 0.25 } attrs { i64: 1234 } attrs { b: true } inputs { name: "psi-output-1.csv" type: "sf.table.vertical_table" system_info { } meta { type_url: "type.googleapis.com/secretflow.spec.v1.VerticalTable" value: "\n\335\003\n\003id1\022\003age\022\teducation\022\007default\022\007balance\022\007housing\022\004loan\022\003day\022\010duration\022\010campaign\022\005pdays\022\010previous\022\017job_blue-collar\022\020job_entrepreneur\022\rjob_housemaid\022\016job_management\022\013job_retired\022\021job_self-employed\022\014job_services\022\013job_student\022\016job_technician\022\016job_unemployed\022\020marital_divorced\022\017marital_married\022\016marital_single\"\003str\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\n\227\003\n\003id2\022\020contact_cellular\022\021contact_telephone\022\017contact_unknown\022\tmonth_apr\022\tmonth_aug\022\tmonth_dec\022\tmonth_feb\022\tmonth_jan\022\tmonth_jul\022\tmonth_jun\022\tmonth_mar\022\tmonth_may\022\tmonth_nov\022\tmonth_oct\022\tmonth_sep\022\020poutcome_failure\022\016poutcome_other\022\020poutcome_success\022\020poutcome_unknown\022\001y\"\003str\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\005float\003int\020\244M" } data_refs { uri: "psi-output-1.csv" party: "alice" format: "csv" } data_refs { uri: "psi-output-1.csv" party: "bob" format: "csv" } } output_uris: "train-dataset-1.csv" output_uris: "test-dataset-1.csv"

--

2024-07-10 15:26:47,667|alice|WARNING|secretflow|entry.py:comp_eval:161|

storage_config

type: "local_fs" local_fs { wd: "/home/kuscia/var/storage/data" }

--

2024-07-10 15:26:47,668|alice|WARNING|secretflow|entry.py:comp_eval:162|

cluster_config

desc { parties: "alice" parties: "bob" devices { name: "spu" type: "spu" parties: "alice" parties: "bob" config: "{\"runtime_config\":{\"protocol\":\"REF2K\",\"field\":\"FM64\"},\"link_desc\":{\"connect_retry_times\":60,\"connect_retry_interval_ms\":1000,\"brpc_channel_protocol\":\"http\",\"brpc_channel_connection_type\":\"pooled\",\"recv_timeout_ms\":1200000,\"http_timeout_ms\":1200000}}" } devices { name: "heu" type: "heu" parties: "alice" parties: "bob" config: "{\"mode\": \"PHEU\", \"schema\": \"paillier\", \"key_size\": 2048}" } ray_fed_config { cross_silo_comm_backend: "brpc_link" } } public_config { ray_fed_config { parties: "alice" parties: "bob" addresses: "0.0.0.0:25102" addresses: "job-split-1-partner-0-fed.bob.svc:80" } spu_configs { name: "spu" parties: "alice" parties: "bob" addresses: "0.0.0.0:25101" addresses: "http://job-split-1-partner-0-spu.bob.svc:80" } } private_config { self_party: "alice" ray_head_addr: "job-split-1-partner-0-global.alice.svc:25097" }

--

2024-07-10 15:26:47,668|alice|WARNING|secretflow|driver.py:init:442| When connecting to an existing cluster, num_cpus must not be provided. Num_cpus is neglected at this moment. 2024-07-10 15:26:47,669 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: job-split-1-partner-0-global.alice.svc:25097... 2024-07-10 15:26:47,675|alice|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140343807753376 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/node_ip_address.json.lock 2024-07-10 15:26:47,676|alice|DEBUG|secretflow|_api.py:acquire:297| Lock 140343807753376 acquired on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/node_ip_address.json.lock 2024-07-10 15:26:47,676|alice|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140343807753376 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/node_ip_address.json.lock 2024-07-10 15:26:47,676|alice|DEBUG|secretflow|_api.py:release:330| Lock 140343807753376 released on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/node_ip_address.json.lock 2024-07-10 15:26:47,679|alice|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140343807753424 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,679|alice|DEBUG|secretflow|_api.py:acquire:297| Lock 140343807753424 acquired on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,679|alice|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140343807753424 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,680|alice|DEBUG|secretflow|_api.py:release:330| Lock 140343807753424 released on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,680|alice|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140343807753328 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 
15:26:47,680|alice|DEBUG|secretflow|_api.py:acquire:297| Lock 140343807753328 acquired on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140343807753328 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:release:330| Lock 140343807753328 released on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140343807753520 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:acquire:297| Lock 140343807753520 acquired on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140343807753520 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:release:330| Lock 140343807753520 released on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,681|alice|DEBUG|secretflow|_api.py:acquire:294| Attempting to acquire lock 140343807753376 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,682|alice|DEBUG|secretflow|_api.py:acquire:297| Lock 140343807753376 acquired on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,682|alice|DEBUG|secretflow|_api.py:release:327| Attempting to release lock 140343807753376 on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,682|alice|DEBUG|secretflow|_api.py:release:330| Lock 140343807753376 released on /tmp/ray/session_2024-07-10_15-26-44_593387_9540/ports_by_node.json.lock 2024-07-10 15:26:47,682 INFO 
worker.py:1724 -- Connected to Ray cluster. 2024-07-10 15:26:48.575 INFO api.py:233 [alice] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'alice': '0.0.0.0:25102', 'bob': 'http://job-split-1-partner-0-fed.bob.svc:80'}, 'CURRENT_PARTY_NAME': 'alice', 'TLS_CONFIG': {}} (raylet) [2024-07-10 15:26:49,120 I 10119 10119] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 (SenderReceiverProxyActor pid=10119) 2024-07-10 15:26:49.883 INFO link.py:38 [alice] -- [Anonymous_job] brpc options: {'proxy_max_restarts': 3, 'timeout_in_ms': 300000, 'recv_timeout_ms': 604800000, 'connect_retry_times': 3600, 'connect_retry_interval_ms': 1000, 'brpc_channel_protocol': 'http', 'brpc_channel_connection_type': 'pooled', 'exit_on_sending_failure': True} (SenderReceiverProxyActor pid=10119) I0710 15:26:49.891576 10119 external/com_github_brpc_brpc/src/brpc/server.cpp:1181] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=25102. (SenderReceiverProxyActor pid=10119) W0710 15:26:49.891611 10119 external/com_github_brpc_brpc/src/brpc/server.cpp:1187] Builtin services are disabled according to ServerOptions.has_builtin_services 2024-07-10 15:26:52.346 INFO barriers.py:465 [alice] -- [Anonymous_job] Succeeded to create receiver proxy actor. 2024-07-10 15:26:52.347 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 0 attemp, up to 3600 attemps. (SenderReceiverProxyActor pid=10119) I0710 15:26:52.410292 10214 external/com_github_brpc_brpc/src/brpc/span.cpp:506] Opened ./rpc_data/rpcz/20240710.152652.10119/id.db and ./rpc_data/rpcz/20240710.152652.10119/time.db (_run pid=9847) WARNING:root:Since the GPL-licensed package unidecode is not installed, using Python's unicodedata package which yields worse results.

aokaokd commented 2 months ago

Go into bob's container and check whether the file psi-output-1.csv was actually generated.
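
A rough way to run that check, as a sketch rather than the exact logic SecretFlow uses: the traceback fails at `assert os.path.exists(full_remote_fn)`, and the logged `storage_config` is `local_fs` with `wd: "/home/kuscia/var/storage/data"`. The snippet below assumes the full path is simply the working directory joined with the `uri` from `data_refs`; that join, and the helper name `missing_inputs`, are assumptions for illustration and are not taken from the SecretFlow source.

```python
import os

# Working directory copied from the logged storage_config (local_fs.wd).
STORAGE_WD = "/home/kuscia/var/storage/data"


def missing_inputs(wd, uris):
    """Return the URIs that do not exist under the storage working directory.

    Assumption: the component resolves each input as os.path.join(wd, uri).
    """
    return [uri for uri in uris if not os.path.exists(os.path.join(wd, uri))]


if __name__ == "__main__":
    # The URI is the one listed in data_refs for both alice and bob.
    for uri in missing_inputs(STORAGE_WD, ["psi-output-1.csv"]):
        print(f"missing: {os.path.join(STORAGE_WD, uri)}")
```

Running it in both the alice and bob containers shows which side is missing the PSI output that the split task expects as input.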
