secretflow / kuscia

Kuscia(Kubernetes-based Secure Collaborative InfrA) is a K8s-based privacy-preserving computing task orchestration framework.
https://www.secretflow.org.cn/docs/kuscia/latest/zh-Hans
Apache License 2.0
73 stars 55 forks

Job fails when running on a P2P cluster deployed with Docker across multiple machines #414

Closed shnnosuke34725 closed 1 month ago

shnnosuke34725 commented 2 months ago

Issue Type

Running

Search for existing issues similar to yours

Yes

OS Platform and Distribution

Linux Ubuntu 22.04

Kuscia Version

kuscia v0.11.0b0

Deployment

docker

deployment Version

docker 24.0.5

App Running type

secretflow

App Running version

secretflow-lite-anolis8:1.7.0b0

Configuration file used to run kuscia.

# alice-autonomy (default configuration, generated by following the documented procedure)
mode: autonomy
domainID: alice
domainKeyData: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFb2dJQkFBS0NBUUVBcWRSUkpRK1JkRUxRa2ppazBiVllhWWZ4T1l6a3I4WXJ4SE42NDhHNW9nNnlVSDFsClNzVm8vaGgwRmdxSmROZFB5Mk9qMXNGS1V4YmFqT1BFblUzaWNQV2w5RVZoTFJEMWJFajM4Z1gva1N6UUtQeC8KNkFaNFYwaXdaK2llRnA5d0xTMXBnMmxsOVZwK2pKK0pTNUkyOG9hNVQycW90SXVLVit6WkNFVnpTWFc0MHpaSwpicStmeGZUdzFTUEZpcWtRM052d2lhcGpmZDFpY1VqdFNMVW9FVm5KQ1ZmdGpHTlhpM0p5QngvdXAwTzZHTmRFCjd1VTU3OXFIeTJZbXZtNDUwUXRLNG40TEhjS2lMdzhBS2hEaWJMbGtVRWhpOVhUYXFMVWpNbVpIczVWM3pXOE4KUElpaGhkbGN5b1ZoVzZrdFNSRE10YU90UklvY3pHU21jdWlwNndJREFRQUJBb0lCQUhadFpVeUh4N0dnS2h2ZApQaW95NEgxdTIrdDY4Ym9WWWszekRYNG5pSkNXMlFmQitkR2pTZXp2Rm55TVNvQmM2UHIyOTdoNVA2QWpicklTCjN2ZW02VUpHT3J6VmFNZHBiUXRlOHZBbCtLcSs2a1c2bG1NeHA5ZU9DOTNaMit3QXNOUUFOL1Q0bWEzM3RnblAKOG9qdFpEM0pidzRQWGFmUkt0N1hmaDBEZVRwK3BvTllSSnF0UXlXZVQxU2g5cnMvUEZ6djVuUnYyOGU3VFcycApvTUhmb3BySFZJQ3p4YlprbnlsSmtqUEUydVloYXlTNE5oaUxhTXpzYlpZekwzUjA3VndrMnpIb1U4dVJEWHNxCjNmYWJ2OGpLWEJwcUFHNG9SS2ZXWS83blk1ejdDcGhHWklsSk1xOW5ZalQreXdMR3NIU2N1ZEcweUJlNU5rL1AKYXFjc1dra0NnWUVBd1JlZ2NRaWNwaXRiYlBPVnNnSC9zem43ajc5RTJYSjVqcFNIalFIWUhub2xlRzBqbDYxOApsUVcxbWhkazlLbjZkV3MzSFd5enp4V0VsME9IMlY2aENBczBqbCt6eFpOVHNGNTdIajVueDc1RWFMYWJRdlFxCnA2dXJOTldHS25rdnFkNWREdWlzbk93VkVYM01DSzlIbE5rT2dzY1hGNlc5NXdPOGgvWUhDdDBDZ1lFQTRTaUIKTDJKUGdkZCtEbWR3RnJoUWwrWFdQS2lCU2RvbFJJaWVYUkhpK21XT3Q4MmRaeURlc2kxb040c0tjVzFwaHNteApnT3hSa3d4YzBNV0cyZi9mSEFWekh1ZWlJRjV6UTErTHl3Tkl3K1hsUFhMZUcwVlJWTUtUQ0NtSXpIZWwxc25NCkVGVkFDUUxvc0tvbDRMMUVKdW5LL0xNaithVlBpTko5TlVGUVIyY0NnWUJWSTdiUndFdGFGYW9GVzA0NUpCcDgKQzJmNWxRdWxtWTB4cWhvdXVZNXl1Y2NGMTVHbkVvN3BJcEJWZGxWRWNDS0lYWkw2dlhCM01mUzV3Y1FIdTJyagpvaFUxWmN0ZHBiMXorZVR0aS9TMHBSZUMyR21UVnhmcndJMElDZEpUcmdXdkwrWDJhZSthYlpwSWtTQkRBQTVlCittb2tqZWFIdmNRRE5hbU9oWlBMWFFLQmdFdmc0SmhkWXpuNHEweWpZMHprMUpROEtwVEtuTGVNd3A1MEJCcU4KV3BiVC91TEdjbE04Nm8vVmFaZStUY2luL0xZbDVxSHlBaE95U04wNmxCV0hlMkx3R3puQkNnd3FpR0dlSTNoSgpKUTZQdlUrV0ZHL1FUblpvRkREZC9uSVpxRlBZTWVNWE43dFJ0YVZEMGZ3SkRKeW9rWFhUMFQzaWpna29GbllLCkNzbmxBb0dBVDJOZUJYRHNXTmRUelpmR2xyR
mJoRmtaNTBuaStjOThLb2JIV0JtZTBsZDdjMFNsdXhQTzN0YmgKaXhPUW1JTWc3YmVPZ0wwV1dIQkEzb1dWaWRRMkRvWEpkYUpHMDc1UUNrRE10MzZLWHdWZHVFTkZ2Qmg2WlY4egp1cjA2cmZWZ3pYR2lJamZkT29BS3BHQ2xudjIrTWdWbk1PMHFqeUcxYTJ6T1d2cVI3Um89Ci0tLS0tRU5EIFJTQSBQUklWQVRFIEtFWS0tLS0tCg==
logLevel: INFO
runtime: runc
runk:
  namespace: ""
  dnsServers: []
  kubeconfigFile: ""
capacity:
  cpu: ""
  memory: ""
  pods: ""
  storage: ""
reservedResources:
  cpu: ""
  memory: ""
image:
  pullPolicy: ""
  defaultRegistry: ""
  registries: []
datastoreEndpoint: ""
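
The domainKeyData value above is a base64-encoded PEM private key. When comparing two deployments, or the same deployment before and after a reinstall, a digest of the decoded key is enough to tell whether the key changed, without pasting the key itself. A minimal sketch using only the Python standard library; the function name describe_domain_key is hypothetical and not part of Kuscia, and whether a changed key is related to the error in this issue is an assumption that the thread does not confirm:

```python
import base64
import hashlib
from typing import Tuple


def describe_domain_key(domain_key_data: str) -> Tuple[str, str]:
    """Return (first PEM line, SHA-256 hex digest) for a base64 domainKeyData value.

    Hypothetical helper for illustration; it only decodes and hashes the value
    and does not validate the key material itself.
    """
    # Pasted values are often wrapped across several lines, so drop all whitespace.
    compact = "".join(domain_key_data.split())
    pem = base64.b64decode(compact, validate=True)
    lines = pem.decode("ascii", errors="replace").splitlines()
    header = lines[0] if lines else ""
    return header, hashlib.sha256(pem).hexdigest()


if __name__ == "__main__":
    # Placeholder PEM body, not a real key.
    fake_pem = b"-----BEGIN RSA PRIVATE KEY-----\nAAAA\n-----END RSA PRIVATE KEY-----\n"
    value = base64.b64encode(fake_pem).decode("ascii")
    header, digest = describe_domain_key(value)
    print(header)
    print(digest)
```

Running the same function on the value from each kuscia.yaml and comparing the digests shows whether the two files carry the same key.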

What happened and what you expected to happen.

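Documentation: https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/deployment/Docker_deployment_kuscia/deploy_p2p_cn
I followed the documentation through to the last step: docker exec -it ${USER}-kuscia-autonomy-alice scripts/user/create_example_job.sh. When I then checked the job status, the job had failed. I verified that network communication between the two machines works, and I also tried switching to Kuscia v0.10.0b0 and v0.9.0b0, with the same result.
The same steps run without problems on two Ubuntu 18.04 machines.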

Kuscia log output.

state:
      terminated:
        containerID: containerd://c82832003fc68eb90814d57580589a95fa93107887116cb6fe1271ad7bc62075
        exitCode: 1
        finishedAt: "2024-08-27T01:54:37Z"
        message: |+
          WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
          Traceback (most recent call last):
            File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
              return _run_code(code, main_globals, None,
            File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
              exec(code, run_globals)
            File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
              main()
            File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
              return self.main(*args, **kwargs)
            File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
              rv = self.invoke(ctx)
            File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
              return ctx.invoke(self.callback, **ctx.params)
            File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
              return __callback(*args, **kwargs)
            File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 502, in main
              datasource = get_domain_data_source(datasource_stub, datasource_id)
            File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/datamesh.py", line 115, in get_domain_data_source
              raise RuntimeError(f"get_domain_data_source failed for {id}: ret = {ret}")
          RuntimeError: get_domain_data_source failed for default-data-source: ret = status {
            code: 12302
            message: "decrypt data source info failed, crypto/rsa: decryption error"
          }

        reason: Error
        startedAt: "2024-08-27T01:54:32Z"
  hostIP: 172.18.0.2
  phase: Failed
  startTime: "2024-08-27T01:54:31Z"
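
The status block at the end of the traceback carries the two fields that matter for triage: the numeric code and the message. A minimal sketch for pulling them out of the terminated message, assuming the message is available as a plain string; the function name parse_status is hypothetical and not part of Kuscia or SecretFlow:

```python
import re
from typing import Optional, Tuple

# Matches the text dump printed inside the RuntimeError, for example:
#   status {
#     code: 12302
#     message: "..."
#   }
_STATUS_RE = re.compile(
    r'status\s*\{\s*code:\s*(?P<code>\d+)\s*message:\s*"(?P<message>[^"]*)"\s*\}'
)


def parse_status(text: str) -> Optional[Tuple[int, str]]:
    """Return (code, message) from the first status block in text, or None."""
    match = _STATUS_RE.search(text)
    if match is None:
        return None
    return int(match.group("code")), match.group("message")


if __name__ == "__main__":
    sample = (
        "RuntimeError: get_domain_data_source failed for default-data-source: "
        "ret = status {\n"
        "  code: 12302\n"
        '  message: "decrypt data source info failed, crypto/rsa: decryption error"\n'
        "}\n"
    )
    print(parse_status(sample))
```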

383004576 commented 2 months ago

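Hello, please follow the troubleshooting steps in the documentation to check your configuration. If the problem persists, please provide the container logs from both parties. https://www.secretflow.org.cn/zh-CN/docs/kuscia/v0.11.0b0/troubleshoot/run_job_failed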

shnnosuke34725 commented 2 months ago

Hello, here is the log from the alice container:

2024-08-27T09:54:36.054900253+08:00 stderr F WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
2024-08-27T09:54:37.123971248+08:00 stderr F Traceback (most recent call last):
2024-08-27T09:54:37.124006465+08:00 stderr F   File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-08-27T09:54:37.124329442+08:00 stderr F     return _run_code(code, main_globals, None,
2024-08-27T09:54:37.124343848+08:00 stderr F   File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
2024-08-27T09:54:37.124549198+08:00 stderr F     exec(code, run_globals)
2024-08-27T09:54:37.124582165+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 547, in <module>
2024-08-27T09:54:37.124986638+08:00 stderr F     main()
2024-08-27T09:54:37.125011657+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
2024-08-27T09:54:37.125668582+08:00 stderr F     return self.main(*args, **kwargs)
2024-08-27T09:54:37.125682953+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
2024-08-27T09:54:37.126313236+08:00 stderr F     rv = self.invoke(ctx)
2024-08-27T09:54:37.126325539+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
2024-08-27T09:54:37.127183089+08:00 stderr F     return ctx.invoke(self.callback, **ctx.params)
2024-08-27T09:54:37.127197327+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
2024-08-27T09:54:37.127663725+08:00 stderr F     return __callback(*args, **kwargs)
2024-08-27T09:54:37.127674377+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/entry.py", line 502, in main
2024-08-27T09:54:37.128090421+08:00 stderr F     datasource = get_domain_data_source(datasource_stub, datasource_id)
2024-08-27T09:54:37.1281264+08:00 stderr F   File "/usr/local/lib/python3.10/site-packages/secretflow/kuscia/datamesh.py", line 115, in get_domain_data_source
2024-08-27T09:54:37.128246377+08:00 stderr F     raise RuntimeError(f"get_domain_data_source failed for {id}: ret = {ret}")
2024-08-27T09:54:37.128274142+08:00 stderr F RuntimeError: get_domain_data_source failed for default-data-source: ret = status {
2024-08-27T09:54:37.128285058+08:00 stderr F   code: 12302
2024-08-27T09:54:37.128295707+08:00 stderr F   message: "decrypt data source info failed, crypto/rsa: decryption error"
2024-08-27T09:54:37.128305992+08:00 stderr F }
2024-08-27T09:54:37.128315981+08:00 stderr F

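Each entry in the container log has four space-separated fields: a timestamp, the stream name, a tag (F for a full line, P for a partial one), and the message. Stripping the first three recovers the original traceback. A minimal sketch, assuming the log is available one entry per line; the names CriLine, parse_cri_line, and messages are hypothetical helpers, not part of any Kuscia tooling:

```python
from typing import List, NamedTuple


class CriLine(NamedTuple):
    timestamp: str
    stream: str   # "stdout" or "stderr"
    tag: str      # "F" = full line, "P" = partial line
    message: str


def parse_cri_line(line: str) -> CriLine:
    """Split one container log line into timestamp, stream, tag, and message."""
    # Split on the first three spaces only, so indentation inside the
    # message (for example in a Python traceback) is preserved.
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) < 3:
        raise ValueError(f"not a container log line: {line!r}")
    # A blank output line has no message field; pad it with an empty string.
    while len(parts) < 4:
        parts.append("")
    return CriLine(*parts)


def messages(lines: List[str], stream: str = "stderr") -> List[str]:
    """Return only the message part of every line written to the given stream."""
    return [p.message for p in map(parse_cri_line, lines) if p.stream == stream]


if __name__ == "__main__":
    log = [
        "2024-08-27T09:54:37.128285058+08:00 stderr F   code: 12302",
        "2024-08-27T09:54:37.128305992+08:00 stderr F }",
    ]
    print("\n".join(messages(log)))
```
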
383004576 commented 2 months ago

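Since it works on other machines, leftover cached data may be the cause. Try deleting the data under /${USER}/kuscia (you can list it with ls /${USER}/kuscia) and then reinstalling.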

github-actions[bot] commented 1 month ago

Stale issue message. Please comment to remove stale tag. Otherwise this issue will be closed soon.