secretflow / secretflow

A unified framework for privacy-preserving data analysis and machine learning
https://www.secretflow.org.cn/docs/secretflow/en/
Apache License 2.0
2.3k stars, 380 forks

In version 1.5.0b0, the program does not exit automatically after an SPU task completes #1281

Open nfangxu opened 3 months ago

nfangxu commented 3 months ago

Issue Type

Bug

Source

binary

Secretflow Version

1.5.0b0

OS Platform and Distribution

Centos 7.9.2009

Python version

3.10.14

Bazel version

No response

GCC/Compiler version

No response

What happened and what you expected to happen.

Start a Ray cluster on each of the two machines:

# 192.168.3.21
export ip="192.168.3.21"
ray start --head --node-ip-address="${ip}" --port="9010" --include-dashboard=False --disable-usage-stats
# 192.168.3.23
export ip="192.168.3.23"
ray start --head --node-ip-address="${ip}" --port="9010" --include-dashboard=False --disable-usage-stats

Then run the following on each machine:

# 192.168.3.23
python3 demo.py -p=client
# 192.168.3.21
python3 demo.py -p=server

After execution finishes, the log output is as follows (the process does not exit on either machine):

[root@sf-3-23 ~]# python3 demo.py -p=client
2024-05-09 11:24:55,420 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 192.168.3.23:9010...
2024-05-09 11:24:55,436 INFO worker.py:1724 -- Connected to Ray cluster.
2024-05-09 11:24:55.481 INFO api.py:233 [client] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'client': '0.0.0.0:9020', 'server': '192.168.3.21:9020'}, 'CURRENT_PARTY_NAME': 'client', 'TLS_CONFIG': {}}
2024-05-09 11:24:55.481 DEBUG message_queue.py:56 [client] -- [Anonymous_job] Starting new thread[DataSendingQueueThread] for message polling.
2024-05-09 11:24:55.482 DEBUG cleanup.py:67 [client] -- [Anonymous_job] Start check sending thread.
2024-05-09 11:24:55.482 DEBUG message_queue.py:56 [client] -- [Anonymous_job] Starting new thread[ErrorSendingQueueThread] for message polling.
2024-05-09 11:24:55.483 DEBUG cleanup.py:69 [client] -- [Anonymous_job] Start check error sending thread.
2024-05-09 11:24:55.483 DEBUG barriers.py:445 [client] -- [Anonymous_job] Starting ReceiverProxyActor with options: {'max_concurrency': 1, 'name': 'SenderReceiverProxyActor'}
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:24:56.954 INFO link.py:38 [client] -- [Anonymous_job] brpc options: {'message_max_size_in_bytes': 2147483647, 'timeout_in_ms': 1800000, 'connect_retry_times': 8640, 'connect_retry_interval_ms': 10000, 'recv_timeout_ms': 21600000, 'http_timeout_ms': 21600000, 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:24:56.954 WARNING link_config.py:34 [client] -- [Anonymous_job] http_timeout_ms and timeout_ms are set at the same time, http_timeout_ms 21600000 will be used.
(SenderReceiverProxyActor pid=17006) I0509 11:24:56.980060 17006 external/com_github_brpc_brpc/src/brpc/server.cpp:1158] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=9020.
(SenderReceiverProxyActor pid=17006) W0509 11:24:56.980113 17006 external/com_github_brpc_brpc/src/brpc/server.cpp:1164] Builtin services are disabled according to ServerOptions.has_builtin_services
2024-05-09 11:25:02.569 INFO barriers.py:465 [client] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-05-09 11:25:02.569 INFO barriers.py:520 [client] -- [Anonymous_job] Try ping ['server'] at 0 attemp, up to 3600 attemps.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:02.579 DEBUG barriers.py:397 [client] -- [Anonymous_job] Sending send data to seq_id ping of server from ping without credentials.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:02.579 DEBUG barriers.py:408 [client] -- [Anonymous_job] Succeeded to send data to seq_id ping of server from ping. Response is True
=========================Start
2024-05-09 11:25:02.631 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function get_data at 0x7f406d3205e0>, num_returns=None, args len: 1, kwargs len: 0.
2024-05-09 11:25:02.632 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:02.636 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.get_shares_chunk_count at 0x7f404853d480>, num_returns=None, args len: 4, kwargs len: 0.
(_run pid=4520) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': 
(_run pid=4520) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=4520) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2024-05-09 11:25:04.529 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:04.540 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.run_spu_io at 0x7f404853e320>, num_returns=4, args len: 4, kwargs len: 0.
2024-05-09 11:25:04.541 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:04.541 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:04.542 DEBUG fed_actor.py:104 [client] -- [Anonymous_job] Actor method call: infeed_share, num_returns: 1
2024-05-09 11:25:04.544 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:04.545 DEBUG fed_actor.py:104 [client] -- [Anonymous_job] Actor method call: del_share, num_returns: 1
2024-05-09 11:25:04.545 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function get_data at 0x7f406d3205e0>, num_returns=None, args len: 1, kwargs len: 0.
2024-05-09 11:25:04.545 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.get_shares_chunk_count at 0x7f404853e320>, num_returns=None, args len: 4, kwargs len: 0.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:04.530 DEBUG barriers.py:397 [client] -- [Anonymous_job] Sending send data to seq_id 7 of server from 6#0 without credentials.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:04.530 DEBUG barriers.py:408 [client] -- [Anonymous_job] Succeeded to send data to seq_id 7 of server from 6#0. Response is True
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.609 DEBUG barriers.py:397 [client] -- [Anonymous_job] Sending send data to seq_id 10 of server from 8#1 without credentials.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.609 DEBUG barriers.py:408 [client] -- [Anonymous_job] Succeeded to send data to seq_id 10 of server from 8#1. Response is True
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.611 DEBUG barriers.py:397 [client] -- [Anonymous_job] Sending send data to seq_id 10 of server from 8#3 without credentials.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.611 DEBUG barriers.py:408 [client] -- [Anonymous_job] Succeeded to send data to seq_id 10 of server from 8#3. Response is True
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.612 DEBUG link.py:93 [client] -- [Anonymous_job] Getting data for 15 from 14#0 of server
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:05.612 DEBUG link.py:114 [client] -- [Anonymous_job] Received data for ping from ping.
2024-05-09 11:25:06.650 DEBUG pyu.py:105 [client] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.run_spu_io at 0x7f404853e680>, num_returns=4, args len: 4, kwargs len: 0.
2024-05-09 11:25:06.651 DEBUG utils.py:66 [client] -- [Anonymous_job] Insert recv_op, arg task id 16#1, current task id 17
2024-05-09 11:25:06.652 DEBUG utils.py:66 [client] -- [Anonymous_job] Insert recv_op, arg task id 16#2, current task id 17
2024-05-09 11:25:06.653 DEBUG fed_actor.py:104 [client] -- [Anonymous_job] Actor method call: infeed_share, num_returns: 1
2024-05-09 11:25:06.653 DEBUG utils.py:63 [client] -- [Anonymous_job] Insert fed object, arg.party=client
2024-05-09 11:25:06.653 DEBUG fed_actor.py:104 [client] -- [Anonymous_job] Actor method call: del_share, num_returns: 1
=========================Success
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.647 DEBUG link.py:114 [client] -- [Anonymous_job] Received data for 15 from 14#0.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.648 DEBUG link.py:120 [client] -- [Anonymous_job] Getted data for 15 from 14#0 of server.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.654 DEBUG link.py:93 [client] -- [Anonymous_job] Getting data for 17 from 16#1 of server
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.655 DEBUG link.py:114 [client] -- [Anonymous_job] Received data for 17 from 16#1.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.655 DEBUG link.py:120 [client] -- [Anonymous_job] Getted data for 17 from 16#1 of server.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.657 DEBUG link.py:93 [client] -- [Anonymous_job] Getting data for 17 from 16#2 of server
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.657 DEBUG link.py:114 [client] -- [Anonymous_job] Received data for 17 from 16#2.
(SenderReceiverProxyActor pid=17006) 2024-05-09 11:25:06.657 DEBUG link.py:120 [client] -- [Anonymous_job] Getted data for 17 from 16#2 of server.
[root@sf-3-21 ~]# python3 demo.py -p=server
2024-05-09 11:24:52,188 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 192.168.3.21:9010...
2024-05-09 11:24:52,203 INFO worker.py:1724 -- Connected to Ray cluster.
2024-05-09 11:24:52.249 INFO api.py:233 [server] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'client': '192.168.3.23:9020', 'server': '0.0.0.0:9020'}, 'CURRENT_PARTY_NAME': 'server', 'TLS_CONFIG': {}}
2024-05-09 11:24:52.249 DEBUG message_queue.py:56 [server] -- [Anonymous_job] Starting new thread[DataSendingQueueThread] for message polling.
2024-05-09 11:24:52.250 DEBUG cleanup.py:67 [server] -- [Anonymous_job] Start check sending thread.
2024-05-09 11:24:52.250 DEBUG message_queue.py:56 [server] -- [Anonymous_job] Starting new thread[ErrorSendingQueueThread] for message polling.
2024-05-09 11:24:52.250 DEBUG cleanup.py:69 [server] -- [Anonymous_job] Start check error sending thread.
2024-05-09 11:24:52.250 DEBUG barriers.py:445 [server] -- [Anonymous_job] Starting ReceiverProxyActor with options: {'max_concurrency': 1, 'name': 'SenderReceiverProxyActor'}
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:24:53.721 INFO link.py:38 [server] -- [Anonymous_job] brpc options: {'message_max_size_in_bytes': 2147483647, 'timeout_in_ms': 1800000, 'connect_retry_times': 8640, 'connect_retry_interval_ms': 10000, 'recv_timeout_ms': 21600000, 'http_timeout_ms': 21600000, 'exit_on_sending_failure': True}
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:24:53.722 WARNING link_config.py:34 [server] -- [Anonymous_job] http_timeout_ms and timeout_ms are set at the same time, http_timeout_ms 21600000 will be used.
(SenderReceiverProxyActor pid=24536) I0509 11:24:53.749538 24536 external/com_github_brpc_brpc/src/brpc/server.cpp:1158] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=9020.
(SenderReceiverProxyActor pid=24536) W0509 11:24:53.749588 24536 external/com_github_brpc_brpc/src/brpc/server.cpp:1164] Builtin services are disabled according to ServerOptions.has_builtin_services
(SenderReceiverProxyActor pid=24536) I0509 11:24:53.869094 24630 external/com_github_brpc_brpc/src/brpc/socket.cpp:2466] Checking Socket{id=0 addr=192.168.3.23:9020} (0x3513080)
(SenderReceiverProxyActor pid=24536) I0509 11:24:59.871872 24662 external/com_github_brpc_brpc/src/brpc/socket.cpp:2526] Revived Socket{id=0 addr=192.168.3.23:9020} (0x3513080) (Connectable)
2024-05-09 11:25:02.792 INFO barriers.py:465 [server] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-05-09 11:25:02.792 INFO barriers.py:520 [server] -- [Anonymous_job] Try ping ['client'] at 0 attemp, up to 3600 attemps.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:02.799 DEBUG barriers.py:397 [server] -- [Anonymous_job] Sending send data to seq_id ping of client from ping without credentials.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:02.800 DEBUG barriers.py:408 [server] -- [Anonymous_job] Succeeded to send data to seq_id ping of client from ping. Response is True
=========================Start
2024-05-09 11:25:02.852 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function get_data at 0x7fa10e3005e0>, num_returns=None, args len: 1, kwargs len: 0.
2024-05-09 11:25:02.852 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.get_shares_chunk_count at 0x7fa10471b0a0>, num_returns=None, args len: 4, kwargs len: 0.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:02.856 DEBUG link.py:93 [server] -- [Anonymous_job] Getting data for 7 from 6#0 of client
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:02.857 DEBUG link.py:114 [server] -- [Anonymous_job] Received data for ping from ping.
2024-05-09 11:25:04.757 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.run_spu_io at 0x7fa10471b130>, num_returns=4, args len: 4, kwargs len: 0.
2024-05-09 11:25:04.758 DEBUG utils.py:66 [server] -- [Anonymous_job] Insert recv_op, arg task id 8#1, current task id 10
2024-05-09 11:25:04.760 DEBUG utils.py:66 [server] -- [Anonymous_job] Insert recv_op, arg task id 8#3, current task id 10
2024-05-09 11:25:04.762 DEBUG fed_actor.py:104 [server] -- [Anonymous_job] Actor method call: infeed_share, num_returns: 1
2024-05-09 11:25:04.764 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:04.764 DEBUG fed_actor.py:104 [server] -- [Anonymous_job] Actor method call: del_share, num_returns: 1
2024-05-09 11:25:04.772 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function get_data at 0x7fa10e3005e0>, num_returns=None, args len: 1, kwargs len: 0.
2024-05-09 11:25:04.772 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:04.778 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.get_shares_chunk_count at 0x7fa10471b130>, num_returns=None, args len: 4, kwargs len: 0.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:04.753 DEBUG link.py:114 [server] -- [Anonymous_job] Received data for 7 from 6#0.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:04.754 DEBUG link.py:120 [server] -- [Anonymous_job] Getted data for 7 from 6#0 of client.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:04.761 DEBUG link.py:93 [server] -- [Anonymous_job] Getting data for 10 from 8#1 of client
2024-05-09 11:25:06.716 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:06.720 DEBUG pyu.py:105 [server] -- [Anonymous_job] PYU remote function: <function pyu_to_spu.<locals>.run_spu_io at 0x7fa104719480>, num_returns=4, args len: 4, kwargs len: 0.
2024-05-09 11:25:06.722 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:06.722 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:06.722 DEBUG fed_actor.py:104 [server] -- [Anonymous_job] Actor method call: infeed_share, num_returns: 1
2024-05-09 11:25:06.723 DEBUG utils.py:63 [server] -- [Anonymous_job] Insert fed object, arg.party=server
2024-05-09 11:25:06.723 DEBUG fed_actor.py:104 [server] -- [Anonymous_job] Actor method call: del_share, num_returns: 1
=========================Success
(_run pid=10863) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': 
(_run pid=10863) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=10863) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.865 DEBUG link.py:114 [server] -- [Anonymous_job] Received data for 10 from 8#1.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.866 DEBUG link.py:120 [server] -- [Anonymous_job] Getted data for 10 from 8#1 of client.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.867 DEBUG link.py:93 [server] -- [Anonymous_job] Getting data for 10 from 8#3 of client
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.868 DEBUG link.py:114 [server] -- [Anonymous_job] Received data for 10 from 8#3.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.868 DEBUG link.py:120 [server] -- [Anonymous_job] Getted data for 10 from 8#3 of client.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.868 DEBUG barriers.py:397 [server] -- [Anonymous_job] Sending send data to seq_id 15 of client from 14#0 without credentials.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.869 DEBUG barriers.py:408 [server] -- [Anonymous_job] Succeeded to send data to seq_id 15 of client from 14#0. Response is True
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.870 DEBUG barriers.py:397 [server] -- [Anonymous_job] Sending send data to seq_id 17 of client from 16#1 without credentials.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.870 DEBUG barriers.py:408 [server] -- [Anonymous_job] Succeeded to send data to seq_id 17 of client from 16#1. Response is True
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.871 DEBUG barriers.py:397 [server] -- [Anonymous_job] Sending send data to seq_id 17 of client from 16#2 without credentials.
(SenderReceiverProxyActor pid=24536) 2024-05-09 11:25:06.871 DEBUG barriers.py:408 [server] -- [Anonymous_job] Succeeded to send data to seq_id 17 of client from 16#2. Response is True

Reproduction code to reproduce the issue.

The code of demo.py is as follows:

import argparse
import secretflow as sf
import logging

def ray_init(self_party):
    sf.shutdown()

    ip={
        "server": "192.168.3.21",
        "client": "192.168.3.23",
    }[self_party]

    sf.init(address=ip+":9010",
            cluster_config={
                'self_party': self_party,
                'parties': {
                    'client': {
                        'id': 'client',
                        'party': 'client',
                        'address': '192.168.3.23:9020',
                        'listen_addr': '0.0.0.0:9020',
                    },
                    'server': {
                        'id': 'server',
                        'party': 'server',
                        'address': '192.168.3.21:9020',
                        'listen_addr': '0.0.0.0:9020',
                    }
                },
            },
            log_to_driver=True,
            logging_level=logging.getLevelName(logging.DEBUG).lower(),
            cross_silo_comm_backend='brpc_link',
            cross_silo_comm_options={
                "message_max_size_in_bytes": (2 << 30) - 1,
                "timeout_in_ms": 30 * 60 * 1000,
                # BRPC Config
                "connect_retry_times": 6 * 60 * 24,
                "connect_retry_interval_ms": 10 * 1000,
                "recv_timeout_ms": 6 * 3600 * 1000,
                "http_timeout_ms": 6 * 3600 * 1000,
                },
            )

def spu_init():
    cluster_def = {
        "runtime_config": {
            "protocol": "SEMI2K",
            "field": "FM128",
            "fxp_fraction_bits": 32,
            "fxp_div_goldschmidt_iters": 10,
        },
        "nodes": [
            {
                "party": 'client',
                'address': '192.168.3.23:9030',
                "listen_address": "0.0.0.0:9030"
            },
            {
                "party": 'server',
                'address': '192.168.3.21:9030',
                "listen_address": "0.0.0.0:9030"
            },
        ],
    }

    # link_desc
    link_desc = {
        "connect_retry_times": 6 * 60 * 24,
        "connect_retry_interval_ms": 10 * 1000,
        "recv_timeout_ms": 6 * 3600 * 1000,
        "http_timeout_ms": 6 * 3600 * 1000,
        "throttle_window_size": 0,
        "brpc_channel_protocol": "http",
        "brpc_channel_connection_type": "pooled",
    }

    return sf.SPU(cluster_def=cluster_def, link_desc=link_desc)

def get_data(i):
    return i

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--party", default="", help="party id")
    args = parser.parse_args()

    # Ray init
    ray_init(args.party)
    spu_device = spu_init()
    pyus = [sf.PYU("client"), sf.PYU("server")]

    print("=========================Start")
    for pyu in pyus:
        pyu(get_data)(1).to(spu_device)

    print("=========================Success")

nfangxu commented 3 months ago

The relevant version information is as follows (SPU has been modified locally, so version 0.8.0b0 is used):

# pip3 list | grep secretflow
secretflow                   1.5.0b0
secretflow-rayfed            0.2.1a1
secretflow-serving-lib       0.3.0.dev20240320
# pip3 list | grep spu
spu                          0.8.0b0
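
ian-huu commented 3 months ago

You need to add sf.shutdown() at the end of the script. You may then see the error AttributeError: 'NoneType' object has no attribute 'get_job_name'. This is a known issue and will be fixed as soon as possible.

In addition, to make sure the tasks have finished before shutdown, it is recommended to call sf.wait(some_result) before sf.shutdown(), for example: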


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--party", default="", help="party id")
    args = parser.parse_args()

    # Ray init
    ray_init(args.party)
    spu_device = spu_init()
    pyus = [sf.PYU("client"), sf.PYU("server")]

    print("=========================Start")
    spu_objs = []
    for pyu in pyus:
        obj = pyu(get_data)(1).to(spu_device)
        spu_objs.append(obj)

    print("=========================Success")

    sf.wait(spu_objs)

    sf.shutdown()
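
The wait-then-shutdown ordering above could also be wrapped so that shutdown still runs if a task raises. This is only a minimal sketch: `run_then_shutdown` is a hypothetical helper name, not part of SecretFlow, and it assumes nothing beyond `sf.wait` and `sf.shutdown` being callable as in the snippet above. The callables are passed in so the ordering can be checked without a running cluster.

```python
from typing import Any, Callable


def run_then_shutdown(
    job: Callable[[], Any],
    wait: Callable[[Any], Any],
    shutdown: Callable[[], Any],
) -> Any:
    """Run job(), wait on its result, and always call shutdown() afterwards.

    Hypothetical helper: all three callables are injected by the caller.
    """
    try:
        result = job()  # submit the work, e.g. build the list of SPU objects
        wait(result)    # block until the submitted work has finished
        return result
    finally:
        shutdown()      # runs on success and on exceptions alike
```

With the script above, this might be called as `run_then_shutdown(submit, sf.wait, sf.shutdown)`, where `submit` is a hypothetical function that returns `spu_objs`. The helper only guarantees the call order; it does not change what `sf.shutdown()` itself does.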
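
magic-hya commented 1 month ago

I ran into this problem too. Even with sf.shutdown() added, the program does not exit, and rayfed is still running. The log below is from when I stopped it with Ctrl+C; the code cannot exit on its own.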

[^C2024-08-02 03:19:57.864 WARNING api.py:60 [alice] -- [Anonymous_job] Stop signal received (e.g. via SIGINT/Ctrl+C), try to shutdown fed. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
2024-08-02 03:19:57.865 WARNING api.py:325 [alice] -- [Anonymous_job] Shutdowning rayfed unintendedly...
2024-08-02 03:19:57.865 INFO api.py:337 [alice] -- [Anonymous_job] Wait for data sending.
2024-08-02 03:19:57.867 INFO message_queue.py:70 [alice] -- [Anonymous_job] Notify message polling thread[DataSendingQueueThread] to exit.
2024-08-02 03:19:57.876 INFO message_queue.py:100 [alice] -- [Anonymous_job] The message polling thread[DataSendingQueueThread] was exited.
2024-08-02 03:19:57.876 INFO message_queue.py:70 [alice] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-08-02 03:19:57.943 INFO message_queue.py:100 [alice] -- [Anonymous_job] The message polling thread[ErrorSendingQueueThread] was exited.
2024-08-02 03:19:57.944 INFO api.py:352 [alice] -- [Anonymous_job] Shutdowned rayfed.
2024-08-02 03:19:57.944 CRITICAL api.py:356 [alice] -- [Anonymous_job] Exit now due to the previous error.]
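
When a Python process does not exit after the main script finishes, one general cause is a non-daemon thread that is still alive, because the interpreter waits for such threads at exit. The following is a generic diagnostic sketch using only the standard library. It is not specific to rayfed and does not confirm that this is the cause in this issue; `live_non_daemon_threads` is a hypothetical helper name.

```python
import threading
from typing import List


def live_non_daemon_threads() -> List[threading.Thread]:
    """Return alive, non-daemon threads other than the main thread."""
    main = threading.main_thread()
    return [
        t
        for t in threading.enumerate()
        if t is not main and t.is_alive() and not t.daemon
    ]


if __name__ == "__main__":
    # Print whatever is still running at this point in the script.
    for t in live_non_daemon_threads():
        print(f"still running: name={t.name!r} ident={t.ident}")
```

Calling `live_non_daemon_threads()` right after `sf.shutdown()` and printing the thread names may help narrow down which component, if any, is still holding the process open.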