secretflow / secretflow

A unified framework for privacy-preserving data analysis and machine learning
https://www.secretflow.org.cn/docs/secretflow/en/
Apache License 2.0

SPU fails to start in a production environment #507

Closed Niclouge closed 1 year ago

Niclouge commented 1 year ago

Issue Type

Bug

Source

binary

Secretflow Version

secretflow 0.8.1b1

OS Platform and Distribution

ubuntu 20.04

Python version

3.8.16

Bazel version

No response

GCC/Compiler version

No response

What happened and what you expected to happen.

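In a production environment, while trying out vertical XGB, the SPU fails to start after configuration. The program hangs at this point with no message at all; only after checking with telnet did I discover that the SPU had not actually started.
The two machines can ping each other. Ray uses port 6677, the proxy uses port 6678, and the SPU uses port 6679. After running the code, ports 6677 and 6678 are listening locally, but port 6679 is not.
The logs show that Ray and the proxy are in fact connected on both sides.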
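
The telnet check described above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `port_is_open` and `check_ports` are hypothetical and not part of SecretFlow. It only tests whether a TCP connection can be opened, which is the same information telnet gives.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def check_ports(host: str, ports, timeout: float = 2.0) -> dict:
    """Map each port to True (accepting connections) or False (closed or unreachable)."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

For the setup in this report, calling `check_ports('192.168.200.109', [6677, 6678, 6679])` on each machine would show which of the Ray, proxy, and SPU ports is actually accepting connections.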

The log from one party is as follows:
2023-05-06 14:05:36,507 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 192.168.200.109:6677...
2023-05-06 14:05:36,526 INFO worker.py:1538 -- Connected to Ray cluster.
2023-05-06 14:05:36 INFO api.py:202 [party_1] --  Started rayfed with {'CLUSTER_ADDRESSES': {'party_1': {'address': '192.168.200.109:6678'}, 'party_2': {'address': '192.168.200.210:6678'}}, 'CURRENT_PARTY_NAME': 'party_1', 'TLS_CONFIG': {}, 'CROSS_SILO_SERIALIZING_ALLOWED_LIST': None, 'CROSS_SILO_MESSAGES_MAX_SIZE_IN_BYTES': None, 'CROSS_SILO_TIMEOUT_IN_SECONDS': 3600}
2023-05-06 14:05:38 INFO barriers.py:353 [party_1] --  RecverProxy was successfully created.
(RecverProxyActor pid=9092) 2023-05-06 14:05:38 INFO barriers.py:114 [party_1] --  Successfully start Grpc service without credentials.
2023-05-06 14:05:40 INFO barriers.py:388 [party_1] --  SendProxy was successfully created.
2023-05-06 14:05:40 INFO barriers.py:463 [party_1] --  Try ping ['party_2'] at 0 attemp, up to 3600 attemps.
2023-05-06 14:05:40 INFO barriers.py:444 [party_1] --  Succeeded to ping party_2 on 192.168.200.210:6678, the result: OK.

The log from the other party is as follows:
 Connecting to existing Ray cluster at address: 192.168.200.210:6677...
2023-05-06 13:51:18,416 INFO worker.py:1538 -- Connected to Ray cluster.
2023-05-06 13:51:18 INFO api.py:202 [party_2] --  Started rayfed with {'CLUSTER_ADDRESSES': {'party_1': {'address': '192.168.200.109:6678'}, 'party_2': {'address': 'localhost:6678'}}, 'CURRENT_PARTY_NAME': 'party_2', 'TLS_CONFIG': {}, 'CROSS_SILO_SERIALIZING_ALLOWED_LIST': None, 'CROSS_SILO_MESSAGES_MAX_SIZE_IN_BYTES': None, 'CROSS_SILO_TIMEOUT_IN_SECONDS': 3600}
2023-05-06 13:51:19 INFO barriers.py:353 [party_2] --  RecverProxy was successfully created.
(RecverProxyActor pid=33432) 2023-05-06 13:51:19 INFO barriers.py:114 [party_2] --  Successfully start Grpc service without credentials.
2023-05-06 13:51:19 INFO barriers.py:388 [party_2] --  SendProxy was successfully created.
2023-05-06 13:51:19 INFO barriers.py:463 [party_2] --  Try ping ['party_1'] at 0 attemp, up to 3600 attemps.
2023-05-06 13:51:19 INFO barriers.py:449 [party_2] --  Failed to ping party_1 on 192.168.200.109:6678, this could be normal, the possible reason is party_1 has not yet started.
2023-05-06 13:51:21 INFO barriers.py:463 [party_2] --  Try ping ['party_1'] at 1 attemp, up to 3600 attemps.
2023-05-06 13:51:21 INFO barriers.py:449 [party_2] --  Failed to ping party_1 on 192.168.200.109:6678, this could be normal, the possible reason is party_1 has not yet started.
2023-05-06 13:51:23 INFO barriers.py:463 [party_2] --  Try ping ['party_1'] at 2 attemp, up to 3600 attemps.
2023-05-06 13:51:23 INFO barriers.py:449 [party_2] --  Failed to ping party_1 on 192.168.200.109:6678, this could be normal, the possible reason is party_1 has not yet started.
2023-05-06 13:51:25 INFO barriers.py:463 [party_2] --  Try ping ['party_1'] at 3 attemp, up to 3600 attemps.
2023-05-06 13:51:25 INFO barriers.py:449 [party_2] --  Failed to ping party_1 on 192.168.200.109:6678, this could be normal, the possible reason is party_1 has not yet started.
2023-05-06 13:51:27 INFO barriers.py:463 [party_2] --  Try ping ['party_1'] at 4 attemp, up to 3600 attemps.
2023-05-06 13:51:27 INFO barriers.py:449 [party_2] --  Failed to ping party_1 on 192.168.200.109:6678, this could be normal, the possible reason is party_1 has not yet started.
2023-05-06 13:51:29 INFO barriers.py:463 [party_2] --  Try ping ['party_1'] at 5 attemp, up to 3600 attemps.
2023-05-06 13:51:29 INFO barriers.py:449 [party_2] --  Failed to ping party_1 on 192.168.200.109:6678, this could be normal, the possible reason is party_1 has not yet started.
2023-05-06 13:51:31 INFO barriers.py:463 [party_2] --  Try ping ['party_1'] at 6 attemp, up to 3600 attemps.
2023-05-06 13:51:31 INFO barriers.py:449 [party_2] --  Failed to ping party_1 on 192.168.200.109:6678, this could be normal, the possible reason is party_1 has not yet started.
2023-05-06 13:51:33 INFO barriers.py:463 [party_2] --  Try ping ['party_1'] at 7 attemp, up to 3600 attemps.
2023-05-06 13:51:33 INFO barriers.py:449 [party_2] --  Failed to ping party_1 on 192.168.200.109:6678, this could be normal, the possible reason is party_1 has not yet started.
2023-05-06 13:51:35 INFO barriers.py:463 [party_2] --  Try ping ['party_1'] at 8 attemp, up to 3600 attemps.
2023-05-06 13:51:35 INFO barriers.py:444 [party_2] --  Succeeded to ping party_1 on 192.168.200.109:6678, the result: OK.

Reproduction code to reproduce the issue.

The SPU configuration is as follows:
spu_config = {
    'nodes': [
        {
            'party': 'party_1', 
            'address': '192.168.200.109: 6679'
        },
        {
            'party': 'party_2', 
            'address': '192.168.200.210: 6679'
        }
    ], 
    'runtime_config': {
        'protocol': spu.spu_pb2.SEMI2K, 
        'field': spu.spu_pb2.FM128
    }
}
The HEU configuration is as follows:
heu_config = {
            'sk_keeper': {'party': "party_1"},
            'evaluators': [{'party': "party_2"}],
            'mode': 'PHEU',
            'he_parameters': {
                'schema': 'ou',
                'key_pair': {
                    'generate': {
                        # bit size should be 2048 to provide sufficient security.
                        'bit_size': 2048,
                    }
                }
            },
            'encoding': {
                'cleartext_type': 'DT_I32',
                'encoder': "IntegerEncoder",
                'encoder_args': {"scale": 1}
            }
        }
The relevant code is as follows:
alice = sf.PYU('party_1')  
bob = sf.PYU('party_2')  
heu = sf.HEU(heu_config, spu_config['runtime_config']['field'])  

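The program hangs at sf.HEU(): I added print() statements before and after this call, and only the one before it produces output; nothing after sf.HEU() is printed.
I hope this can be resolved. I would also like to suggest an improvement: when the SPU cannot start or cannot connect, it should give some kind of message.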
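
Until such a message exists, a silent hang like this can at least be made visible from the calling script. Below is a minimal sketch, assuming nothing about SecretFlow internals: `hang_watchdog` is a hypothetical helper built on the standard library's `threading.Timer`. It does not interrupt the blocked call; it only reports that the call has not returned yet.

```python
import threading
from contextlib import contextmanager


@contextmanager
def hang_watchdog(label: str, seconds: float, notify=print):
    """Call notify(message) once if the wrapped block is still running after `seconds`."""
    message = (
        f"'{label}' has not returned after {seconds}s; "
        "check that every party is running and that its ports are reachable"
    )
    timer = threading.Timer(seconds, notify, args=(message,))
    timer.daemon = True  # do not keep the process alive just for the watchdog
    timer.start()
    try:
        yield
    finally:
        # If the block finished in time, the pending notification is cancelled.
        timer.cancel()


# Hypothetical usage around the call that hangs in this report:
# with hang_watchdog("sf.HEU", 30):
#     heu = sf.HEU(heu_config, spu_config['runtime_config']['field'])
```

The standard library's `faulthandler.dump_traceback_later(timeout)` is an alternative that prints the stack of every thread after the timeout, which shows where the program is blocked.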
Niclouge commented 1 year ago

While the program is running, the listening ports are as follows:

image

Chrisdehe commented 1 year ago

Hi @Niclouge, could you please post more complete code? I could not reproduce the problem on my side. Here is my code for reference:

import spu
import sys
import time
import logging
from sklearn.metrics import roc_auc_score

import secretflow as sf
from secretflow.data import FedNdarray, PartitionWay
from secretflow.device.driver import reveal
from secretflow.ml.boost.sgb_v import Sgb

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

parties = {
  'alice': {
    'address': 'xxx:9494'
  },
  'bob': {
    'address': 'xxx:9495'
  }
}

cluster_config ={
  'parties': parties,
  'self_party': sys.argv[1]
}

alice_ip = 'xxx'  # IP address only, no port
bob_ip = 'xxx'  # IP address only, no port
ip_party_map = {bob_ip:'bob', alice_ip:'alice'}

_system_config = {'lineage_pinning_enabled':False}
sf.shutdown()
# init cluster
sf.init(address='local', cluster_config=cluster_config, _system_config = _system_config, object_store_memory = 5 * 1024 * 1024 * 1024)

# SPU settings
cluster_def = {
    'nodes': [
        {'party': 'alice', 'id': 'local:0', 'address': 'xxx:9594'},
        {'party': 'bob', 'id': 'local:1', 'address': 'xxx:9595'},
    ],
    'runtime_config': {
        # SEMI2K supports 2PC/3PC, ABY3 only supports 3PC, CHEETAH only supports 2PC.
        # Pay attention to the number of nodes above; it must match the protocol's party count.
        'protocol': spu.spu_pb2.SEMI2K,
        'field': spu.spu_pb2.FM128,
    },
}

# HEU settings
heu_config = {
    'sk_keeper': {'party': 'alice'},
    'evaluators': [{'party': 'bob'}],
    'mode': 'PHEU',
    'he_parameters': {
        # ou is a fast encryption schema that is as secure as paillier.
        'schema': 'ou',
        'key_pair': {
            'generate': {
                # bit size should be 2048 to provide sufficient security.
                'bit_size': 2048,
            },
        },
    },
    'encoding': {
        'cleartext_type': 'DT_I32',
        'encoder': "IntegerEncoder",
        'encoder_args': {"scale": 1},
    }
}

alice = sf.PYU('alice')
bob = sf.PYU('bob')
heu = sf.HEU(heu_config, cluster_def['runtime_config']['field'])

from sklearn.datasets import load_breast_cancer

ds = load_breast_cancer()
x, y = ds['data'], ds['target']

v_data = FedNdarray(
    {
        alice: (alice(lambda: x[:, :15])()),
        bob: (bob(lambda: x[:, 15:])()),
    },
    partition_way=PartitionWay.VERTICAL,
)
label_data = FedNdarray(
    {alice: (alice(lambda: y)())},
    partition_way=PartitionWay.VERTICAL,
)

params = {
            'num_boost_round': 5,
            'max_depth': 5,
            # about 13 bin numbers
            'sketch_eps': 0.08,
            # use 'linear' for regression
            # for classification, only binary classification is currently supported
            'objective': 'logistic',
            'reg_lambda': 0.3,
            'subsample': 0.9,
            'colsample_by_tree': 0.9,
        }

sgb = Sgb(heu)
model = sgb.train(params, v_data, label_data)

yhat = model.predict(v_data)
yhat = reveal(yhat)
print(f"auc: {roc_auc_score(y, yhat)}")
Chrisdehe commented 1 year ago

Version 0.8.2b1

Niclouge commented 1 year ago

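I have found the problem: the SPU configuration is also tied to the HEU configuration below it. The sk_keeper party in the HEU configuration must use the same name as the corresponding PYU.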
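
A naming mismatch of this kind can be caught before any device is created. The sketch below is a plain-Python check over the configuration dictionaries shown in this thread; `check_party_names` is a hypothetical helper, not a SecretFlow API, and it assumes that the party names passed to sf.PYU are the reference set that the SPU nodes and the HEU sk_keeper and evaluators must match.

```python
def check_party_names(parties, spu_config, heu_config):
    """Return a list of naming problems; an empty list means the names are consistent.

    `parties` is the collection of names used for sf.PYU(...) (assumption:
    these are the names every device configuration has to agree with).
    """
    known = set(parties)
    problems = []

    # Every SPU node must belong to a declared party.
    for node in spu_config.get('nodes', []):
        if node['party'] not in known:
            problems.append(f"SPU node party {node['party']!r} is not a declared party")

    # The HEU secret-key keeper must be a declared party.
    keeper = heu_config['sk_keeper']['party']
    if keeper not in known:
        problems.append(f"HEU sk_keeper party {keeper!r} is not a declared party")

    # Every HEU evaluator must be a declared party.
    for evaluator in heu_config.get('evaluators', []):
        if evaluator['party'] not in known:
            problems.append(f"HEU evaluator party {evaluator['party']!r} is not a declared party")

    return problems
```

Calling `check_party_names(['party_1', 'party_2'], spu_config, heu_config)` before `sf.HEU(...)` and printing any returned problems turns a silent hang into an explicit message.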