MLikeWater opened this issue 2 weeks ago
The current engineering implementation of APSI is not yet mature; its performance is only sufficient for a small number of queries for algorithm testing. We will define PIR-related interfaces and add related optimizations in the future.
@MLikeWater
You can try adding the recv_timeout_ms parameter in the link_config and increasing its value. Reference: https://github.com/secretflow/psi/blob/c2f460e20efbe74c3a80b26c98e8bb89295d717f/docs/reference/launch_config.md?plain=1#L85
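For illustration, an online config carrying that timeout might look like the sketch below; placing recv_timeout_ms directly alongside parties inside link_config and the 30-minute value are assumptions to adapt to your own setup, not values taken from this thread:

{
  "link_config": {
    "parties": [
      { "id": "sender", "host": "127.0.0.1:5300" },
      { "id": "receiver", "host": "127.0.0.1:5400" }
    ],
    "recv_timeout_ms": 1800000
  },
  "self_link_party": "receiver"
}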
Received, thank you for the information provided. In the testing/demo phase, when handling large data volumes, besides the recv_timeout_ms parameter suggested by @tongke6, what other configurations are needed?
Looking forward to your reply, thanks.
@huocun-ant Using the parameters from https://github.com/secretflow/psi/blob/main/examples/pir/apsi/parameters/256M-4096.json for validation, it is currently possible to run with tens of millions of data entries, although the speed is relatively slow; the log outputs the progress.
There is another issue: the sender's original data is only 3.2 GB, but during the sender setup phase the generated data directory is very large, reaching 153 GB (the apsi_sender_bucket directory).
{
  "table_params": {
    "hash_func_count": 3,
    "table_size": 6144,
    "max_items_per_bin": 4000
  },
  "item_params": {
    "felts_per_item": 4
  },
  "query_params": {
    "ps_low_degree": 310,
    "query_powers": [1, 4, 10, 11, 28, 33, 78, 118, 143, 311, 1555]
  },
  "seal_params": {
    "plain_modulus_bits": 26,
    "poly_modulus_degree": 8192,
    "coeff_modulus_bits": [50, 50, 50, 38, 30]
  }
}
{
  "apsi_sender_config": {
    "threads": 1,
    "log_level": "info",
    "compress": true,
    "source_file": "/home/admin/dev/demo/data/bank_data_5000w.csv",
    "params_file": "/home/admin/dev/demo/data/100K-1-16.json",
    "save_db_only": true,
    "experimental_enable_bucketize": true,
    "experimental_bucket_cnt": 10000,
    "experimental_bucket_folder": "/home/admin/dev/demo/data/apsi_sender_bucket/",
    "experimental_db_generating_process_num": 16,
    "experimental_bucket_group_cnt": 512
  }
}
{
  "apsi_sender_config": {
    "source_file": "/home/admin/dev/demo/data/bank_data_5000w.csv",
    "params_file": "/home/admin/dev/demo/data/100K-1-16.json",
    "experimental_enable_bucketize": true,
    "compress": true,
    "experimental_bucket_cnt": 10000,
    "experimental_bucket_folder": "/home/admin/dev/demo/data/apsi_sender_bucket/",
    "experimental_db_generating_process_num": 16,
    "experimental_bucket_group_cnt": 512
  },
  "link_config": {
    "parties": [
      {
        "id": "sender",
        "host": "127.0.0.1:5300"
      },
      {
        "id": "receiver",
        "host": "127.0.0.1:5400"
      }
    ]
  },
  "self_link_party": "sender"
}
"params_file": "/home/admin/dev/demo/data/100K-1-16.json"
means your bucket size is 10k, and your query size is 1 row. So experimental_bucket_cnt
should be 5000w / 100k = 500
, you can set experimental_bucket_cnt
to 500. There is a issue, your query size is large, but 100K-1-16.json
is optimized for 1 row query, therefore, the parameters contain optimization space.
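Following that suggestion, a sketch of the setup config posted above with only experimental_bucket_cnt changed to 500 (all other fields, paths, and counts are the poster's own) could look like this:

{
  "apsi_sender_config": {
    "threads": 1,
    "log_level": "info",
    "source_file": "/home/admin/dev/demo/data/bank_data_5000w.csv",
    "params_file": "/home/admin/dev/demo/data/100K-1-16.json",
    "save_db_only": true,
    "experimental_enable_bucketize": true,
    "experimental_bucket_cnt": 500,
    "experimental_bucket_folder": "/home/admin/dev/demo/data/apsi_sender_bucket/",
    "experimental_db_generating_process_num": 16,
    "experimental_bucket_group_cnt": 512
  }
}

Presumably the same experimental_bucket_cnt value would also need to be used in the sender online config and the receiver config, which set it to 10000 above; that consistency requirement is an assumption worth confirming.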
Describe the bug
Sender Setup Stage
sender terminal log:
receiver terminal log:
result
Once the receiver starts, the sender fails, and the log is as follows:
Steps To Reproduce
config/apsi_sender_setup_bucket.json
{ "apsi_sender_config": { "threads": 1, "log_level": "info", "source_file": "/home/admin/dev/demo/data/bank_data_5000w.csv", "params_file": "/home/admin/dev/demo/data/100K-1-16.json", "save_db_only": true, "experimental_enable_bucketize": true, "experimental_bucket_cnt": 10000, "experimental_bucket_folder": "/home/admin/dev/demo/data/apsi_sender_bucket/", "experimental_db_generating_process_num": 16, "experimental_bucket_group_cnt": 512 } }
config/apsi_sender_online_bucket.json
{ "apsi_sender_config": { "source_file": "/home/admin/dev/demo/data/bank_data_5000w.csv", "params_file": "/home/admin/dev/demo/data/100K-1-16.json", "experimental_enable_bucketize": true, "experimental_bucket_cnt": 10000, "experimental_bucket_folder": "/home/admin/dev/demo/data/apsi_sender_bucket/", "experimental_db_generating_process_num": 16, "experimental_bucket_group_cnt": 512 }, "link_config": { "parties": [ { "id": "sender", "host": "127.0.0.1:5300" }, { "id": "receiver", "host": "127.0.0.1:5400" } ] }, "self_link_party": "sender" }
config/apsi_receiver_bucket.json
{ "apsi_receiver_config": { "query_file": "/home/admin/dev/demo/data/meituan_data_2500w.csv", "output_file": "/home/admin/dev/demo/data/batch_result.csv", "params_file": "/home/admin/dev/demo/data/100K-1-16.json", "experimental_enable_bucketize": true, "experimental_bucket_cnt": 10000 }, "link_config": { "parties": [ { "id": "sender", "host": "127.0.0.1:5300" }, { "id": "receiver", "host": "127.0.0.1:5400" } ] }, "self_link_party": "receiver" }
Expected behavior
The sender has 50 million rows of data consisting of keys and values, where each key is a hash of a phone number and starts with a letter from A to K. The receiver has 25 million rows containing only keys. The expected result is that the receiver obtains the intersection of its 25 million keys together with the corresponding values.
Version
v0.4.2b0
Operating system
Ubuntu 20.04
Hardware Resources
48 cores / 96 GB memory