mc2-project / secure-xgboost

Secure collaborative training and inference for XGBoost.
https://mc2-project.github.io/secure-xgboost/
Apache License 2.0

different result between secure xgboost and xgboost #132

Closed: wangsu502 closed this issue 3 years ago

wangsu502 commented 3 years ago

Hi, I found that the prediction results from the latest Secure XGBoost are always different from those of XGBoost 1.2.0. Dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv

My Python code:

import csv
import random

import xgboost as xgb

random.seed(0)

RAW_DATA_FILE_PATH = '/home/test/energydata_complete.csv'
TRAIN_FILE_PATH = '/home/test/regression/train.txt'
TEST_FILE_PATH = '/home/test/regression/test.txt'

def main():
    # Data pre-processing: write each row as a label followed by
    # comma-separated index:value features, with a ~10% test split
    with open(RAW_DATA_FILE_PATH, 'r') as fin, open(TRAIN_FILE_PATH, 'w') as train_fout, open(TEST_FILE_PATH, 'w') as test_fout:
        reader = csv.reader(fin)
        _ = next(reader)  # skip the header row
        for row in reader:
            label = row[1]
            line = str(float(label)) + ' ' + ','.join(['{}:{}'.format(no, float(f)) for no, f in enumerate(row[2:])])
            if random.random() < 0.1:
                fout = test_fout
            else:
                fout = train_fout
            fout.write(line + '\n')

    dtrain = xgb.DMatrix(TRAIN_FILE_PATH)
    dtest = xgb.DMatrix(TEST_FILE_PATH)

    param = {'max_depth': 5, 'eta': 0.3, 'objective': 'reg:squarederror', 'n_estimators': 200, 'alpha': 0, 'lambda': 100, 'sketch_eps': 0.03}
    bst = xgb.train(param, dtrain, 10)

    # Mean absolute error over the test set
    mae, n = 0, 0
    with open(TEST_FILE_PATH, 'r') as fin:
        for line, y_pred in zip(fin, bst.predict(dtest)):
            y = float(line.strip().split()[0])
            y_pred = float(y_pred)
            mae += abs(y - y_pred)
            n += 1
    mae = mae / n
    print(mae)

if __name__ == '__main__':
    main()

The result is:

[05:31:05] WARNING: xgboost/src/learner.cc:516: Parameters: { n_estimators } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but passed down to XGBoost core. Or some parameters are not used but slip through this verification. Please open an issue if you find above cases.

46.5267243112152
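
Note: the warning fires because n_estimators is a parameter of the sklearn wrapper, not of the native training API; with xgb.train, the number of boosting rounds comes from the num_boost_round argument instead. A minimal sketch on toy data (the names here are illustrative, not from the script above):

import numpy as np
import xgboost as xgb

# toy data so the snippet runs standalone
X = np.random.rand(100, 5)
y = np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)

params = {'max_depth': 5, 'eta': 0.3, 'objective': 'reg:squarederror'}
# native API: the round count is num_boost_round; an 'n_estimators' key in
# params would be ignored, which is what the warning above reports
bst = xgb.train(params, dtrain, num_boost_round=10)

# sklearn wrapper: n_estimators is the equivalent knob
reg = xgb.XGBRegressor(n_estimators=10, max_depth=5, learning_rate=0.3)
reg.fit(X, y)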

And below is my Python code for Secure XGBoost:

xgb.generate_client_key(key_file)
xgb.encrypt_file(inputfile_train, inputfile_train + ".enc", key_file)
xgb.encrypt_file(inputfile_test, inputfile_test + ".enc", key_file)

print("Init user and enclave parameters")
xgb.init_client(config=config_file)
xgb.init_server(enclave_image=xgboost_enclave_image_file, client_list=[user_name], log_verbosity=1)

# Remote Attestation
print("Remote attestation")

# Note: Simulation mode does not support attestation
# pass in `verify=False` to attest()
xgb.attest(verify=False)

print("Creating training matrix from encrypted file")
dtrain = xgb.DMatrix({user_name: inputfile_train + ".enc"})

print("Creating test matrix from encrypted file")
dtest = xgb.DMatrix({user_name: inputfile_test + ".enc"})

print("Beginning Training")
# Set training parameters
param = {'max_depth': 5, 'eta': 0.3, 'objective': 'reg:squarederror', 'n_estimators': 200, 'alpha': 0, 'lambda': 100, 'sketch_eps': 0.03}

print("Set training parameters:")
print(param)
# Train and evaluate
booster = xgb.train(param, dtrain, int(number_of_rounds), evals=[(dtrain, "train"), (dtest, "test")])

# Get encrypted predictions
print("\nModel Predictions: ")
predictions, num_preds = booster.predict(dtest, decrypt=False)

# Decrypt predictions (decrypt_predictions is called three times below;
# decrypting once and reusing the result would be equivalent)
print(booster.decrypt_predictions(predictions, num_preds))
with open(outputfile_predict_result, 'w') as rf:
    rf.write(pd.Series(booster.decrypt_predictions(predictions, num_preds)).to_json(orient='values'))

# Mean absolute error over the plaintext test file
mae, n = 0, 0
with open(inputfile_test, 'r') as fin:
    for line, y_pred in zip(fin, booster.decrypt_predictions(predictions, num_preds)):
        y = float(line.strip().split()[0])
        y_pred = float(y_pred)
        mae += abs(y - y_pred)
        n += 1
mae = mae / n
print(mae)

And the result is:

Beginning Training
Set training parameters:
{'max_depth': 5, 'eta': 0.3, 'objective': 'reg:squarederror', 'n_estimators': 200, 'alpha': 0, 'lambda': 100, 'sketch_eps': 0.03}

Model Predictions:
[ 74.07482 47.515785 47.590363 ... 117.18135 157.50322 114.312454]
25.92078459969717

The MAE of Secure XGBoost is lower than that of normal XGBoost. Is there any optimization applied to the implementation? Could you help look into this issue? I think the results should be the same when both are given the same parameters.

thanks, Su

wangsu502 commented 3 years ago

I set the number of rounds to 10 for both XGBoost implementations. The results: sgx 44.7969234672623 vs normal 46.5267243112152.
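
Note: one way to localize a divergence like this is to compare the two boosters' tree dumps and find the first tree that differs. A hedged sketch; it assumes Secure XGBoost's Booster mirrors regular XGBoost's get_dump(), which may not hold:

# Hedged debugging sketch: return the index and text of the first tree where
# two boosters disagree. get_dump() exists on regular XGBoost's Booster; its
# availability on Secure XGBoost's Booster is an assumption here.
def first_divergent_tree(bst_a, bst_b):
    for i, (tree_a, tree_b) in enumerate(zip(bst_a.get_dump(), bst_b.get_dump())):
        if tree_a != tree_b:
            return i, tree_a, tree_b
    return None  # all compared trees are identical

If the dumps already differ at tree 0, the two implementations are seeing different data or sketches from the start; a later first divergence points at accumulated numeric differences.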

podcastinator commented 3 years ago

Hi Su,

I tried to reproduce this, but I'm getting the same result with both: 46.5267243112152. Would you please double-check?

I built XGBoost release_1.2.0 from source, and ran the following script:

import xgboost as xgb
import os
import csv
import random

random.seed(0)

DIR = os.path.dirname(os.path.realpath(__file__))
HOME_DIR = DIR + "/../../../"
RAW_DATA_FILE_PATH = HOME_DIR + 'demo/data/energydata_complete.csv'
TRAIN_FILE_PATH = HOME_DIR + 'demo/data/train.txt'
TEST_FILE_PATH = HOME_DIR + 'demo/data/test.txt'

# Data pre-processing
with open(RAW_DATA_FILE_PATH, 'r') as fin, open(TRAIN_FILE_PATH, 'w') as train_fout, open(TEST_FILE_PATH, 'w') as test_fout:
    reader = csv.reader(fin)
    _ = next(reader)
    for row in reader:
        label = row[1]
        line = str(float(label)) + ' ' + ','.join(['{}:{}'.format(no, float(f)) for no, f in enumerate(row[2:])])
        if random.random() < 0.1:
            fout = test_fout
        else:
            fout = train_fout
        fout.write(line + '\n')

dtrain = xgb.DMatrix(TRAIN_FILE_PATH)
dtest = xgb.DMatrix(TEST_FILE_PATH)

param = {'max_depth': 5, 'eta': 0.3, 'objective': 'reg:squarederror', 'n_estimators': 200, 'alpha': 0, 'lambda': 100, 'sketch_eps': 0.03}
bst = xgb.train(param, dtrain, 10)

mae, n = 0, 0
with open(TEST_FILE_PATH, 'r') as fin:
    for line, y_pred in zip(fin, bst.predict(dtest)):
        y = float(line.strip().split()[0])
        y_pred = float(y_pred)
        mae += abs(y - y_pred)
        n += 1
mae = mae / n
print(mae)

And for Secure XGBoost, I built the latest code on the master branch and ran the following:

import securexgboost as xgb
import os
import csv
import random

random.seed(0)

user_name = "user1"
DIR = os.path.dirname(os.path.realpath(__file__))
HOME_DIR = DIR + "/../../../"
RAW_DATA_FILE_PATH = HOME_DIR + 'demo/data/energydata_complete.csv'
TRAIN_FILE_PATH = HOME_DIR + 'demo/data/train.txt'
TEST_FILE_PATH = HOME_DIR + 'demo/data/test.txt'

key_file = "../../data/key_zeros.txt"
xgb.generate_client_key(key_file)
xgb.encrypt_file(TRAIN_FILE_PATH, TRAIN_FILE_PATH + ".enc", key_file)
xgb.encrypt_file(TEST_FILE_PATH, TEST_FILE_PATH + ".enc", key_file)

print("Init user and enclave parameters")
xgb.init_client(config="config.ini")
xgb.init_server(enclave_image=HOME_DIR + "build/enclave/xgboost_enclave.signed", client_list=["user1"], log_verbosity=0)

# Remote Attestation
print("Remote attestation")

# Note: Simulation mode does not support attestation
# pass in `verify=False` to attest()
xgb.attest(verify=False)

print("Creating training matrix from encrypted file")
dtrain = xgb.DMatrix({user_name: TRAIN_FILE_PATH + ".enc"})

print("Creating test matrix from encrypted file")
dtest = xgb.DMatrix({user_name: TEST_FILE_PATH + ".enc"})

param = {'max_depth': 5, 'eta': 0.3, 'objective': 'reg:squarederror', 'n_estimators': 200, 'alpha': 0, 'lambda': 100, 'sketch_eps': 0.03}
# booster = xgb.train(param, dtrain, 10)
booster = xgb.train(param, dtrain, 10, evals=[(dtrain, "train"), (dtest, "test")])

# Get encrypted predictions
print("\nModel Predictions: ")
predictions, num_preds = booster.predict(dtest, decrypt=False)

# Decrypt predictions
print(booster.decrypt_predictions(predictions, num_preds))

mae, n = 0, 0
with open(TEST_FILE_PATH, 'r') as fin:
    for line, y_pred in zip(fin, booster.decrypt_predictions(predictions, num_preds)):
        y = float(line.strip().split()[0])
        y_pred = float(y_pred)
        mae += abs(y - y_pred)
        n += 1
mae = mae / n
print(mae)

In both cases, I get the same result: 46.5267243112152
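
Note: when comparing two builds for exact agreement, it can also help to pin the parameters that can otherwise introduce variation. A hedged sketch using standard XGBoost 1.2 parameters:

# Sketch: make implicit defaults explicit so both implementations resolve
# the same configuration; all keys below are standard XGBoost 1.2 parameters.
param = {
    'max_depth': 5,
    'eta': 0.3,
    'objective': 'reg:squarederror',
    'alpha': 0,
    'lambda': 100,
    'tree_method': 'approx',  # name the split-finding method explicitly
    'sketch_eps': 0.03,       # only used by the approx method
    'nthread': 1,             # single thread fixes floating-point summation order
    'seed': 0,                # fix any internal randomness
}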

wangsu502 commented 3 years ago

Emmmm... Did you modify the cmake options, or are they all default?

wangsu502 commented 3 years ago

> I tried to reproduce this, but I'm getting the same result with both: 46.5267243112152. Would you please double-check?

Could you please provide your running environment information? Python version, numpy version, etc.

podcastinator commented 3 years ago

> Emmmm... Did you modify the cmake options, or are they all default?

Default options except that I ran it in simulation mode (OE_DEBUG=1 and SIMULATE=ON), but I don't believe that should matter. Did you build regular XGBoost from source? If not, can you try that please? https://github.com/dmlc/xgboost/tree/v1.2.0

wangsu502 commented 3 years ago

> > Emmmm... Did you modify the cmake options, or are they all default?
>
> Default options except that I ran it in simulation mode (OE_DEBUG=1 and SIMULATE=ON), but I don't believe that should matter. Did you build regular XGBoost from source? If not, can you try that please? https://github.com/dmlc/xgboost/tree/v1.2.0

Hi, I've fixed the issue by rebuilding everything. XD

podcastinator commented 3 years ago

Great, closing this issue in that case!