Closed: DylanWangWQF closed this issue 1 year ago.
Another question: does the ID3 implementation in MPyC support processing big datasets (e.g., 50,000 data samples with 14 features)?
First, your question about larger datasets: numbers like 50,000 samples should be fine. To experiment with this quickly, you can insert `transactions *= 20` right before line 110 in the demo, to take multiple copies of a given dataset. Running the demo `python np_id3gini.py -i 4 -l 96 -M 3` then still finishes within 30 seconds, this time for 63,920 samples instead of 3,196. The resulting decision tree is still the same. But note the `-l 96` switch, which overrides the preset bit length of `l=69` for dataset `i=4`, to enlarge the range of values for the secure integer type sufficiently for the MPC calculations.
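The list-replication trick itself is plain Python: `*=` on a list concatenates copies. A minimal sketch with a made-up tiny dataset (in the demo, `transactions` is read from a CSV file):

```python
# Hypothetical tiny dataset of 3 transactions; stands in for the list
# read from CSV in the demo.
transactions = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]

transactions *= 20  # list repetition: 20 concatenated copies of the 3 samples

print(len(transactions))  # 60
```

Since all counts grow by the replication factor, the secure integers holding intermediate results need a correspondingly larger bit length, hence the `-l 96` override.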
Regarding your other question about the use of `T * S_A`: indeed, to make this work `S_A` should be a secure NumPy array, like `T` already is. The error message says that `T` is not an integer, which is what `*` requires when applied to a Python list (as we did above with the assignment `transactions *= 20`, in which `transactions` is a list).
You can compute `S_A` directly using operations on secure NumPy arrays. For similar code, take a close look at how things are done in the np_solver.py demo. There, matrix multiplication `@` is used to compute dot products with unit vectors, etc.
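As a plain-NumPy sketch of that idea (MPyC's secure arrays support the same `@` operator; the matrix below is made up for illustration): a unit vector times a matrix selects a row without revealing which one.

```python
import numpy as np

# Selecting a row of a matrix with a unit vector via matrix multiplication.
S = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])
k = np.array([0, 1, 0])  # unit vector encoding index 1

row = k @ S              # dot products with the columns select row 1
print(row)               # [0 1 1]
```

On secure arrays the index encoded by `k` stays secret, since only dot products are computed.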
Hi @lschoe
I have implemented the version of SID3T using secure NumPy arrays. The resulting tree is exactly the same as the one generated by id3gini.py and np_id3gini.py.
But the performance is not as expected. Note that I currently test only binary features. For the SPECT dataset, it's actually faster than the previous result (~35s). But for the KRKPA7 dataset (35 features, with one nonbinary feature removed), it takes over ~300s, which is even much slower than the version without NumPy arrays.
If you want to test the runtime, here is the code (np_SID3T.py.txt) and the binary datasets: KRKPA7_36.csv, binary_tennis.csv. SPECT is originally a dataset with binary features.
Hi @DylanWangWQF
Right, very good: your code works as intended. To get better performance, it's important to avoid unnecessary conversions (from lists of secure integers to secure arrays), as these can be rather time consuming even if no actual secure computation is involved.
You can use the following lines as a replacement in your code:

```python
k = CT.argmax(key=SecureFraction, raw=True, raw2=False)
A = list(RR)[await mpc.output(k @ np.arange(len(k)))]
logging.info(f'Attribute node {A}')
R_A = R + k
S_A = (S.swapaxes(0, 2) @ np.concatenate((k, np.array([0])))).swapaxes(0, 1)
```

and this line to initialize `S`:

```python
S = secint.array(np.array([[[t[i] == j for t in transactions] for j in attr_ranges[i]] for i in range(d)]))
```
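As a plain-NumPy sketch of what this indicator array looks like (with a made-up two-attribute binary dataset): `S[i][j][t]` is 1 exactly when transaction `t` takes value `j` for attribute `i`.

```python
import numpy as np

# Hypothetical tiny dataset: 3 transactions, d = 2 binary attributes.
transactions = [(0, 1), (1, 1), (0, 0)]
attr_ranges = [(0, 1), (0, 1)]
d = 2

# 3D indicator array: attributes x attribute values x transactions.
S = np.array([[[int(t[i] == j) for t in transactions] for j in attr_ranges[i]]
              for i in range(d)])

print(S.shape)  # (2, 2, 3)
```

In the demo the same array is wrapped in `secint.array(...)` so the indicators become secret-shared.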
I've tested this with the SPECT dataset and the performance gets a lot better already, so I wonder what you get this way for KRKPA7.
Note the use of `raw=True` as input to `argmax()`, to get the secret index `k` as a unit vector. (The names `raw` and `raw2` for these keyword arguments will be replaced by more meaningful names soon.)
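A plain-NumPy illustration of how the unit vector is turned back into an ordinary index in the line `k @ np.arange(len(k))` above:

```python
import numpy as np

# A one-hot (unit) vector encoding index 2.
k = np.array([0, 0, 1, 0])

# Dot product with [0, 1, ..., n-1] recovers the index:
# 0*0 + 0*1 + 1*2 + 0*3 = 2.
index = k @ np.arange(len(k))
print(index)  # 2
```

In the demo this dot product is computed on secure values and only its result is opened with `mpc.output`.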
Hi @lschoe
This sounds wonderful and is easier to use.

> (The names `raw` and `raw2` for these keyword arguments will be replaced by more meaningful names soon.)
> I've tested this with the SPECT dataset and the performance gets a lot better already, so I wonder what you get this way for KRKPA7.

I have tested this for the KRKPA7 dataset (35 features, with one nonbinary feature removed): it takes only 6.45s, which is a big improvement (compared to the previous ~300s).
BTW, I have also implemented the version without secure NumPy arrays, if you want to compare against it.
Yeah, great, maybe you can also report the timings that you get for both versions here?
And it would be nice to cover the nonbinary case for the class attribute. I don't think that's hard to add. Then all datasets in `demos/data/id3` can be used as is. For that, also the case that the class attribute is not the last one needs to be handled, as the above code simply appends a `0` to the unit vector `k`.
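A plain-NumPy sketch of that appending step: the unit vector over the non-class attributes is extended with a trailing 0 so it lines up with the attribute axis of `S`, under the assumption that the class attribute occupies the last position.

```python
import numpy as np

k = np.array([0, 1, 0])                     # selects attribute 1

# Append a 0 so the vector also covers the class attribute (assumed last)
# without ever selecting it.
k_ext = np.concatenate((k, np.array([0])))

print(k_ext)  # [0 1 0 0]
```

Handling a class attribute in another position would mean inserting the 0 at that position instead.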
> Yeah, great, maybe you can also report the timings that you get for both versions here?

Sure, I will first put the initial results for the binary case here (averaged over 5 runs).

> And it would be nice to cover the nonbinary case for the class attribute. I don't think that's hard to add. Then all datasets in demos/data/id3 can be used as is. For that also the case that the class attribute is not the last one needs to be handled, as the above code simply appends a 0 to the unit vector k.

Yes, as you said, it's not hard to add. I will do it later, since I have some deadlines coming up.
I tried to improve the performance using the code in `np_*.py`. I have encountered a format issue: `TypeError: can't multiply sequence by non-int of type 'ArraySecInt42'`. `S_A` is a 2D list of normal SecInt data, while `T` is in NumPy format. I tried to use `np.concatenate` etc. to convert `S_A`, but failed. The normal version without NumPy has been tested successfully. If you want to test the np version to check the error, I can provide the code.
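For reference, this `TypeError` is standard Python list behavior, independent of MPyC: `*` on a list only accepts an integer repetition count. A minimal reproduction (a `float` is used below to trigger the same error class as the secure array did):

```python
# List repetition works only with an int on the right-hand side.
ok = [1, 0, 1] * 2      # [1, 0, 1, 1, 0, 1]

try:
    [1, 0, 1] * 2.0     # not an int -> same kind of TypeError as in the issue
except TypeError as e:
    print(e)            # can't multiply sequence by non-int of type 'float'
```

This is why `S_A` must be converted to a secure NumPy array before elementwise multiplication with `T` can work.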