How to transform the normal SecInt data into np format?

DylanWangWQF commented 1 year ago

I tried to improve the performance using the code in np_*.py. I have encountered a format issue TypeError: can't multiply sequence by non-int of type 'ArraySecInt42'.

T = secint.array(np.ones(n, dtype='O'))
S = [secint.array(np.array([[t[i] == j for t in transactions] for j in attr_ranges[i]]))
         for i in range(d)]
S_A = [[mpc.in_prod([S[k][j][l] for k in RR], ii) for l in range(len(S[0][0]))] for j in range(ell)]
T_SA = T * S_A  # mpc.schur_prod

S_A is a list of normal SecInt data, while T is np format. S_A is 2D list. I tried to use np.concatenate etc. to convert S_A but failed.

The normal version without using np has been tested successfully. If you want to test the np version to check the error, I can provide the code

DylanWangWQF commented 1 year ago

Another question: does the id3 implemented in mpyc support processing big dataset (e.g., number of data samples = 50,000, features=14).

lschoe commented 1 year ago

Maybe first your question about larger datasets. Numbers like 50,000 samples should be fine. To quickly experiment with this you can insert transactions *= 20 to take multiple copies of a given dataset right before line 110 in the demo:

https://github.com/lschoe/mpyc/blob/61752b75d63815fe991d0ccf2d4614dd2f3c0f41/demos/np_id3gini.py#L110

Running the demo python np_id3gini.py -i 4 -l 96 -M 3 then still finishes within 30 seconds, however this time for 63920 samples instead of 3196 samples. The resulting decision tree is still the same. But note the -l 96 switch where we override the preset bit length of l=69 for dataset i=4, to enlarge the range of values for the secure integer type sufficiently for the MPC calculations.

lschoe commented 1 year ago

Regarding your other question about the use of T * S_A. Indeed, to make this work S_A should be a secure NumPy array, like T already is. The error message says that T is not an integer, which is required when using * with a Python list (like we did above with the assignment transactions *= 20 in which transactions is a list).

You can see how to compute S_A directly using operations on secure Numpy arrays. For similar code, take a close look at how things are done in the np_solver.py demo. There, matrix multiplication @ is used to compute dot products with unit vectors, etc.

DylanWangWQF commented 1 year ago

Hi @lschoe I have implemented the version of SID3T using secure Numpy arrays. The result tree is totally the same as that generated in id3gini.py and np_id3gini.py. But the performance is not expected. Note that I currently only test for binary features. ForSPECT dataset, it's actually faster than the previous result (~35s). But for KRKPA7 dataset (35 features, remove one nonbinary feature), it tasks over ~300s which is even much slower than the version without using Numpy arrays.

If you want to test the runtime, here is the code of np_SID3T.py.txt, and the binary dataset: KRKPA7_36.csv, binary_tennis.csv. SPECT is originally the dataset with binary features.

lschoe commented 1 year ago

Hi @DylanWangWQF

Right very good, your code works as intended. To get better performance it's important to avoid unnecessary conversions (from list over secure integers to secure arrays), as these can be rather time consuming even if there's no actual secure computation involved.

You can use the following lines as replacement for your code:

k = CT.argmax(key=SecureFraction, raw=True, raw2=False)
A = list(RR)[await mpc.output(k @ np.arange(len(k)))]
logging.info(f'Attribute node {A}')
R_A = R + k
S_A = (S.swapaxes(0, 2) @ np.concatenate((k, np.array([0])))).swapaxes(0, 1)

also using this line to initialize S:

S = secint.array(np.array([[[t[i] == j for t in transactions] for j in attr_ranges[i]] for i in range(d)]))

I've tested this with the SPECT dataset and the performance gets a lot better already, so I wonder what you get this way for KRKPA7.

Note the use of raw = True as input to argmax(), to get the secret index k as a unit vector. (The names raw and raw2 for these keyword arguments will be replaced by more meaningful names soon.)

DylanWangWQF commented 1 year ago

HI @lschoe

This sounds wonderful and is easier to use.

(The names raw and raw2 for these keyword arguments will be replaced by more meaningful names soon.)

I have tested this for KRKPA7 dataset (35 features, remove one nonbinary feature), it only takes 6.45s, which improves a lot (compared to the previous ~300s).

I've tested this with the SPECT dataset, and the performance gets a lot better already, so I wonder what you get this way for KRKPA7.

BTW, I have also implemented the version without using secure Numpy arrays if you want to compare with it.

lschoe commented 1 year ago

Yeah, great, maybe you can also report the timings that you get for both versions here?

And it would be nice to cover the nonbinary case for the class attribute. I don't think that's hard to add. Then all datasets in demos/data/id3 can be used as is. For that also the case that the class attribute is not the last one needs to be handled, as the above code simply appends a 0 to the unit vector k.

DylanWangWQF commented 1 year ago

Sure, I will first put the initial result for the binary case here. (avg 5 times here) 截屏2022-12-20 11 18 13

Yeah, great, maybe you can also report the timings that you get for both versions here?

Yes, as you said, it's not hard to add. I will do it later since I have some deadlines recently.

And it would be nice to cover the nonbinary case for the class attribute. I don't think that's hard to add. Then all datasets in demos/data/id3 can be used as is. For that also the case that the class attribute is not the last one needs to be handled, as the above code simply appends a 0 to the unit vector k.

lschoe / mpyc

How to transform the normal SecInt data into np format? #40