Implement np_shuffle - Githubissues

MarcT0K commented 1 year ago

Implements the shuffle coroutine for numpy-like arrays

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.07% :tada:

Comparison is base (1636d55) 92.88% compared to head (1c80c06) 92.95%.

:exclamation: Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the GitHub App Integration for your organization. Read more.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #49 +/- ## ========================================== + Coverage 92.88% 92.95% +0.07% ========================================== Files 16 16 Lines 8869 8918 +49 ========================================== + Hits 8238 8290 +52 + Misses 631 628 -3 ``` | [Files Changed](https://app.codecov.io/gh/lschoe/mpyc/pull/49?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Berry+Schoenmakers) | Coverage Δ | | |---|---|---| | [mpyc/random.py](https://app.codecov.io/gh/lschoe/mpyc/pull/49?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Berry+Schoenmakers#diff-bXB5Yy9yYW5kb20ucHk=) | `100.00% <100.00%> (ø)` | | ... and [1 file with indirect coverage changes](https://app.codecov.io/gh/lschoe/mpyc/pull/49/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Berry+Schoenmakers)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

MarcT0K commented 1 year ago

While the test passes, I have some doubts about the implementation of the function np_random_unit_vector. I naively extended the function random_unit_vector but a condition is not covered by the test (i.e., elif await runtime.output(u[0] * x[i]):).

When is this condition supposed to be covered? I want to verify that the function covers all cases correctly?

By the way, I copied a TODO which was already present in random_unit_vector. If I have more info about this TODO, I may try to fix it in this PR.

abdelrahamanaly commented 1 year ago

This is not a comment on the PR, rather a small enquire. Is the protocol you use for np_shuffle published somewhere?

MarcT0K commented 1 year ago

The function np_shuffle is simply an adaptation of the already implemented shuffle function but with numpy-like arrays.

As far as I understand the function shuffle, the intuition is to generate a random permutation matrix and to multiply it to the list you want to shuffle. I doubt there is a reference paper for such a simple idea but a similar protocol is described in Lu and Kate (see Variant 1). While this protocol is not highly optimized, it has the advantage of being rather generic. Shuffling protocols such as Laur et al. are much more efficient but they require a bit more work to be implemented for an unbounded number of parties. Laur et al. also has a poor scaling with the number of parties.

My future plan would be to implement in MPyC the variant 3 of Lu and Kate (which is more efficient than Variant 1 and is as generic) and the Laur et al. for 3-party shuffle (i.e., np_shuffle would call this shuffle if the number of parties is 3).

lschoe commented 1 year ago

Thanks for your PR @MarcT0K. A np_shuffle is indeed a useful addition.

Technically there are some things to discuss. A direct generalization of the shuffle function in mpc.random, as you are proposing, makes sense as a basic solution, and could be upgraded later for more efficiency. For this I'll also consider the work done with Niels de Vreede and Noah van der Meer, like we've discussed before, see https://github.com/lschoe/mpyc/issues/37#issuecomment-1323311271.

So, @abdelrahamanaly, this basic solution is a Fisher-Yates (or, Knuth) shuffle) as discussed in SecretSantaExplained.ipynb. The generation of the random unit vectors are done ahead of time, but the actual shuffle is done in n-1 sequential steps, so that amounts to O(n) round complexity.

I see some issues with the current PR. Indeed for np_random_unit_vector() I don't think you handle the restart case adequately. We don't want a recursion there and it also isn't correct wrt privacy I think. For now, an easy way out is to omit np_random_unit_vector() and replace it with a call to random_unit_vector() like already happens inside mpc.np_unit_vector() currently: https://github.com/lschoe/mpyc/blob/aa71585a2ded04b8d25f0140b908ce8208d6838f/mpyc/runtime.py#L3600 This is also meant to be replaced later on by a vectorized implementation. To vectorize this code, however, the restart case needs to be handled differently.

And we also need to discuss the np_shufflea bit more. One thing is that you need to write x = runtime.np_update(x, i, x_u) instead of just runtime.np_update(x, i, x_u). Otherwise things don't work properly when run with multiple parties. You can have a look yourself what the difference is between the two versions.

Also, a question is what kind of shapes for input array a are supported? Is is just 1D and 2D arrays for a. A quick test with a 3D array for a fails. That's due to the use of matmul which works on the innermost dimensions of a only.

MarcT0K commented 1 year ago

Indeed for np_random_unit_vector() I don't think you handle the restart case adequately. We don't want a recursion there and it also isn't correct wrt privacy I think

I re-read the code, and this recursive call might be a mistake in my generalization. I think it should be a np_random_bits() call to respect the structure of random_unit_vector().

One thing is that you need to write x = runtime.np_update(x, i, x_u) instead of just runtime.np_update(x, i, x_u). Otherwise things don't work properly when run with multiple parties. You can have a look yourself what the difference is between the two versions.

Ok, I'll take a look at that. I must admit np_update() is one of the last functions in MPyC that have sometimes unexpected behaviours in my experiments. In my tests, the function seemed to work even with multiples parties but I trust you on that point.

Also, a question is what kind of shapes for input array a are supported? Is is just 1D and 2D arrays for a. A quick test with a 3D array for a fails. That's due to the use of matmul which works on the innermost dimensions of a only.

It works for 1D and 2D arrays but I must admit that I haven't tested for 3(+)D arrays. I naively used the axis swap to extend it but I haven't thought of the matmul problem.

MarcT0K commented 1 year ago

as you are proposing, makes sense as a basic solution, and could be upgraded later for more efficiency. For this I'll also consider the work done with Niels de Vreede and Noah van der Meer, like we've discussed before, see https://github.com/lschoe/mpyc/issues/37#issuecomment-1323311271.

I am very interested by such a discussion. On my side, I have a working version of the Laur et al. shuffle (for 3 parties) and I rencently did a literature review on oblivious shuffling.

MarcT0K commented 1 year ago

I started fixing the code based on your remarks. However, I have one question about the np_update:

At the end of np_shuffle, I use a last np_update:

    runtime.np_update(a, range(a.shape[0]), x)

The goal of this np_update is to shuffle in place. I feel like it is also necessary to have a form of a = np_update(...). However, the assignment would be useless since a is the input array so only the local reference would be impacted by the assignment.

If the assignment is mandatory, I wonder whether I would need to have a different API for np_shuffle compared to shuffle (which shuffle in-place and returns None). My initial goal was to remain consistent with shuffle. This goal might be unnecessary since I already changed the API compared to shuffle: np_shuffle does not require the secure type as input since it is already contained in the array.

lschoe commented 1 year ago

So, indeed for np_shuffle we want the same behavior as NumPy's np.random.shuffle. That's also the goal with other MPyC substitutes for NumPy functions, if reasonably possible.

To get in-place results with secure NumPy arrays a is a bit trickier than for Python lists x of secure numbers say. In the latter case we can just assign to x[i]. But assigning to a[i] is different because we have to wait for a to be available, as a is a mpc.SecureObject itself. The current implementation of mpc.np_udate() kind of works in-place, but for safe processing it is designed to work with the array returned by it. Otherwise things may go wrong because of the asynchronous processing of the various calls to mpc functions,

Maybe this means that a call to np_shuffle(a) should be inside an await to ensure that the in-place step has been completed, before further processing of a; otherwise the later processing may work with the unshuffled version of a.

MarcT0K commented 1 year ago

I transformed the function into an async function and calls x = await mpc.gather(x) before updating a using x. It seems to work (even with multiple parties). I let you tell me whether I missed some edge cases that may cause troubles.

MarcT0K commented 1 year ago

While writing new unit tests to cover 100% of the patch, I removed the following condition:

    if issubclass(sectype, runtime.SecureObject):
        await runtime.returnType((sectype.array, True, (n,)))
    else:
        await runtime.returnType(asyncoro.Future)

I took inspiration from the function np_random_bits for this condition but I didn't manage to activate it in np_shuffle so I removed this condition. When is this condition necessary? Do I need it in np_shuffle?

lschoe commented 1 year ago

Indeed, that case with await runtime.returnType(asyncoro.Future) is not needed here. Such cases only arise for calls inside the MPyC kernel, where arguments are not of type mpc.SecureObject but say of type mpc.FiniteField representing "bare" secret shares, which is done for more efficiency.

lschoe / mpyc

Implement np_shuffle #49

Codecov Report