Closed MarcT0K closed 1 year ago
What you are observing is indeed expected behavior. You can also see this for the demo mpyc/demos/sort.py if you run it like this, using -L64
to set the bit length of the secure numbers to 64 bits:
$ python sort.py 256 -L64
$ python sort.py 256 -M 1 -L64
$ python sort.py 256 -M3 -L 64
If -M
is not set (or with -M0
, or with --no-async
), MPyC runs with the asynchronous mode disabled. This means that all steps inside MPyC coroutines are executed immediately, much as in ordinary execution of Python functions.
With -M1
the program runs for one party in asynchronous mode. This means that all steps in MPyC coroutines are scheduled as Tasks, and executed in a FIFO manner using Python's asyncio
event loop.
For -M3
the program must run in asynchronous mode because of the communication with the other parties.
The built-in routine mpc._sort()
for secure sorting will first expand the entire circuit for Batcher's merge-exchange sort, and then start to evaluate all the secure comparisons and swaps. For 1000 integers of 64 bits this is quite some work, and the memory footprint will be substantial. For -M3
your machine has to do this locally for all three parties, so that gets a bit better memorywise when running the program with three parties on remote machines.
To lower the memory footprint reducing the size of the secure integers should help a bit (default is L=32).
There will also be a vectorized version using secure NumPy arrays for sorting, which will reduce the memory footprint considerably. Achieving the same effect as already done for instance for mpc.np_argmax()
versus mpc.argmax()
. Hopefully later this month ...
Thank you, I better understand the phenomenon now. I would have a follow-up question: is there a way to control the expansion of the circuit?
For example, I envision two possibilities:
These solutions create a memory-footprint/round-complexity trade-off which is necessary for large lists.
The methods mpc.barrier()
and mpc.throttler()
are available to limit the circuit expansion. The use of barriers is mentioned in the explanation of the cnnmnist.py demo. You can also search for "barrier" and "throttler" in mpyc/demos/*.py
to see some more examples.
I've done a quick experiment already with secure NumPy arrays, and then sorting 1000 numbers with 3 parties takes like 10 seconds. The circuit size also remains much smaller, linear in the depth of the sorting network, so much less of an issue.
Thanks, these functions should do the job.
Wow, this is a nice speed up. Did you just parallelize the batcher sort or did you also switch to a quicksort or a radix sort?
It's a vectorized version of Batcher's merge-exchange sort algorithm as used by mpc.sort()
already. Here's a preview of the method that will be added to mpyc.runtime.Runtime
:
def np_sort(self, a, axis=-1):
if axis is None:
a = self.np_flatten(a)
axis = 0
n = a.shape[axis]
if a.size == 0 or n <= 1:
return a
a = self.np_swapaxes(a, axis, -1) # switch to last axis
t = (n-1).bit_length() # n >= 2, so t >= 1
p = 1 << t-1
while p:
d, q, r = p, 1 << t-1, 0
while d:
I = np.fromiter((i for i in range(n - d) if i & p == r), dtype=int)
b0 = a[..., I]
b1 = a[..., I + d]
h = (b0 >= b1) * (b0 - b1)
b0, b1 = b0 - h, b1 + h
a = self.np_update(a, (..., I), b0)
a = self.np_update(a, (..., I + d), b1)
d, q, r = q - p, q >> 1, p
p >>= 1
a = self.np_swapaxes(a, axis, -1) # restore original axis
return a
Looks nice! Looking forward to the next release!
Hello, I've run some experiments where I sort "medium-size" secure lists (i.e., around 1K elements). For example, see the following snippet:
If I execute my script with no argument or with
-M0
, the script finishes without any issue. If I use-M3
or even-M1
, the Python process takes more and more memory. Sometimes, it even fills my memory entirely for larger lists or larger M. On the other hand, with-M0
, the memory footprint is constant (and much smaller).I suppose this behaviour comes from the secret sharing costs but I wonder whether I can optimize it to reduce the memory footprint. For example, I see several nested loops in the
_sort()
function and wonder whether the coroutine stores the "comparison shares" from all comparison rounds until the end instead of releasing some of them round by round.To sum up, is there a simple trick I missed in the documentation or the code to optimize the memory footprint in these case?