Memory footprint of sorting medium-size list with M>= 1

MarcT0K commented 1 year ago

Hello, I've run some experiments where I sort "medium-size" secure lists (i.e., around 1K elements). For example, see the following snippet:

async def main():
    sectype = mpc.SecInt(64)
    await mpc.start()
    if mpc.pid == 0:
        rand_list = [random.randint(0, 1024) for _ in range(1000)]
    else:
        rand_list = None
    rand_list = await mpc.transfer(rand_list, senders=0)

    sec_list = [sectype(i) for i in rand_list]
    sorted = mpc.sorted(sec_list)
    print(await mpc.output(sorted))
    await mpc.shutdown()

If I execute my script with no argument or with -M0, the script finishes without any issue. If I use -M3 or even -M1, the Python process takes more and more memory. Sometimes, it even fills my memory entirely for larger lists or larger M. On the other hand, with -M0, the memory footprint is constant (and much smaller).

I suppose this behaviour comes from the secret sharing costs but I wonder whether I can optimize it to reduce the memory footprint. For example, I see several nested loops in the _sort() function and wonder whether the coroutine stores the "comparison shares" from all comparison rounds until the end instead of releasing some of them round by round.

To sum up, is there a simple trick I missed in the documentation or the code to optimize the memory footprint in these case?

lschoe commented 1 year ago

What you are observing is indeed expected behavior. You can also see this for the demo mpyc/demos/sort.py if you run it like this, using -L64 to set the bit length of the secure numbers to 64 bits:

$ python sort.py 256 -L64
$ python sort.py 256 -M 1 -L64
$ python sort.py 256 -M3 -L 64

If -M is not set (or with -M0, or with --no-async), MPyC runs with the asynchronous mode disabled. This means that all steps inside MPyC coroutines are executed immediately, much as in ordinary execution of Python functions.

With -M1 the program runs for one party in asynchronous mode. This means that all steps in MPyC coroutines are scheduled as Tasks, and executed in a FIFO manner using Python's asyncio event loop.

For -M3 the program must run in asynchronous mode because of the communication with the other parties.

The built-in routine mpc._sort() for secure sorting will first expand the entire circuit for Batcher's merge-exchange sort, and then start to evaluate all the secure comparisons and swaps. For 1000 integers of 64 bits this is quite some work, and the memory footprint will be substantial. For -M3 your machine has to do this locally for all three parties, so that gets a bit better memorywise when running the program with three parties on remote machines.

To lower the memory footprint reducing the size of the secure integers should help a bit (default is L=32).

There will also be a vectorized version using secure NumPy arrays for sorting, which will reduce the memory footprint considerably. Achieving the same effect as already done for instance for mpc.np_argmax() versus mpc.argmax(). Hopefully later this month ...

MarcT0K commented 1 year ago

Thank you, I better understand the phenomenon now. I would have a follow-up question: is there a way to control the expansion of the circuit?

For example, I envision two possibilities:

Limit the number of actives coroutines: so the CPU does not expand the entire circuit.
A method to wait that all ongoing coroutines are done. For example, I could reimplement the sort algorithm and call the waiting statement at the end of a given loop iteration. This solution would be more manual but would prevent the memory overflow.

These solutions create a memory-footprint/round-complexity trade-off which is necessary for large lists.

lschoe commented 1 year ago

The methods mpc.barrier() and mpc.throttler() are available to limit the circuit expansion. The use of barriers is mentioned in the explanation of the cnnmnist.py demo. You can also search for "barrier" and "throttler" in mpyc/demos/*.py to see some more examples.

I've done a quick experiment already with secure NumPy arrays, and then sorting 1000 numbers with 3 parties takes like 10 seconds. The circuit size also remains much smaller, linear in the depth of the sorting network, so much less of an issue.

MarcT0K commented 1 year ago

Thanks, these functions should do the job.

Wow, this is a nice speed up. Did you just parallelize the batcher sort or did you also switch to a quicksort or a radix sort?

lschoe commented 1 year ago

It's a vectorized version of Batcher's merge-exchange sort algorithm as used by mpc.sort() already. Here's a preview of the method that will be added to mpyc.runtime.Runtime:

    def np_sort(self, a, axis=-1): 
        if axis is None:
            a = self.np_flatten(a)
            axis = 0
        n = a.shape[axis]
        if a.size == 0 or n <= 1:
            return a

        a = self.np_swapaxes(a, axis, -1)  # switch to last axis
        t = (n-1).bit_length()  # n >= 2, so t >= 1
        p = 1 << t-1
        while p:
            d, q, r = p, 1 << t-1, 0
            while d:
                I = np.fromiter((i for i in range(n - d) if i & p == r), dtype=int)
                b0 = a[..., I]
                b1 = a[..., I + d]
                h = (b0 >= b1) * (b0 - b1)
                b0, b1 = b0 - h, b1 + h
                a = self.np_update(a, (..., I), b0)
                a = self.np_update(a, (..., I + d), b1)
                d, q, r = q - p, q >> 1, p
            p >>= 1
        a = self.np_swapaxes(a, axis, -1)  # restore original axis
        return a

MarcT0K commented 1 year ago

Looks nice! Looking forward to the next release!

lschoe / mpyc

Memory footprint of sorting medium-size list with M>= 1 #44