mhostetter / galois

A performant NumPy extension for Galois fields and their applications
https://mhostetter.github.io/galois/
MIT License

Matrix row reduction is very slow #192

Closed ddddavidee closed 3 years ago

ddddavidee commented 3 years ago

Hi,

I've tried to use this library to row reduce a fairly big matrix over a finite field. It succeeded, but took a very long time compared to a naïve implementation.

As an example, I've copy-pasted the timings from my Jupyter notebook.

with: https://www.nayuki.io/page/gauss-jordan-elimination-over-any-field

mat.reduced_row_echelon_form()

CPU times: user 4min 15s, sys: 18.7 ms, total: 4min 15s
Wall time: 4min 15s

with galois:

MM_reduced = MM.row_reduce()

CPU times: user 4h 21min 24s, sys: 3.06 s, total: 4h 21min 27s
Wall time: 4h 21min 28s

For reference, the matrix is sparse and defined over a 128-bit prime; its shape is 513x1025.
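For context, a matrix of that shape and field size might be sketched as follows in pure NumPy (the prime is the secp128r1 order quoted later in this thread; the ~1% density and the random fill are assumptions for illustration only):

```python
import numpy as np

# Prime quoted later in this thread (the order of secp128r1)
p = 340282366762482138443322565580356624661

rng = np.random.default_rng(0)

# dtype=object so each entry is an arbitrary-precision Python int,
# as required for elements of a 128-bit prime field
mat = np.zeros((513, 1025), dtype=object)

# Fill roughly 1% of the entries with nonzero elements
# (the ~1% density is an assumption, not from the original report)
n_nonzero = mat.size // 100
rows = rng.integers(0, mat.shape[0], n_nonzero)
cols = rng.integers(0, mat.shape[1], n_nonzero)
for r, c in zip(rows, cols):
    mat[r, c] = int(rng.integers(1, 2**62))
```

Arrays like this cannot use a fixed-width integer dtype, which is why galois falls back to `dtype=object` for fields this large.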

ddddavidee commented 3 years ago

Maybe it's somehow related to #136.

mhostetter commented 3 years ago

Hey, thanks for pointing out this issue.

I'll need to run some tests and dig into this. Since .row_reduce() is not JIT compiled, as referenced in #136, it is not as fast as it could be. However, your comparison is against a pure Python, naive implementation (or so it seems), and the performance delta is quite surprising to me. I thought I had a naive implementation as well. Perhaps I overlooked something.

Were the reduced row echelon forms the same?

mhostetter commented 3 years ago

Also, could you provide your example matrix and field (i.e., which prime are you using) for me to test against / with? You can save your matrix with np.save().

The difference may come from the reference implementation converting to row echelon form (REF) first and then to reduced row echelon form (RREF) second. It may also be amplified because your sparse matrix has few pivots; said differently, my implementation may be performing needless multiplies by zero. More investigation is needed.
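To make that last point concrete, here is a small self-contained illustration with arithmetic mod a small prime (function names and the example matrix are made up): eliminating only the rows with a nonzero entry in the pivot column gives the same result while skipping the multiplies by zero that dominate on sparse input.

```python
import numpy as np

p = 7  # small prime, for illustration only

def eliminate_all(A, pivot_row, pivot_col):
    """Subtract a multiple of the pivot row from every other row,
    even when the multiplier is zero (wasted work on sparse input)."""
    A = A.copy()
    for i in range(A.shape[0]):
        if i == pivot_row:
            continue
        A[i, :] = (A[i, :] - A[i, pivot_col] * A[pivot_row, :]) % p
    return A

def eliminate_nonzero(A, pivot_row, pivot_col):
    """Touch only the rows whose entry in the pivot column is nonzero."""
    A = A.copy()
    idxs = [i for i in np.nonzero(A[:, pivot_col])[0] if i != pivot_row]
    A[idxs, :] = (A[idxs, :]
                  - np.multiply.outer(A[idxs, pivot_col], A[pivot_row, :])) % p
    return A

A = np.array([[1, 2, 0], [0, 3, 1], [2, 0, 5]])
assert np.array_equal(eliminate_all(A, 0, 0), eliminate_nonzero(A, 0, 0))
```

On a 513x1025 matrix with few nonzeros per column, the second form skips the vast majority of row updates.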

ddddavidee commented 3 years ago

Yes, I can provide a file with the matrix. I'll do it later today or tomorrow morning (Europe time).

ddddavidee commented 3 years ago

Were the reduced row echelon forms the same?

yes, I tested with the following: print(MM.all() == MM_reduced.all())
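Worth noting (an aside, not from the thread): `.all()` collapses each array to a single boolean before the `==`, so that check can report True even when the matrices differ element-wise. A stricter comparison, shown on made-up arrays, would be:

```python
import numpy as np

a = np.array([[1, 0], [0, 1]])
b = np.array([[1, 0], [0, 2]])

# `.all()` reduces each array to one boolean (here both are False,
# since both arrays contain a zero), so the comparison reports True
# even though the matrices are different:
print(a.all() == b.all())     # True

# Element-wise comparison catches the difference:
print(np.array_equal(a, b))   # False
```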

ddddavidee commented 3 years ago

Here you have the matrix:

import galois
import numpy as np

order = secp128r1.order  # order of the secp128r1 curve, defined elsewhere in the notebook
GF = galois.GF(order)
matrix_np = np.asarray(MM)
np.savez_compressed("matrix.npz", matrix=matrix_np)

The order is: 340282366762482138443322565580356624661

matrix.zip
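For anyone reproducing this, a round-trip sketch of the save/load step (the key "matrix" matches the keyword used in the snippet above; note that `dtype=object` arrays require `allow_pickle=True` on load):

```python
import numpy as np
import tempfile, os

# Small stand-in for the real 513x1025 matrix
mat = np.array([[1, 2], [3, 4]], dtype=object)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "matrix.npz")
    np.savez_compressed(path, matrix=mat)
    # Object-dtype arrays are pickled inside the archive, so loading
    # them back requires explicitly opting in to pickle:
    loaded = np.load(path, allow_pickle=True)["matrix"]

print(np.array_equal(mat, loaded))  # True
```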

ddddavidee commented 3 years ago

I agree. But there is a huge difference between your implementation (around 4 minutes) and the galois one (4 hours). I'll try to modify the galois one to follow yours and check whether there's an improvement...

On Sat, 6 Nov 2021, 18:41, Nayuki wrote:

Matrix row reduction's time complexity is inherently O(n³). You can improve the run time by constant factors, but you can't make the procedure fundamentally faster. I think the problem is that your matrix is too big.


mhostetter commented 3 years ago

@nayuki I saw a comment from you, but don't see it anymore. I love your website and definitely used it when developing this library. Would love to collaborate in some way.

@ddddavidee I think potentially the issue arises in how the FieldArray is verified in the constructor. Since it uses dtype=object for such large fields, the __new__() method makes sure the objects in the array are Python ints, in the valid range, and so forth. Potentially that overhead is getting called far too often when all the row reduction operations are being performed.

I need to verify this, but that's my hunch.
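To illustrate the kind of overhead being described (a hypothetical sketch; the function name and checks below are assumptions, not galois's actual constructor code), per-element validation of a `dtype=object` array is O(n·m) Python-level work on every call:

```python
import numpy as np

# Field order quoted in this thread (the order of secp128r1)
ORDER = 340282366762482138443322565580356624661

def verify_elements(arr, order=ORDER):
    """Hypothetical constructor-style validation for a dtype=object
    array: every element must be a Python int in [0, order)."""
    for x in arr.flat:
        if not isinstance(x, int):
            raise TypeError(f"expected int, got {type(x).__name__}")
        if not 0 <= x < order:
            raise ValueError(f"{x} is outside [0, {order})")
    return arr

A = np.array([[1, 2], [3, 4]], dtype=object)
verify_elements(A)  # fine, but costs a full Python loop per call
```

If a check like this runs on every intermediate array created inside a row-reduction loop, the validation cost can easily dominate the arithmetic itself.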

nayuki commented 3 years ago

@mhostetter Thanks. My comment reflected a misunderstanding of the problem @ddddavidee was facing.

My opinion now is that probably both my algorithm and your galois library algorithm are O(n³), but supposedly yours has a constant factor 100× bigger.

mhostetter commented 3 years ago

The issue does mainly reside where I previously suggested -- unnecessary overhead in "creating" or "setting" arrays. I'm working on this in #200. I already have a 42× speed-up. Originally, @nayuki's code was ~60× faster. There might be more performance to squeeze out.

ddddavidee commented 3 years ago

@nayuki, what I meant is that the same computation took very different times (~60×) going from your code to the one in galois... and I was thinking there was a problem somewhere...

@mhostetter I pulled the branch and re-tried the computation (with some other small changes: I used lru_cache from functools and short-circuited the division-by-one case), and now the computation on the very same matrix takes less than two minutes (Wall time: 1min 43s).

mhostetter commented 3 years ago

@ddddavidee you beat me to it! Thanks for the pre-emptive testing. 😄

Could you enumerate your additional changes, either here or in the comments of #200? If generally-applicable, I'd like to incorporate them too.

mhostetter commented 3 years ago

@ddddavidee I pushed another performance improvement to that branch. You'll likely see another speed up.

ddddavidee commented 3 years ago

@ddddavidee you beat me to it! Thanks for the pre-emptive testing. 😄

:-) :-D Thank you a lot for working on the issue!

Could you enumerate your additional changes, either here or in the comments of #200? If generally-applicable, I'd like to incorporate them too.

I simply added a few @lru_cache decorators (from functools import lru_cache) on the add, subtract, divide, and multiply functions, hoping to accelerate the whole computation by caching some operation results. And in the divide function I added:

if b == 1:
   return a
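For illustration, a minimal standalone version of that caching change (the function names and the self-contained modular arithmetic below are assumptions; the actual edits were to galois's internal arithmetic functions):

```python
from functools import lru_cache

# Field order quoted in this thread (the order of secp128r1, a prime)
P = 340282366762482138443322565580356624661

@lru_cache(maxsize=None)
def multiply(a, b):
    return (a * b) % P

@lru_cache(maxsize=None)
def divide(a, b):
    if b == 1:          # the division-by-one shortcut described above
        return a
    # Invert via Fermat's little theorem: b**(P-2) mod P, valid since P is prime
    return (a * pow(b, P - 2, P)) % P

x = multiply(12345, 67890)
assert divide(x, 67890) == 12345   # round-trip through the field
multiply(12345, 67890)             # second call is served from the cache
```

Caching pays off here because a sparse matrix reuses the same operand pairs many times, and each 128-bit modular inverse is far more expensive than a dictionary lookup.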
ddddavidee commented 3 years ago

@ddddavidee I pushed another performance improvement to that branch. You'll likely see another speed up.

Yes, even faster: around 40 seconds.

mhostetter commented 3 years ago

@ddddavidee sweet, glad you've seen improvement. I just merged #200 into master. Feel free to close if you feel the issue has been resolved.

ddddavidee commented 3 years ago

Yes, I'm closing it. Thanks a lot!

ddddavidee commented 3 years ago

@ddddavidee you beat me to it! Thanks for the pre-emptive testing. 😄

Could you enumerate your additional changes, either here or in the comments of #200? If generally-applicable, I'd like to incorporate them too.

With this code, the row_reduce() method is 2× faster. The code is heavily inspired by @nayuki's implementation. This is a quick adaptation and can surely be optimized further (the variables p and numpivots are essentially redundant).

import numpy as np

def row_reduce(A, ncols=None):
    if A.ndim != 2:
        raise ValueError(f"Only 2-D matrices can be converted to reduced row echelon form, not {A.ndim}-D.")

    ncols = A.shape[1] if ncols is None else ncols
    nrows = A.shape[0]
    A_rre = A.copy()
    p = 0  # The pivot row

    numpivots = 0
    for j in range(ncols):
        if numpivots >= nrows:
            break
        # Find a pivot in column `j` at or below row `p`
        idxs = np.nonzero(A_rre[p:, j])[0]
        if idxs.size == 0:
            continue
        i = p + idxs[0]  # Row with a pivot
        numpivots += 1

        # Swap rows `p` and `i`. The pivot is now located at row `p`.
        A_rre[[p, i], :] = A_rre[[i, p], :]

        # Force the pivot value to be 1
        A_rre[p, :] /= A_rre[p, j]

        # Force zeros below the pivot. Note the `p + 1 +` offset, which
        # converts the indices returned by np.nonzero (relative to the
        # slice) back into absolute row indices.
        idxs = p + 1 + np.nonzero(A_rre[p + 1:, j])[0]
        A_rre[idxs, :] -= np.multiply.outer(A_rre[idxs, j], A_rre[p, :])

        p += 1
        if p == A_rre.shape[0]:
            break

    # Back-substitute to force zeros above each pivot
    for i in reversed(range(numpivots)):
        # Find the pivot column of row `i`
        idxs = np.nonzero(A_rre[i, :])[0]
        if idxs.size == 0:
            continue
        j = idxs[0]

        # Rows above `i` with a nonzero entry in the pivot column
        idxs = np.nonzero(A_rre[:i, j])[0]
        A_rre[idxs, :] -= np.multiply.outer(A_rre[idxs, j], A_rre[i, :])

    return A_rre