sampotter / fluxpy

Fast thermal modeling and radiosity in Python.

try to fix assembly to get O(N) scaling #61

Closed sampotter closed 2 years ago

sampotter commented 2 years ago

Here's a fix for the O(N^2) scaling problem Stefano observed.

For reference: when computing a compressed block of the form factor matrix, the old algorithm compared only the CSR block and the SVD block at each node and kept whichever was smaller.

The problem with this: for very fine meshes with rough topography (unlike the spherical crater, which avoids the issue since its true continuous form factor integral is essentially rank 1!), very large blocks near the top of the tree can eventually be stored in less space as SVDs of size O(N^2) than as CSR blocks, resulting in O(N^2) scaling of MVP time and space overall.

The main change: at each level, we build the CSR block, the SVD block, and the "child block" (by descending another level recursively). We compare the size of all three and use the smallest.
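Schematically, the comparison looks something like the sketch below. This is only an illustration, not the code in compressed_form_factors.py: the real blocks come from the spatial tree over mesh faces, and the SVD truncation tolerance plays the role of eps.

    import numpy as np
    import scipy.sparse

    def csr_nbytes(B):
        # bytes needed to store the block in CSR format
        csr = scipy.sparse.csr_matrix(B)
        return csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

    def svd_nbytes(B, tol=1e-2):
        # bytes needed to store a truncated SVD of the block at tolerance tol
        U, S, Vt = np.linalg.svd(B, full_matrices=False)
        k = max(1, int((S > tol*S[0]).sum()))  # numerical rank at tolerance tol
        return U[:, :k].nbytes + S[:k].nbytes + Vt[:k].nbytes

    def assemble(B, min_size=16384):
        # return (kind, nbytes) for the cheapest representation of block B
        if B.size <= min_size or min(B.shape) < 2:
            return 'csr', csr_nbytes(B)    # small blocks are stored directly as CSR
        m, n = B.shape
        # "child" block: split 2x2 and assemble each child recursively
        children = [assemble(B[i*m//2:(i+1)*m//2, j*n//2:(j+1)*n//2], min_size)
                    for i in range(2) for j in range(2)]
        candidates = {
            'csr': csr_nbytes(B),
            'svd': svd_nbytes(B),
            'child': sum(nbytes for _, nbytes in children),
        }
        kind = min(candidates, key=candidates.get)
        return kind, candidates[kind]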

Here are the results of running the Gerlache example:

gerlache_plots.zip

steo85it commented 2 years ago

The code looks good/clear (thanks!) and I'm repeating previous tests on the cluster (single machine, for now) to see how performance changes.

steo85it commented 2 years ago

Test results are starting to look reasonably good, but it's hard to extrapolate a power law for the moment. At 50 mpp (second-to-last point) the eps=1e-2 curve seemed to follow the true FF, while at 30 mpp things look better (I have an additional point running at 20 mpp, but it takes a while...).

[image]

The FFs look better, at least: the algorithm seems to explore a bit deeper instead of getting stuck on the first SVD for the off-diagonal elements. In the image below, the upper row has the crater meshed at 400 mpp and the lower at 50 mpp; the left column shows the new results after the mod, the right column the old status.

[image]

One possible issue I see is that the algorithm seems much more efficient at exploring the tree at 400 mpp (the off-diagonal terms aren't all SVDs), while at 50 mpp (and beyond) most of the off-diagonal elements are kept as SVDs (apart from very peripheral ones). Could that explain why the timings get worse/closer to the true FF at higher resolutions?

P.S.: please ignore the minor differences between the two tests at 400 mpp; I can explain those and they are not an issue.

sampotter commented 2 years ago

One thing that could speed everything up is to make sure the min_size parameter isn't too small. The performance of the compressed FF matrix should behave roughly like a parabola with respect to min_size: too small => many tiny blocks and more overhead; too big => not enough compression applied. The default is 16384, but this could be played with. (Basically, the end effect of changing min_size should be to shift the red and purple graphs up or down... the optimal min_size is whatever gives the lowest position, if that makes sense.)
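If it's useful, a brute-force sweep could look something like the sketch below. This is only a sketch: build_compressed_FF stands in for however the compressed matrix actually gets assembled from the shape model (via compressed_form_factors.py), and the nbytes attribute is assumed to report the compressed storage size.

    import time
    import numpy as np

    def sweep_min_size(build_compressed_FF, shape_model,
                       min_sizes=(2048, 8192, 16384, 65536)):
        """Record compressed size and MVP time for a range of min_size values."""
        results = []
        for min_size in min_sizes:
            FF = build_compressed_FF(shape_model, min_size=min_size)  # hypothetical builder
            x = np.ones(FF.shape[1])
            t0 = time.perf_counter()
            y = FF@x                                       # one MVP
            t_mvp = time.perf_counter() - t0
            results.append((min_size, FF.nbytes, t_mvp))   # nbytes assumed available
        return results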

Do you mind adding a plot of the matrix sizes?

steo85it commented 2 years ago

Sure, here it is (left: old, right: post-mods). Sorry that the left plot shows only some (and not the same) points; I was running additional tests at the time, but the general picture should be clear.

[image]

And yes, good point, but "too small" should mostly affect the FF compression rather than its use in the products, right? Or maybe both, actually, I'm not sure. Sizes look similar between the old and new approaches, and are certainly a strength of the method.

nschorgh commented 2 years ago

I'm wondering why the prefactor for the time to compute T is so high for the compressed matrix. The amount of arithmetic for the matrix-vector multiply should be smaller, not larger, even for small N. Presumably this is a data movement issue. One could tell simply by looking at the CPU load, which will be close to 100% (per thread) if the calculation is dominated by arithmetic and lower if the bottleneck lies elsewhere.
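For what it's worth, a rough way to automate that check (just a sketch; psutil is a third-party package and not a dependency of this repo) is to sample the process CPU load from a background thread while repeatedly doing the MVP:

    import threading
    import time
    import numpy as np
    import psutil  # third-party: pip install psutil

    def mvp_cpu_load(FF, n_reps=20):
        """Sample this process's CPU load (percent of one core) during repeated MVPs."""
        proc = psutil.Process()
        samples = []
        done = threading.Event()

        def sampler():
            proc.cpu_percent(None)  # prime the counter
            while not done.is_set():
                time.sleep(0.1)
                samples.append(proc.cpu_percent(None))  # >100 means several busy threads

        t = threading.Thread(target=sampler)
        t.start()
        x = np.ones(FF.shape[1])
        for _ in range(n_reps):
            y = FF@x
        done.set()
        t.join()
        return samples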

sampotter commented 2 years ago

Sure, here it is (left: old, right: post-mods). Sorry that the left plot shows only some (and not the same) points; I was running additional tests at the time, but the general picture should be clear. [image]

This is a bit strange... I would have expected the old plots to also show O(N^2) scaling for compressed size. Now I'm not so sure what's going on. :-) My hypothesis was that the "big SVDs" were growing like O(N^2) in size (=> O(N^2) MVPs...)... but if they are not... then it may be a bug in the old code... Not sure.

And yes, good point, but "too small" should mostly affect the FF compression rather than its use in the products, right? Or maybe both, actually, I'm not sure.

It would affect both the assembly and the MVPs (time and space).

sampotter commented 2 years ago

I'm wondering why the prefactor for the time to compute T is so high for the compressed matrix. The amount of arithmetic for the matrix-vector multiply should be smaller, not larger, even for small N. Presumably this is a data movement issue. One could tell simply by looking at the CPU load, which will be close to 100% (per thread) if the calculation is dominated by arithmetic and lower if the bottleneck lies elsewhere.

It is quite possible that the prefactor relates to data movement. Python will be very inefficient here compared to C or Fortran.

steo85it commented 2 years ago

Hmm, maybe it's time for a bit of profiling again, to see whether there are some "data movement" spots that are needlessly slow? And no, the sizes always looked fine (which I thought was to be expected from reducing large off-diagonal blocks to SVDs, but sure, it depends on how many terms are kept).

The amount of arithmetic for the matrix-vector multiply should be smaller, not larger, even for small N. Presumably this is a data movement issue.

Is this really what we expect for small N? If so, it would make testing, profiling, and optimizing much easier (instead of recomputing an FF for 12 hours after each modification :) ). Anyway, I'm collecting some profiling info now.

steo85it commented 2 years ago

OK, 95% of the time spent computing T with the compressed FF is related to calling the matmat functions in compressed_form_factors.py, as expected. In particular, this function is called:

    def _matmat(self, x):
        y = np.zeros((self.shape[0], x.shape[1]), dtype=self.dtype)
        for i, row_inds in enumerate(self._row_block_inds):
            for j, col_inds in enumerate(self._col_block_inds):
                block = self._blocks[i, j]
                if block.is_empty_leaf: continue
                y[row_inds] += block@x[col_inds]     # << this line!
        return y

Then I can trace 50% of the time to the matmat of the different kinds of blocks (SVD: 70 us/hit; sparse: 20 us/hit; dense: 36 us/hit). I don't know where the other 50% of the time is spent (maybe in traversing the blocks/levels? which function would that be?). Moving on to profiling the true FF.
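In case it helps with the untraced 50%: cProfile (standard library) catches every Python-level function in the recursion, so it should show the block/level traversal as well as the per-block matmat calls. A minimal sketch:

    import cProfile
    import pstats
    import numpy as np

    def profile_mvp(FF, n_reps=10):
        """Profile repeated MVPs and print the functions with the largest cumulative time."""
        x = np.ones((FF.shape[1], 1))
        profiler = cProfile.Profile()
        profiler.enable()
        for _ in range(n_reps):
            y = FF@x
        profiler.disable()
        pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)  # top 20 entries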

sampotter commented 2 years ago

I'm about to teach, but I'll take a look after... one quick thing to check: is the O(N^2) time for solve_T only, or for the FF MVP as well? (What is the time scaling of a single FF MVP?)


steo85it commented 2 years ago

It's the whole T = compute_steady_state_temp(FF, E, rho, emiss) which includes two solve_radiosity calls. That's it.

steo85it commented 2 years ago

Also, where/how does B1 = E + FF@(rho*B) happen for the true FF? I couldn't find where the time is spent (is it computed in place with the standard matmul/@ operator?)

sampotter commented 2 years ago

It's the whole T = compute_steady_state_temp(FF, E, rho, emiss) which includes two solve_radiosity calls. That's it.

OK, thanks.

If you don't mind, could you try changing the lines like:

B = solve_radiosity(FF, E, rho, 'right', method)[0]

to:

B, niter = solve_radiosity(FF, E, rho, 'right', method)

and seeing what the total number of iterations is? I believe the total number of iterations should be O(1) independent of the mesh resolution, but this isn't something we've actually checked... (That is, count the total number of MVPs done by compute_steady_state_temp...)
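Another option for counting the MVPs without touching solve_radiosity itself (just a sketch; it assumes the solver only uses @, shape, and dtype on FF) is a thin counting wrapper:

    class CountingOperator:
        """Wrap an FF-like operator and count the MVPs performed through it."""

        def __init__(self, FF):
            self.FF = FF
            self.shape = FF.shape
            self.dtype = FF.dtype
            self.num_mvps = 0

        def __matmul__(self, x):
            self.num_mvps += 1
            return self.FF@x

    # usage sketch:
    #   FF_counted = CountingOperator(FF)
    #   T = compute_steady_state_temp(FF_counted, E, rho, emiss)
    #   print(FF_counted.num_mvps)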

It would also be helpful to make a timing plot directly comparing the MVP times.
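For the MVP timing itself, something as simple as the sketch below, repeated across mesh resolutions for both the true and compressed FF, should be enough:

    import time
    import numpy as np

    def time_mvp(FF, n_reps=10):
        """Average wall time of a single MVP with FF."""
        x = np.ones(FF.shape[1])
        t0 = time.perf_counter()
        for _ in range(n_reps):
            y = FF@x
        return (time.perf_counter() - t0)/n_reps

    # e.g. plot time_mvp(FF_true) and time_mvp(FF_compressed) against N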

Also, where/how does the B1 = E + FF@(rho*B) happen for the true FF? I couldn't find where the time is spent (is it computed in place with the standard matmul/@?)

Since FF would be an instance of scipy.sparse.csr_matrix in that case, something like scipy.sparse.csr_matrix.__matmul__ would carry out the MVP (or whatever the matrix multiplication method is called).

OK, 95% of the time spent computing T with the compressed FF is related to calling the matmat functions in compressed_form_factors.py, as expected. In particular, this function is called

Hmm... nothing about this is surprising...

Let me know what I can do to help...

steo85it commented 2 years ago

Yes, there are 7 iterations (for E; 12-13 for IR), independently of the compression (or the true FF) and of resolution (at least for meshes down to 100 mpp). I would be curious to know where the 50% of the time within each @ product that I cannot trace is spent... I think it's just "getting the blocks", but I'm not sure which function to decorate (I could just decorate them all, but still).

Since FF would be an instance of scipy.sparse.csr_matrix in that case, something like scipy.sparse.csr_matrix.__matmul__ would carry out the MVP (or whatever the matrix multiplication method is called).

And OK, if it's the standard sparse product, then it makes sense that I can't trace it, and I confirm that it's faster (by about 30%) than the one used for the compressed FF on low-resolution problems (400 mpp). Not sure whether that's expected or not: it involves more elements, but they're probably accessed/handled more efficiently.

steo85it commented 2 years ago

We could also merge this one, as far as I'm concerned: it didn't help much, but it didn't hurt, and it should be an improvement, however marginal (maybe the impact will be stronger once we solve the "block loop" slowdown).

sampotter commented 2 years ago

Agreed. Done