nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0

legate numpy very slow compared to Python+Numpy #29

Closed: laytonjbgmail closed this issue 3 years ago

laytonjbgmail commented 3 years ago

I've been testing a simple Laplace equation solver to compare Python+NumPy against legate.numpy, and legate is hugely slower than NumPy.

The code is taken from https://barbagroup.github.io/essential_skills_RRC/laplace/1/. The code I actually run is the following:

import numpy as np
import time

def L2_error(p, pn):
    return np.sqrt(np.sum((p - pn)**2)/np.sum(pn**2))
# end def

def laplace2d(p, l2_target):
    '''Iteratively solves the Laplace equation using the Jacobi method

    Parameters:
    ----------
    p: 2D array of float
        Initial potential distribution
    l2_target: float
        target for the difference between consecutive solutions

    Returns:
    -------
    p: 2D array of float
        Potential distribution after relaxation
    '''

    l2norm = 1.0
    icount = 0
    tot_time = 0.0
    pn = np.empty_like(p)
    while l2norm > l2_target:

        start = time.perf_counter()

        icount = icount + 1
        pn = p.copy()
        p[1:-1,1:-1] = .25 * (pn[1:-1,2:] + pn[1:-1, :-2] \
                              + pn[2:, 1:-1] + pn[:-2, 1:-1])

        ##Neumann B.C. along x = L
        p[1:-1, -1] = p[1:-1, -2]     # 1st order approx of a derivative 
        l2norm = L2_error(p, pn)
        end = time.perf_counter()

        tot_time = tot_time + (end-start)

    # end while

    print("l2norm = ",l2norm)
    print("icount = ",icount)
    print("Total Iteration Time = ",tot_time)
    print("   Time per iteration = ",tot_time/icount)

    return p
# end def

if __name__ == "__main__":

    nx = 401
    ny = 401

    # Initial conditions
    p = np.zeros((ny,nx)) ## create an ny x nx array of zeros

    # Dirichlet boundary conditions
    x = np.linspace(0,1,nx)
    p[-1,:] = np.sin(1.5*np.pi*x/x[-1])
    del x

    start = time.time()
    p = laplace2d(p.copy(), 1e-8)
    stop = time.time()

    print("Elapsed time = ",(stop-start)," secs")
    print(" ")

# end if

When I run it on my laptop with Anaconda Python3 and Numpy I get the following:

$ python3 jacobi.py 
l2norm =  9.99986062249016e-09
icount =  153539
Total Iteration Time =  127.02529454990054
   Time per iteration =  0.0008273161512703648
Elapsed time =  127.14257955551147  secs

When I change the import line to legate.numpy, I usually stop the code after 15 minutes of wall time. I have let it run for up to 60 minutes and it never converges.

As a check, I've run the plain NumPy code under legate itself, and it exactly matches the NumPy results.

I have been experimenting with replacing the l2norm computation with explicit NumPy functions (np.subtract, np.square, etc.), but I have seen no performance improvement.
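For reference, the kind of ufunc-based variant I tried looks roughly like this (a sketch, not my exact code):

def L2_error_ufunc(p, pn):
    # same quantity as L2_error above, spelled with explicit ufunc calls
    diff = np.subtract(p, pn)
    num = np.sum(np.square(diff))
    den = np.sum(np.square(pn))
    return np.sqrt(np.divide(num, den))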

Does anyone have any recommendations?

Thanks!

Jeff

(edit by Manolis: added some formatting for the code sections)

piyueh commented 3 years ago

I haven't run this script, but I highly suspect the problem is caused by this line

p[1:-1, -1] = p[1:-1, -2]

If so, this is related to issue #16.

Update: issue #16 indeed affects the iterations, but not the performance. The solution converges to a wrong answer because issue #16 results in the wrong boundary condition at x = L. Nevertheless, it still converges to something, with a convergence history similar to the correct one, so the performance issue is caused by something else.

lightsighter commented 3 years ago

I actually don't think issue #16 is impacting correctness in this case. This is a basic indexing copy on both sides and we get the transforms right in those cases. It's only when you mix in advanced indexing that we run into problems like those in #16.
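For context, the distinction in NumPy terms is roughly the following (a quick illustration, not Legate-specific code):

# basic indexing: slices only; this is the kind of assignment in the script
p[1:-1, -1] = p[1:-1, -2]

# advanced indexing: an integer (or boolean) index array selects elements,
# which goes through a different code path than plain slicing
idx = np.array([0, 2, 4])
rows = p[idx, :]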

That said, I've been unable to reproduce this issue as described. I ran the code above with a couple of different configurations and couldn't make it take more than 4 minutes. In all cases we converged in the same number of iterations as the original implementation with the same L2-norm value.

mbauer:/local/home/mbauer/LegateNumPy$ ../LegateCore/install/bin/legate laplace.py --cpus 1
l2norm =  9.99986062249016e-09
icount =  153539
Total Iteration Time =  223.62144905142486
   Time per iteration =  0.0014564472157004074
Elapsed time =  223.82319235801697  secs

mbauer:/local/home/mbauer/LegateNumPy$ ../LegateCore/install/bin/legate laplace.py --cpus 4
l2norm =  9.99986062249016e-09
icount =  153539
Total Iteration Time =  220.62398342695087
   Time per iteration =  0.0014369247124636144
Elapsed time =  220.76576781272888  secs

mbauer:/local/home/mbauer/LegateNumPy$ ../LegateCore/install/bin/legate laplace.py --gpus 1
l2norm =  9.99986062249016e-09
icount =  153539
Total Iteration Time =  215.94193586241454
   Time per iteration =  0.001406430521642153
Elapsed time =  216.0937101840973  secs

While this is certainly slower than NumPy, it's not the >15 minutes originally reported. This kind of slow-down (~2-3X) compared to NumPy is actually expected behavior for Legate on tiny problems like this. The extra levels of indirection needed to allow Legate to scale to thousands of GPUs come with some overhead, and on small problems like this that overhead dominates performance. Legate really won't become efficient until it's processing 100s of MBs of data, and even then, we'll do much better when we're handling at least a few GBs of data on every node. In fact, I'm pretty sure this problem isn't even big enough to trigger execution with Legion. I think our extra level of indirection is just looking at the data and then turning around and handing it straight back to NumPy, but the cost of doing that logic on every NumPy operation is about 2-3X more than the cost of the math for the operation itself in this case, hence the slowdown.

piyueh commented 3 years ago

My test result is similar to @laytonjbgmail's original post. After changing import numpy as np to from legate import numpy as np, the run becomes very slow. With vanilla NumPy and nx = ny = 51, the run took 0.23 seconds, while the Legate version took 27 seconds with legate --cpus 1 laplace.py. I didn't test with the original mesh size (i.e., nx = ny = 401) because it took too long to finish.

Not only is the run slow, but the solution is also wrong. Adding the assertion assert np.allclose(p[1:-1, -1], p[1:-1, -2]) (either after p = laplace2d(...) or after p[1:-1, -1] = p[1:-1, -2]) raises an AssertionError and shows that the copy wasn't successful. In addition, when comparing the result to the analytical solution (not shown in the original post), the relative L2 error is 35%, while with vanilla NumPy it is just 1.8% (with mesh size nx = ny = 51).

@lightsighter If you want to try my version of test, here is the link: https://gist.github.com/piyueh/c426132063f398b0557bb7c0ee01c572

My version basically just adds a check function that compares the result with the analytical solution. (This specific Laplace equation has an analytical solution.)
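The check is essentially the following sketch (the gist has the exact code). Here y = np.linspace(0, 1, ny) and x is the same array as in the original script; the closed-form expression is the standard separable solution for these boundary conditions on the unit square:

def p_analytical(x, y):
    # exact solution: p = 0 on the left and bottom boundaries,
    # p = sin(1.5*pi*x) on the top, zero x-derivative at x = L
    X, Y = np.meshgrid(x, y)
    return (np.sinh(1.5 * np.pi * Y / y[-1])
            / np.sinh(1.5 * np.pi * y[-1] / x[-1])
            * np.sin(1.5 * np.pi * X / x[-1]))

def check(p, x, y):
    # relative L2 error of the numerical result against the exact solution
    p_exact = p_analytical(x, y)
    return np.sqrt(np.sum((p - p_exact)**2) / np.sum(p_exact**2))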

lightsighter commented 3 years ago

I was previously unable to reproduce the issue because the try/except block wrapping my import of Legate NumPy was masking a configuration error.
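In other words, a guarded import of roughly this shape will silently fall back to vanilla NumPy if the Legate import fails for any reason:

# sketch: a broad try/except around the import hides configuration errors
try:
    from legate import numpy as np
except Exception:
    import numpy as np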

I was able to observe a slow-down (although not as extreme as what either of you were reporting before). The problem came in the creation of array views. One of our caching data structures was not firing and as a result we were dumping a bunch of unnecessary partitioning calls down onto the Legion runtime. I pushed a fix with e24dbddbb6. When I run now, I see the following time needed to do 150K iterations:

mbauer:/local/home/mbauer/LegateNumPy$ ../LegateCore/install/bin/legate laplace.py --cpus 1
l2norm =  1.1475736468288804e-07
icount =  150000
Total Iteration Time =  741.8952921638265
   Time per iteration =  0.00494596861442551
Elapsed time =  742.4770996570587  secs

That's about 3.3X slower than vanilla NumPy on the same machine. I looked at some profiling and I think this is as fast as the array-view creation code will go for the moment. There are some features planned for the Legate ecosystem as a whole that will make this considerably faster in the 6 month time frame, but until that time, my original point about the size of workloads still stands. The ability to make "views" that work on big distributed systems doesn't come for free. Those overheads need to be amortized over the size of the data being processed, and right now the lower bar for that is in the 100s of MBs to GBs of data per node at a minimum.

piyueh commented 3 years ago

After e24dbddbb6a0aad483a8a96c513b2d05fd70e073, I can see the performance improvement. Here are my results for vanilla NumPy and Legate NumPy (run up to 150000 iterations; mesh size nx = ny = 401):

The first thing I noticed is that, in the Legate case, the elapsed time is much higher than the iteration time. From my understanding this is reasonable, because Legate's task execution is not synchronized with Python's native time.perf_counter: when time.perf_counter measures the time for each iteration, the calculations of that iteration may not actually be done yet.

However, now I'm not sure my understanding is correct, because I don't see the same behavior in @lightsighter's results. This makes me think there's something wrong with my Legate installation or configuration. @lightsighter Any suggestions on what I should check?
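To illustrate what I mean: for the per-iteration timing to include the actual computation, the loop would have to block before stopping the timer, something like the sketch below (this assumes that pulling a Python scalar out of the 0-d result forces the deferred work to complete; I haven't verified that against Legate's internals):

while l2norm > l2_target:
    start = time.perf_counter()
    icount += 1
    pn = p.copy()
    p[1:-1, 1:-1] = 0.25 * (pn[1:-1, 2:] + pn[1:-1, :-2]
                            + pn[2:, 1:-1] + pn[:-2, 1:-1])
    p[1:-1, -1] = p[1:-1, -2]
    l2norm = float(L2_error(p, pn))  # assumed to be a blocking scalar read
    end = time.perf_counter()
    tot_time += end - start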

Here are my profiling results from only 1000 iterations, in case they are helpful: legate_prof.tar.gz

manopapad commented 3 years ago

I looked into this a bit. The size of the original example (401x401) is large enough to trigger accelerated execution, but the amount of computation is not enough to cover the overhead of going through Legate. Also, the main loop checks for convergence on every iteration, meaning that the Legion runtime can't run ahead to schedule computation, so we're not getting any pipelining of runtime analysis and kernel execution.
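The restructuring amounts to hoisting the blocking convergence check out of the inner loop; a minimal sketch of the pattern (the gist linked below has the actual code I ran):

def laplace2d_batched(p, l2_target, check_interval=100):
    # run a batch of Jacobi updates, then do a single (blocking)
    # convergence check, so the runtime can queue work ahead
    l2norm = 1.0
    pn = np.empty_like(p)
    while l2norm > l2_target:
        for _ in range(check_interval):
            pn = p.copy()
            p[1:-1, 1:-1] = 0.25 * (pn[1:-1, 2:] + pn[1:-1, :-2]
                                    + pn[2:, 1:-1] + pn[:-2, 1:-1])
            p[1:-1, -1] = p[1:-1, -2]
        l2norm = L2_error(p, pn)  # only one blocking reduction per batch
    return p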

I changed the code to do the convergence check every 100 iterations and increased the size of the array to 5001 x 5001 (see https://gist.github.com/manopapad/f9d0af92dcf008b7d30d316d466c6baa), and now the times look as follows:

manopapad commented 3 years ago

Does anyone have more to say on this? If not then we can close this issue.

laytonjbgmail commented 3 years ago

Apologies - I got distracted and lost track of the responses.

I do have a question - based on your tests, it appears that running on the GPU is slower than running on the CPU. Is this your observation too?

Thanks!

Jeff


manopapad commented 3 years ago

After increasing the size of the array (so that there's enough computation per iteration to cover the overhead of going through Legate) and doing the convergence check less frequently (so the main code has some leeway to "run ahead" and schedule work for the GPU) I get 14.76 iterations/sec on a single CPU vs 233.61 iterations/sec on a single GPU with legate, i.e. about 16x faster on a GPU.

laytonjbgmail commented 3 years ago

What was the final problem size and how often did you do the convergence check?

Thanks!

Jeff


manopapad commented 3 years ago

See https://github.com/nv-legate/legate.numpy/issues/29#issuecomment-855297214.

laytonjbgmail commented 3 years ago

Thanks. I hate to be dense, but I didn't see those numbers in your previous comment. To me it looked like the previous results gave roughly the same total time for the CPU and GPU runs. Can you explain this last set of results compared to the previous set?

Thanks!


manopapad commented 3 years ago

My two comments (https://github.com/nv-legate/legate.numpy/issues/29#issuecomment-855297214 and https://github.com/nv-legate/legate.numpy/issues/29#issuecomment-888558105) refer to the same runs. I ran both the CPU test and the GPU test for a similar amount of time, but they completed a different number of iterations:

single CPU: 500 iterations in 33.882 s => 14.76 iterations/sec
single GPU: 10000 iterations in 42.807 s => 233.61 iterations/sec

laytonjbgmail commented 3 years ago

Apologies. I didn't look at the number of iterations (don't worry - I slapped my forehead a few times). Thanks for the clarification.

Nothing more from me on the subject.

Thanks for all of your help!


manopapad commented 3 years ago

No worries, thanks for your interest in Legate!