I haven't run this script, but I highly suspect the problem is caused by this line
p[1:-1, -1] = p[1:-1, -2]
If so, this is related to issue #16.
Update: Issue #16 indeed affects the iterations, but not the performance. The solution converges to a wrong answer because issue #16 results in the wrong boundary condition at x=L. Nevertheless, it still converges to something with a convergence history similar to the correct one. So the performance issue is caused by something else.
I actually don't think issue #16 is impacting correctness in this case. This is a basic indexing copy on both sides and we get the transforms right in those cases. It's only when you mix in advanced indexing that we run into problems like those in #16.
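For readers unfamiliar with the distinction, here is a minimal NumPy illustration of basic slicing versus advanced indexing (the claim that Legate handles the basic-slicing case correctly comes from the comment above, not from anything verified here):

import numpy as np  # for illustration; the same slicing syntax applies to legate.numpy

p = np.zeros((5, 5))
p[1:-1, -2] = 1.0

# Basic indexing: both sides are plain slices, as in the Laplace boundary update above.
p[1:-1, -1] = p[1:-1, -2]

# Advanced indexing: an integer-array index mixed with a slice is the kind of
# pattern that issue #16 is about.
rows = np.array([1, 2, 3])
p[rows, -1] = p[rows, -2]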
That said, I've been unable to reproduce this issue as described. I ran the code above with a couple of different configurations and couldn't make it take more than 4 minutes. In all cases we converged in the same number of iterations as the original implementation with the same L2-norm value.
mbauer:/local/home/mbauer/LegateNumPy$ ../LegateCore/install/bin/legate laplace.py --cpus 1
l2norm = 9.99986062249016e-09
icount = 153539
Total Iteration Time = 223.62144905142486
Time per iteration = 0.0014564472157004074
Elapsed time = 223.82319235801697 secs
mbauer:/local/home/mbauer/LegateNumPy$ ../LegateCore/install/bin/legate laplace.py --cpus 4
l2norm = 9.99986062249016e-09
icount = 153539
Total Iteration Time = 220.62398342695087
Time per iteration = 0.0014369247124636144
Elapsed time = 220.76576781272888 secs
mbauer:/local/home/mbauer/LegateNumPy$ ../LegateCore/install/bin/legate laplace.py --gpus 1
l2norm = 9.99986062249016e-09
icount = 153539
Total Iteration Time = 215.94193586241454
Time per iteration = 0.001406430521642153
Elapsed time = 216.0937101840973 secs
While this is certainly slower than NumPy, it's not the >15 minutes originally reported. This kind of slow-down (~2-3X) compared to NumPy is actually expected behavior for Legate on tiny problems like this. The extra levels of indirection needed to allow Legate to scale to thousands of GPUs come with some overhead, and on small problems like this that overhead dominates performance.
Legate really won't become efficient until it's processing 100s of MBs of data, and even then, we'll do much better when we're handling at least a few GBs of data on every node. In fact, I'm pretty sure this problem isn't even big enough to trigger execution with Legion. I think our extra level of indirection is just looking at the data and then turning around and handing it straight back to NumPy, but the cost of doing that logic on every NumPy operation is about 2-3X more than even running the math for the operation in this case, hence the slowdown.
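For a sense of scale, a quick back-of-the-envelope check (assuming the default float64 dtype) shows how small the working set is:

nx = ny = 401
bytes_per_element = 8                      # float64
mb = nx * ny * bytes_per_element / 1e6
print(f"{mb:.2f} MB per array")            # ~1.29 MB, far below the 100s of MB per node
                                           # where Legate's overheads start to amortize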
My test result is similar to @laytonjbgmail's original post. After changing import numpy as np to from legate import numpy as np, the run becomes very slow. When using vanilla NumPy and nx = ny = 51, the run took 0.23 seconds, while the Legate version took 27 seconds with legate --cpus 1 laplace.py. I didn't test with the original mesh size (i.e., nx = ny = 401) because it took too long to finish.
Not only is the run slow, but the solution is also wrong. Adding the assertion assert np.allclose(p[1:-1, -1], p[1:-1, -2]) (either after p = laplace2d(...) or after p[1:-1, -1] = p[1:-1, -2]) raises an error and shows that the copy wasn't successful. In addition, when comparing the result to the analytical solution (not shown in the original post), the relative L2 error is 35%, while that of vanilla NumPy is just 1.8% (when the mesh size is nx = ny = 51).
@lightsighter If you want to try my version of the test, here is the link: https://gist.github.com/piyueh/c426132063f398b0557bb7c0ee01c572
My version basically just adds a check function to compare the result with the analytical solution. (This specific Laplace equation has an analytical solution.)
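For reference, a minimal sketch of that kind of check (this is not the code from the gist; the reference solution p_ref, e.g. the analytical solution or a converged vanilla-NumPy result, is assumed to be available):

import numpy as np

def check(p, p_ref):
    """Verify the Neumann boundary copy and report the relative L2 error."""
    # The last two columns should match if p[1:-1, -1] = p[1:-1, -2] worked.
    assert np.allclose(p[1:-1, -1], p[1:-1, -2]), "boundary copy at x=L failed"
    rel_l2 = np.sqrt(np.sum((p - p_ref)**2) / np.sum(p_ref**2))
    print(f"relative L2 error: {rel_l2:.2%}")
    return rel_l2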
I was previously unable to reproduce the issue because my try-catch block wrapping my import statement of Legate NumPy was masking a configuration error.
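The exact code isn't shown here, but one common pattern that behaves this way is a broad fallback around the import (in Python this is a try/except rather than a try-catch); if the Legate import fails for any reason, the configuration error is swallowed and the script quietly runs something else. A sketch of that pattern:

try:
    from legate import numpy as np   # may raise if Legate is misconfigured
except Exception:
    import numpy as np               # silent fallback: the configuration error is swallowed

print(np.__name__)                   # makes the fallback visible: 'legate.numpy' vs 'numpy'

Printing which module was actually imported is a cheap way to avoid benchmarking the wrong library by accident.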
I was able to observe a slow-down (although not as extreme as what either of you were reporting before). The problem came in the creation of array views. One of our caching data structures was not firing and as a result we were dumping a bunch of unnecessary partitioning calls down onto the Legion runtime. I pushed a fix with e24dbddbb6. When I run now, I see the following time needed to do 150K iterations:
mbauer:/local/home/mbauer/LegateNumPy$ ../LegateCore/install/bin/legate laplace.py --cpus 1
l2norm = 1.1475736468288804e-07
icount = 150000
Total Iteration Time = 741.8952921638265
Time per iteration = 0.00494596861442551
Elapsed time = 742.4770996570587 secs
That's about 3.3X slower than vanilla NumPy on the same machine. I looked at some profiling and I think this is as fast as the array-view creation code will go for the moment. There are some features planned for the Legate ecosystem as a whole that will make this considerably faster in the 6 month time frame, but until that time, my original point about the size of workloads still stands. The ability to make "views" that work on big distributed systems doesn't come for free. Those overheads need to be amortized over the size of the data being processed, and right now the lower bar for that is in the 100s of MBs to GBs of data per node at a minimum.
After e24dbddbb6a0aad483a8a96c513b2d05fd70e073, I can see the performance improvement. Here are my results with vanilla NumPy and Legate NumPy (run for 150000 iterations; mesh size nx = ny = 401):
Vanilla NumPy:
l2norm = 1.1833319405787581e-08
icount = 150000
Total Iteration Time = 130.02262105768023
Time per iteration = 0.0008668174737178682
Elapsed time = 130.13445234298706 secs
Legate NumPy:
l2norm = 1.1475736468288804e-07
icount = 150000
Total Iteration Time = 319.63438990681607
Time per iteration = 0.002130895932712107
Elapsed time = 1243.906419992447 secs
The first thing I noticed is that, in the Legate case, the elapsed time is much higher than the iteration time. I think this is reasonable, because from my understanding Legate's task execution is not synchronized with Python's native time.perf_counter; so when time.perf_counter measures the time of each iteration, the calculations of that iteration are actually not done yet.
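One way to probe this hypothesis is the following minimal sketch, which assumes that asking for a concrete Python value (e.g. converting a result to float) forces any deferred work to complete:

import time
from legate import numpy as np

a = np.ones((401, 401))

start = time.perf_counter()
b = np.sqrt(np.sum(a * a))      # with deferred execution this may only enqueue the work
mid = time.perf_counter()
val = float(b)                  # requesting a Python scalar should block until the result is ready
end = time.perf_counter()

print(f"after the expression: {mid - start:.4f}s, after forcing the value: {end - mid:.4f}s")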
However, now I'm not sure my understanding is correct. I don't see the same behavior in @lightsighter's results, which makes me think there's something wrong with my Legate installation or configuration. @lightsighter Any suggestion on what I should check?
Here are my profiling results for only 1000 iterations, not sure if they are helpful: legate_prof.tar.gz
I looked into this a bit. The size of the original example (401x401) is large enough to trigger accelerated execution, but the amount of computation is not enough to cover the overhead of going through Legate. Also, the main loop checks for convergence on every iteration, meaning that the Legion runtime can't run ahead to schedule computation, so we're not getting any pipelining of runtime analysis -> kernel execution.
I changed the code to do the convergence check every 100 iterations and increased the size of the array to 5001 x 5001 (see https://gist.github.com/manopapad/f9d0af92dcf008b7d30d316d466c6baa), and timed the following configurations:
legate --sysmem 50000 --cpus 32 29.py
legate --sysmem 10000 --numamem 10000 --omps 2 --ompthreads 16 29.py
legate --sysmem 10000 --gpus 1 --fbmem 14500 29.py
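A minimal sketch of that change (an illustration of the idea, not the code from the gist), checking the residual only every 100 iterations so the runtime is free to queue work ahead:

import numpy as np   # or: from legate import numpy as np

def laplace2d(p, l2_target, check_every=100):
    """Jacobi iteration on a uniform grid (dx == dy), checking convergence periodically."""
    l2norm = 1.0
    it = 0
    while l2norm > l2_target:
        pn = p.copy()
        p[1:-1, 1:-1] = 0.25 * (pn[1:-1, 2:] + pn[1:-1, :-2] +
                                pn[2:, 1:-1] + pn[:-2, 1:-1])
        p[1:-1, -1] = p[1:-1, -2]   # Neumann condition at x = L
        it += 1
        # Only synchronize on the residual occasionally, so the runtime can
        # schedule the intervening iterations ahead of time.
        if it % check_every == 0:
            l2norm = np.sqrt(np.sum((p - pn)**2) / np.sum(pn**2))
    return p, it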
Does anyone have more to say on this? If not then we can close this issue.
Apologies - I got distracted from the responses.
I do have a question - based on your tests, it appears that running on the GPU is slower than running on the CPU. Is this your observation too?
Thanks!
Jeff
After increasing the size of the array (so that there's enough computation per iteration to cover the overhead of going through Legate) and doing the convergence check less frequently (so the main code has some leeway to "run ahead" and schedule work for the GPU) I get 14.76 iterations/sec on a single CPU vs 233.61 iterations/sec on a single GPU with legate, i.e. about 16x faster on a GPU.
What was the final problem size and how often did you do the convergence check?
Thanks!
Jeff
Thanks. I hate to be dense, but I didn't see those numbers in your previous comment. To me it looked like the previous results gave the same total time for CPU and GPU. Can you explain this last set of results compared to the previous set?
Thanks!
My two comments (https://github.com/nv-legate/legate.numpy/issues/29#issuecomment-855297214 and https://github.com/nv-legate/legate.numpy/issues/29#issuecomment-888558105) refer to the same runs. I ran both the CPU test and the GPU test for a similar amount of time, but they completed a different number of iterations:
single CPU: 500 iterations in 33.882s => 14.76 iterations/sec
single GPU: 10000 iterations in 42.807s => 233.61 iterations/sec
Apologies. I didn't look at the number of iterations (don't worry - I slapped my forehead a few times). Thanks for the clarification.
Nothing more from me on the subject.
Thanks for all of your help!
No worries, thanks for your interest in Legate!
I've been testing a simple Laplace Eq. solver to compare Python+Numpy to legate.numpy and legate is hugely slower than Numpy.
The code is taken from: https://barbagroup.github.io/essential_skills_RRC/laplace/1/ . The code I actually run is the following:
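The script itself is not reproduced in this copy of the thread; below is a rough sketch reconstructed from the linked tutorial. The boundary setup, the 1e-8 tolerance, and the icount variable are assumptions inferred from the tutorial and from the output printed elsewhere in this thread; the actual script may differ in details.

import time
import numpy as np   # the one line later changed to: from legate import numpy as np

def laplace2d(p, l2_target):
    """Jacobi iterations (uniform grid, dx == dy) until the relative L2 change drops below l2_target."""
    l2norm = 1.0
    icount = 0
    while l2norm > l2_target:
        pn = p.copy()
        p[1:-1, 1:-1] = 0.25 * (pn[1:-1, 2:] + pn[1:-1, :-2] +
                                pn[2:, 1:-1] + pn[:-2, 1:-1])
        p[1:-1, -1] = p[1:-1, -2]            # dp/dx = 0 on the x = L boundary
        l2norm = np.sqrt(np.sum((p - pn)**2) / np.sum(pn**2))
        icount += 1
    return p, icount, l2norm

nx = ny = 401
p = np.zeros((ny, nx))
x = np.linspace(0, 1, nx)
p[-1, :] = np.sin(1.5 * np.pi * x)           # Dirichlet condition on the y = L boundary

start = time.perf_counter()
p, icount, l2norm = laplace2d(p, 1e-8)
elapsed = time.perf_counter() - start

print(f"l2norm = {l2norm}")
print(f"icount = {icount}")
print(f"Elapsed time = {elapsed} secs")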
When I run it on my laptop with Anaconda Python3 and Numpy I get the following:
When I change the import line to legate.numpy, I usually stop the code after 15 minutes of wall time. I have let it run for up to 60 minutes and it never converges.
As a check, I've run the Numpy code with legate itself and it exactly matches the Numpy results.
I have been experimenting with replacing the l2norm computations with numpy specific functions (np.subtract, np.square, etc.) but I have achieved no increase in performance.
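For reference, that kind of rewrite looks roughly like the following sketch; both forms compute the same relative L2 norm, which is consistent with seeing no performance change from the rewrite alone:

import numpy as np

def l2norm(p, pn):
    # Expression form, as in the tutorial:
    #   np.sqrt(np.sum((p - pn)**2) / np.sum(pn**2))
    # Equivalent form spelled out with explicit ufunc calls:
    diff = np.subtract(p, pn)
    return np.sqrt(np.divide(np.sum(np.square(diff)), np.sum(np.square(pn))))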
Does anyone have any recommendations?
Thanks!
Jeff
(edit by Manolis: added some formatting for the code sections)