Open hmaarrfk opened 1 year ago
Thanks for the detailed report! I'm a bit puzzled by the result. I set up the following test, inspired by your test case:
python -m timeit -s 'from pack import pack_my_pythran_loop; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_loop(uv)'
with numpy (loop): 436 msec per loop
with numpy (assign): 2.12 msec per loop
with pythran (loop): 755 usec per loop
with pythran (assign): 2.14 msec per loop
with numba (loop): 1.05 msec
with numba (assign): 1.16 msec
So I see an issue with the assign part (and I can work on it!) but none with the loop :-/
For full reproducibility, I probably owe you information about my conda environment, and compiler versions.
I was travelling when I wrote the report, so it wasn't so easy for me to spin up new, clean environments.
I'll try your little test when I get back on a few machines.
$ python -m timeit -s 'from pack import pack_my_pythran_loop; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_loop(uv)'
200 loops, best of 5: 1.88 msec per loop
(pythran) ✔ ~/git/pythran
$ python -m timeit -s 'from pack import pack_my_pythran_assign; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_assign(uv)'
100 loops, best of 5: 3.01 msec per loop
I included the output of pythran -E as well.
I should also note that the "size" of the array here is kinda huge compared to what I see many benchmarks use.
Often I see that people benchmark with 256 x 256 images or 512 x 512 images. These images are less than 1MB in size (with uint8 precision).
However, this image, encoded in yuv420p is more than 13MB.
It doesn't all fit in my cache at once (but the uv part might): https://ark.intel.com/content/www/us/en/ark/products/208663/intel-core-i71180g7-processor-12m-cache-up-to-4-60-ghz-with-ipu.html
It might fit in yours if you have a machine with a larger cache, which may explain some of the differences in our results.
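For reference, a quick back-of-the-envelope on the sizes involved (assuming the usual yuv420p layout: one full-resolution Y plane plus quarter-resolution U and V planes, i.e. 1.5 bytes per pixel):

```python
# Sizes for a 3072x3072 yuv420p frame.
h, w = 3072, 3072
y_bytes = h * w             # full-resolution luma plane
uv_bytes = h * w // 2       # U + V chroma, quarter resolution each
total_bytes = y_bytes + uv_bytes

print(total_bytes / 2**20)  # 13.5 MiB: larger than a 12 MiB L3 cache
print(uv_bytes / 2**20)     # 4.5 MiB: the uv part alone can fit
```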
Thanks for the extra piece of information. Does #2097 improve the situation? It does on my setup, making pack_my_pythran_assign
as fast as the numpy version.
EDIT: this PR doesn't pass full validation yet, but it should be ok for our concern. EDIT²: #2097 is now fully functional, waiting for your feedback before merging it.
Unfortunately, it does not resolve things.
Is there any intermediate output I can give you to debug things? compiler info and whatnot? pack_pythran_dev_2097.zip
Hell! A funny side effect of this track is that I'm fixing a lot of performance issues (some very big ones) that show up in various numpy benchmarks :-)
Wow, that's great news!
@jeanlaroche : you can check PR #2096, #2097 and #2096, they are likely to be merged soon.
@hmaarrfk : thanks for the archive - I can reproduce. An analysis of the generated assembly shows twice as many mov instructions in the pythran-generated code compared to numpy's; I'll investigate.
https://github.com/serge-sans-paille/pythran/pull/2096: no problem.
https://github.com/serge-sans-paille/pythran/pull/2097: no problem.
https://github.com/serge-sans-paille/pythran/pull/2098 is not building:
In file included from /Users/jlaroche/packages/system/algo/Transients/transient_api.cpp:4: /Users/jlaroche/packages/system/algo/build/darwin/Debug-Individual-Xcode/generated/transient.cpp:999:55: error: no matching function for call to 'call' typename pythonic::assignable_noescape<decltype(pythonic::types::call(transient_tf_bridge::tflite_bridge::runModel(), 0L, Input_x, std::get<1>(pythonic::types::as_const(self))))>::type y = pythonic::types::call(transient_tf_bridge::tflite_bridge::runM...
That last one isn't in your message where you repeat 2096... and the error could be because of some of my code. Can you confirm you're planning to merge 2098 as well?
Jean
Short notice: the following numpy version achieves the same result with half the memory:
def pack_my_pythran_assign(uv):
    u_size = uv.shape[0] // 2
    uv_lo = uv[:u_size].copy()
    uv[1::2] = uv[u_size:]
    uv[0::2] = uv_lo
No significant impact on runtime from the pythran perspective on my setup, though.
EDIT: it actually brings a nice speedup to the pythran-compiled version. Could you give it a try? (from the feature/faster-gexpr branch)
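A quick way to sanity-check this half-memory in-place variant is to compare it against a plain out-of-place interleave (function names here are illustrative, not from the thread):

```python
import numpy as np

def pack_inplace(uv):
    # In-place interleave [u..., v...] -> [u0, v0, u1, v1, ...],
    # copying only the first half of the buffer.
    u_size = uv.shape[0] // 2
    uv_lo = uv[:u_size].copy()
    uv[1::2] = uv[u_size:]
    uv[0::2] = uv_lo
    return uv

def pack_reference(uv):
    # Out-of-place reference: interleave into a fresh buffer.
    out = np.empty_like(uv)
    u_size = uv.shape[0] // 2
    out[0::2] = uv[:u_size]
    out[1::2] = uv[u_size:]
    return out

uv = np.arange(8, dtype=np.uint8)
assert np.array_equal(pack_inplace(uv.copy()), pack_reference(uv))
print(pack_inplace(uv.copy()))  # [0 4 1 5 2 6 3 7]
```

Note that `uv[1::2] = uv[u_size:]` assigns between overlapping views of the same buffer; numpy detects the overlap and copies as needed, which is exactly the aliasing cost discussed below.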
Away from my computer at the moment, but at some point I found mixed results with this approach. My hunch was that it was due to memory aliasing.
Looking into things in the past, I found that x86 has some interesting instructions that skip the cache. I didn't know if pythran could be optimized for that, and I wanted to avoid speedups that were due to that optimization.
Looking into this now that I have a second, it seems that there is no action item on my part. Keep me posted.
Pythran is not doing anything cache-specific. I can get a good speedup if I remove the runtime check for aliasing between lhs and rhs when assigning between slices, but I currently fail at doing so in an elegant way.
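To make the aliasing concrete: in the in-place variant, both slice assignments have a destination that shares memory with its source, which is exactly what such a runtime check has to detect. A small numpy illustration:

```python
import numpy as np

uv = np.zeros(16, dtype=np.uint8)
u_size = uv.shape[0] // 2

# Both assignments in the in-place version read and write
# overlapping regions of the same underlying buffer.
print(np.shares_memory(uv[1::2], uv[u_size:]))   # True
print(np.shares_memory(uv[0::2], uv[:u_size]))   # True
```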
As for the memory copy, I'm mostly talking about my results from:
https://github.com/awreece/memory-bandwidth-demo
./memory_profiler
read_memory_rep_lodsq: 12.19 GiB/s
read_memory_loop: 19.69 GiB/s
read_memory_sse: 18.63 GiB/s
read_memory_avx: 19.02 GiB/s
read_memory_prefetch_avx: 16.90 GiB/s
write_memory_loop: 13.23 GiB/s
write_memory_rep_stosq: 39.86 GiB/s
write_memory_sse: 13.97 GiB/s
write_memory_nontemporal_sse: 43.03 GiB/s
write_memory_avx: 13.26 GiB/s
write_memory_nontemporal_avx: 47.87 GiB/s
write_memory_memset: 39.80 GiB/s
read_memory_rep_lodsq_omp: 49.37 GiB/s
read_memory_loop_omp: 52.74 GiB/s
read_memory_sse_omp: 54.41 GiB/s
read_memory_avx_omp: 52.21 GiB/s
read_memory_prefetch_avx_omp: 47.91 GiB/s
write_memory_loop_omp: 31.18 GiB/s
write_memory_rep_stosq_omp: 55.99 GiB/s
write_memory_sse_omp: 22.40 GiB/s
write_memory_nontemporal_sse_omp: 56.80 GiB/s
write_memory_avx_omp: 31.22 GiB/s
write_memory_nontemporal_avx_omp: 54.89 GiB/s
write_memory_memset_omp: 54.15 GiB/s
I feel like this is a "future improvement", maybe.
I'm not a fan of using "threading" for speedups, but the nontemporal_avx result shows an impressive speedup.
Hey, thanks for making this cool library. I really do believe that the advantages you outline in terms of ahead-of-time compilation are valuable to those building powerful scientific computation libraries.
I was trying my hand at doing a "simple" image analysis task: rgb -> nv12 conversion.
nv12 seems to be similar to yuv420 (I420) but with the U and V channels interleaved instead of on distinct planes.
This should be a simple transpose operation, but, as is typical, it is easy to do this operation very slowly depending on how you go about it.
The operation amounts to something like transposing a 2D array, from
[u_0, u_1, ... u_n, v_0, v_1, ... v_n]
to
[u_0, v_0, u_1, v_1, ... u_n, v_n].
To set up the problem, let's start with:
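The setup snippet is elided in this copy of the report; judging from the timeit invocations quoted earlier in the thread, it was presumably along these lines:

```python
import numpy as np

# Test data matching the timeit setup used elsewhere in the thread:
# the uv plane of a 3072x3072 yuv420p image, flattened to 1-D.
image_shape = (3072, 3072)
uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),
             fill_value=0, dtype=np.uint8)
uv.shape = (-1,)
print(uv.shape)  # (4718592,)
```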
For some reason, I need to do the operation in place. In numpy, this would amount to:
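The numpy snippet itself is elided here; a reconstruction of what an in-place slice-assignment version plausibly looked like (a sketch, not the author's exact code):

```python
import numpy as np

def pack_my_assign(uv):
    # Copy the planar [u..., v...] buffer, then write u into the even
    # slots and v into the odd slots of the original array, in place.
    uv_planar = uv.copy()
    u_size = uv.shape[0] // 2
    uv[0::2] = uv_planar[:u_size]
    uv[1::2] = uv_planar[u_size:]
    return uv

print(pack_my_assign(np.arange(8, dtype=np.uint8)))  # [0 4 1 5 2 6 3 7]
```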
On my computer, I get about 700-750 iterations per second with this loop:
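The exact loop is also elided in this copy; an elementwise version of the same interleave, which is the shape of loop that pythran and numba compile well (it would be far too slow in pure Python), might look like:

```python
import numpy as np

def pack_my_loop(uv):
    # Same interleave as the slice-assignment version, but with an
    # explicit element-by-element loop over a planar copy.
    uv_planar = uv.copy()
    u_size = uv.shape[0] // 2
    for i in range(u_size):
        uv[2 * i] = uv_planar[i]
        uv[2 * i + 1] = uv_planar[u_size + i]
    return uv
```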
As a bound, I estimated that the "upper bound of performance" would be achieved with:
This achieves 1184 iterations/sec! Pretty good. I was hoping that pythran could help reach that level.
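The upper-bound snippet is elided above; a plausible stand-in (an assumption on my part, not the original code) is an interleave into a separate preallocated buffer, which drops the in-place constraint and, with it, the aliasing concerns:

```python
import numpy as np

def pack_out_of_place(uv, out):
    # With a distinct destination buffer there is no aliasing:
    # each byte is read once and written once.
    u_size = uv.shape[0] // 2
    out[0::2] = uv[:u_size]
    out[1::2] = uv[u_size:]
    return out

uv = np.arange(8, dtype=np.uint8)
out = np.empty_like(uv)
print(pack_out_of_place(uv, out))  # [0 4 1 5 2 6 3 7]
```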
I tried two different ways of doing this:
pack_my_pythran_assign achieves 315 its/second.
pack_my_pythran_loop achieves 460 its/second.
I also tried numba for completeness; it achieves pretty close to 600 its/second.
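The numba variant is also elided; it was presumably the same explicit loop under @njit, roughly like this sketch (with a fallback so the snippet runs even without numba installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # numba is optional here; fall back to plain (uncompiled) Python.
    def njit(f):
        return f

@njit
def pack_numba_loop(uv):
    # Interleave [u..., v...] -> [u0, v0, u1, v1, ...] in place.
    uv_planar = uv.copy()
    u_size = uv.shape[0] // 2
    for i in range(u_size):
        uv[2 * i] = uv_planar[i]
        uv[2 * i + 1] = uv_planar[u_size + i]
    return uv
```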
I mean, I'm not really expecting too much speedup here, but I figured that this was strange. I'm hoping that this little example can help improve this library.
Best.
Mark