serge-sans-paille / pythran

Ahead of Time compiler for numeric kernels
https://pythran.readthedocs.io
BSD 3-Clause "New" or "Revised" License

array assignment slower than numpy #2095

Open hmaarrfk opened 1 year ago

hmaarrfk commented 1 year ago

Hey, thanks for making this cool library. I really do believe that the advantages you outline in terms of ahead of time compilation are valuable to those building powerful scientific computation libraries.

I was trying my hand at doing a "simple" image analysis task: rgb -> nv12 conversion.

NV12 seems to be similar to YUV420 (I420), but with the U and V channels interleaved in a single plane instead of stored as two distinct planes.
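
For concreteness, here is a tiny sketch of the layout difference using hypothetical 2x2 chroma planes (an illustration only, not part of the actual conversion code):

import numpy as np

# Hypothetical flattened 2x2 chroma planes for a 4x4 image with 4:2:0 subsampling.
u = np.array([1, 2, 3, 4], dtype=np.uint8)
v = np.array([5, 6, 7, 8], dtype=np.uint8)

# I420 / yuv420: the U plane followed by the V plane.
i420_uv = np.concatenate([u, v])          # [1 2 3 4 5 6 7 8]

# NV12: a single UV plane with U and V interleaved per sample.
nv12_uv = np.empty(u.size + v.size, dtype=np.uint8)
nv12_uv[0::2] = u
nv12_uv[1::2] = v                         # [1 5 2 6 3 7 4 8]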

This should be a simple transpose-like operation but, as is typical, it is easy to do it very slowly depending on how you implement it.

mark@nano 
--------- 
OS: Ubuntu 22.10 x86_64 
Host: 20UN005JUS ThinkPad X1 Nano Gen 1 
Kernel: 5.19.0-38-generic 
Uptime: 7 days, 1 hour, 43 mins 
Packages: 4010 (dpkg), 23 (flatpak), 12 (snap) 
Shell: bash 5.2.2 
Resolution: 2160x1350 
DE: GNOME 43.1 
WM: Mutter 
WM Theme: Adwaita 
Theme: Yaru-viridian [GTK2/3] 
Icons: Yaru-viridian [GTK2/3] 
Terminal: kitty 
CPU: 11th Gen Intel i7-1180G7 (8) @ 4.600GHz 
GPU: Intel Tiger Lake-UP4 GT2 [Iris Xe Graphics] 
Memory: 6991MiB / 15704MiB

The operation amounts to something like transposing a 2D array: from [u_0, u_1, ..., u_n, v_0, v_1, ..., v_n] to [u_0, v_0, u_1, v_1, ..., u_n, v_n].

To set up the problem, let's start with:

import numpy as np
N_iter = 200
image_shape = (3072, 3072)

uv = np.full(
    (image_shape[0] // 4 * 2, image_shape[1]),
    fill_value=0,
    dtype='uint8'
)
uv.shape = (-1,)  # flatten in place, without copying

For some reason, I need to do the operation in place. In numpy, this would amount to:

import numpy as np
def pack_np(uv):
    uv_copy = uv.copy()
    u_size = uv.shape[0] // 2
    uv[0::2] = uv_copy[:u_size]  # U samples go to the even indices
    uv[1::2] = uv_copy[u_size:]  # V samples go to the odd indices
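
As a quick sanity check on a small array (a tiny example, not part of the benchmark), this produces the interleaved order:

import numpy as np

demo = np.array([10, 11, 12, 20, 21, 22], dtype=np.uint8)  # [U..., V...]
pack_np(demo)
print(demo)  # expected: [10 20 11 21 12 22]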

On my computer, I get about 700-750 iterations per second with this loop:

from tqdm import tqdm
N_iter = 200
for i in tqdm(range(N_iter)):
    pack_np(uv)

As a bound, I estimated that the "upper bound of performance" would be achieved with:

def double_copy(uv):
    # One full copy followed by two full contiguous writes, as a proxy for the
    # data movement of the packing operation (an optimistic upper bound).
    uv_copy = uv.copy()
    u_size = uv.shape[0] // 2
    uv[...] = uv_copy
    uv[...] = uv_copy

for i in tqdm(range(N_iter)):
    double_copy(uv)

This achieves 1184 iterations/sec! Pretty good. I was hoping that pythran could help me get to that level.
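
As a rough back-of-the-envelope (assuming every byte actually travels to and from DRAM, and ignoring caching), that rate corresponds to roughly 30 GiB/s of memory traffic:

uv_bytes = 3072 // 4 * 2 * 3072     # ~4.5 MiB UV buffer
traffic_per_call = 6 * uv_bytes     # copy (read + write) + two assignments (read + write each)
rate = 1184                         # double_copy iterations per second
print(traffic_per_call * rate / 2**30, "GiB/s")  # ~31 GiB/s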

I tried two different ways of doing this:

import pythran
%load_ext pythran.magic
%%pythran
import numpy as np

# pythran export pack_my_pythran_loop(uint8[:])
def pack_my_pythran_loop(uv):
    u_size = uv.shape[0] // 2

    uv_copy = uv.copy()
    for i in range(u_size):
        uv[i * 2 + 0] = uv_copy[i]
        uv[i * 2 + 1] = uv_copy[u_size + i]

# pythran export pack_my_pythran_assign(uint8[:])
def pack_my_pythran_assign(uv):
    u_size = uv.shape[0] // 2
    uv_copy =  uv.copy()
    uv[0::2] = uv_copy[:u_size]
    uv[1::2] = uv_copy[u_size:]
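
As an aside (and an assumption on my part), the pack module used with timeit later in this thread can be produced by compiling the same functions ahead of time from a file:

# pack.py would contain the two functions above, including their "# pythran export" lines.
# Compiling ahead of time then gives a native extension module importable as "pack":
#     pythran pack.py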
For comparison, I also tried the same loop with numba:

from numba import njit as jit
@jit
def pack_numba(uv):
    u_size = uv.shape[0] // 2

    uv_copy = uv.copy()
    for i in range(u_size):
        uv[i * 2 + 0] = uv_copy[i]
        uv[i * 2 + 1] = uv_copy[u_size + i]

pack_numba(uv)
for i in tqdm(range(N_iter)):
    pack_numba(uv)

The numba loop achieves pretty close to 600 iterations/second.

I mean, I'm not really expecting too much speedup here, but I figured that this was strange. I'm hoping that this little example can help improve this library.

Best.

Mark

library             iterations per second
numpy               700
double copy         1200
pythran assignment  315
pythran loop        450
numba loop          600
serge-sans-paille commented 1 year ago

Thanks for the detailed report! I'm a bit puzzled by the result. I set up the following test, inspired by your test case:

python -m timeit -s 'from pack import pack_my_pythran_loop; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_loop(uv)'

with numpy (loop): 436 msec per loop
with numpy (assign): 2.12 msec per loop
with pythran (loop): 755 usec per loop
with pythran (assign): 2.14 msec per loop
with numba (loop): 1.05 msec
with numba (assign): 1.16 msec

So I see an issue with the assign part (and I can work on it!) but none with the loop :-/

hmaarrfk commented 1 year ago

For full reproducibility, I probably owe you information about my conda environment and compiler versions.

I was travelling when I wrote the report so it wasn't so easy for me to spin up new, clean, environments.

I'll try your little test when I get back on a few machines.

hmaarrfk commented 1 year ago
$ python -m timeit -s 'from pack import pack_my_pythran_loop; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_loop(uv)'
200 loops, best of 5: 1.88 msec per loop
(pythran) ✔ ~/git/pythran 
$ python -m timeit -s 'from pack import pack_my_pythran_assign; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_assign(uv)'
100 loops, best of 5: 3.01 msec per loop
Environment before installing numba:

```
mamba create --name pythran pythran blas numpy --channel conda-forge --override-channels
```

```
# packages in environment at /home/mark/mambaforge/envs/pythran:
#
# Name                      Version              Build                 Channel
_libgcc_mutex               0.1                  conda_forge           conda-forge
_openmp_mutex               4.5                  2_kmp_llvm            conda-forge
beniget                     0.4.1                pyhd8ed1ab_0          conda-forge
binutils_impl_linux-64      2.39                 he00db2b_1            conda-forge
binutils_linux-64           2.39                 h5fc0e48_12           conda-forge
blas                        2.116                openblas              conda-forge
blas-devel                  3.9.0                16_linux64_openblas   conda-forge
bzip2                       1.0.8                h7f98852_4            conda-forge
ca-certificates             2022.12.7            ha878542_0            conda-forge
decorator                   5.1.1                pyhd8ed1ab_0          conda-forge
gast                        0.5.3                pyhd8ed1ab_0          conda-forge
gcc_impl_linux-64           11.3.0               hab1b70f_19           conda-forge
gcc_linux-64                11.3.0               he6f903b_12           conda-forge
gxx_impl_linux-64           11.3.0               hab1b70f_19           conda-forge
gxx_linux-64                11.3.0               hc203a17_12           conda-forge
kernel-headers_linux-64     2.6.32               he073ed8_15           conda-forge
ld_impl_linux-64            2.39                 hcc3a1bd_1            conda-forge
libblas                     3.9.0                16_linux64_openblas   conda-forge
libcblas                    3.9.0                16_linux64_openblas   conda-forge
libexpat                    2.5.0                hcb278e6_1            conda-forge
libffi                      3.4.2                h7f98852_5            conda-forge
libgcc-devel_linux-64       11.3.0               h210ce93_19           conda-forge
libgcc-ng                   12.2.0               h65d4601_19           conda-forge
libgfortran-ng              12.2.0               h69a702a_19           conda-forge
libgfortran5                12.2.0               h337968e_19           conda-forge
libgomp                     12.2.0               h65d4601_19           conda-forge
liblapack                   3.9.0                16_linux64_openblas   conda-forge
liblapacke                  3.9.0                16_linux64_openblas   conda-forge
libnsl                      2.0.0                h7f98852_0            conda-forge
libopenblas                 0.3.21               pthreads_h78a6416_3   conda-forge
libsanitizer                11.3.0               h239ccf8_19           conda-forge
libsqlite                   3.40.0               h753d276_0            conda-forge
libstdcxx-devel_linux-64    11.3.0               h210ce93_19           conda-forge
libstdcxx-ng                12.2.0               h46fd767_19           conda-forge
libuuid                     2.38.1               h0b41bf4_0            conda-forge
libzlib                     1.2.13               h166bdaf_4            conda-forge
llvm-openmp                 16.0.1               h417c0b6_0            conda-forge
ncurses                     6.3                  h27087fc_1            conda-forge
numpy                       1.24.2               py311h8e6699e_0       conda-forge
openblas                    0.3.21               pthreads_h320a7e8_3   conda-forge
openssl                     3.1.0                h0b41bf4_0            conda-forge
pip                         23.0.1               pyhd8ed1ab_0          conda-forge
ply                         3.11                 py_1                  conda-forge
python                      3.11.3               h2755cc3_0_cpython    conda-forge
python_abi                  3.11                 3_cp311               conda-forge
pythran                     0.12.1               py311hcb41070_0       conda-forge
readline                    8.2                  h8228510_1            conda-forge
setuptools                  67.6.1               pyhd8ed1ab_0          conda-forge
sysroot_linux-64            2.12                 he073ed8_15           conda-forge
tk                          8.6.12               h27826a3_0            conda-forge
tzdata                      2023c                h71feb2d_0            conda-forge
wheel                       0.40.0               pyhd8ed1ab_0          conda-forge
xsimd                       8.0.5                h4bd325d_0            conda-forge
xz                          5.2.6                h166bdaf_0            conda-forge
zstd                        1.5.2                h3eb15da_6            conda-forge
```

```
     active environment : pythran
    active env location : /home/mark/mambaforge/envs/pythran
            shell level : 1
       user config file : /home/mark/.condarc
 populated config files : /home/mark/mambaforge/.condarc
                          /home/mark/.condarc
          conda version : 23.3.1
    conda-build version : 3.24.0
         python version : 3.9.16.final.0
       virtual packages : __archspec=1=x86_64
                          __glibc=2.36=0
                          __linux=5.19.0=0
                          __unix=0=0
       base environment : /home/mark/mambaforge (writable)
      conda av data dir : /home/mark/mambaforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/mark/mambaforge/pkgs
                          /home/mark/.conda/pkgs
       envs directories : /home/mark/mambaforge/envs
                          /home/mark/.conda/envs
               platform : linux-64
             user-agent : conda/23.3.1 requests/2.28.2 CPython/3.9.16 Linux/5.19.0-38-generic ubuntu/22.10 glibc/2.36
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False
```

I included the output of pythran -E as well.

pythran_pack.zip

hmaarrfk commented 1 year ago

I should also note that the size of the array here is quite large compared to what I see many benchmarks use.

Often I see that people benchmark with 256 x 256 images or 512 x 512 images. These images are less than 1MB in size (with uint8 precision).

However, this image, encoded as yuv420p, is more than 13 MB.

It doesn't fit in my cache all at once (but the UV part might): https://ark.intel.com/content/www/us/en/ark/products/208663/intel-core-i71180g7-processor-12m-cache-up-to-4-60-ghz-with-ipu.html

It might fit in yours if you have a larger machine, which may explain some of the differences in our results.
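
A quick size calculation (just arithmetic, using the 12 MB L3 figure from the linked spec page):

h, w = 3072, 3072
full_yuv420 = h * w * 3 // 2   # 14,155,776 bytes: Y plus quarter-resolution U and V
uv_plane = h * w // 2          # 4,718,592 bytes: the UV part alone
l3_cache = 12 * 1000**2        # 12 MB L3 on the i7-1180G7
print(full_yuv420 > l3_cache, uv_plane < l3_cache)  # True True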

serge-sans-paille commented 1 year ago

Thanks for the extra piece of information. Does #2097 improve the situation? It does on my setup, making pack_my_pythran_assign as fast as the numpy version.

EDIT: this PR doesn't pass full validation yet, but it should be OK for our concern. EDIT²: #2097 is now fully functional; waiting for your feedback before merging it.

hmaarrfk commented 1 year ago

Unfortunately, it does not resolve things.

Is there any intermediate output I can give you to help debug things? Compiler info and whatnot? pack_pythran_dev_2097.zip

serge-sans-paille commented 1 year ago

Hell! A funny side effect of this track is that I'm fixing a lot of performance issues (some very big ones) that show up in various numpy benchmarks :-)

jeanlaroche commented 1 year ago

Wow, that's great news!


serge-sans-paille commented 1 year ago

@jeanlaroche : you can check PR #2096, #2097 and #2096, they are likely to be merged soon.

@hmaarrfk : thanks for the archive - I can reproduce. An analysis of the generated assembly shows twice as many mov instructions in the pythran-generated code compared to numpy's; I'll investigate.

jeanlaroche commented 1 year ago

https://github.com/serge-sans-paille/pythran/pull/2096: no problem.

https://github.com/serge-sans-paille/pythran/pull/2097 no problem

https://github.com/serge-sans-paille/pythran/pull/2098 is not building.

In file included from /Users/jlaroche/packages/system/algo/Transients/transient_api.cpp:4:
/Users/jlaroche/packages/system/algo/build/darwin/Debug-Individual-Xcode/generated/transient.cpp:999:55: error: no matching function for call to 'call'
    typename pythonic::assignable_noescape<decltype(pythonic::types::call(transient_tf_bridge::tflite_bridge::runModel(), 0L, Input_x, std::get<1>(pythonic::types::as_const(self))))>::type y = pythonic::types::call(transient_tf_bridge::tflite_bridge::runM...

That last one isn't in your message above, where you repeat 2096... and the error could be because of some of my code. Can you confirm you're planning on merging 2098 as well?

Jean


serge-sans-paille commented 1 year ago

Short notice: the following numpy version achieves the same result with half the memory:

def pack_my_pythran_assign(uv):
    u_size = uv.shape[0] // 2
    uv_lo = uv[:u_size].copy()  # only the U half needs a temporary copy
    uv[1::2] = uv[u_size:]      # the V half is read in place
    uv[0::2] = uv_lo

No significant impact on runtime from the pythran perspective on my setup, though.

EDIT: it actually brings an interesting speedup to the pythran-compiled version. Could you give it a try? (from the feature/faster-gexpr branch)
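
For completeness, a quick check on a small array that the half-copy variant above produces the expected interleaving (it relies on numpy handling the overlapping slice assignment correctly):

import numpy as np

a = np.arange(12, dtype=np.uint8)  # first half plays the role of U, second half V
pack_my_pythran_assign(a)          # the half-copy variant defined above
print(a)                           # expected: [0 6 1 7 2 8 3 9 4 10 5 11]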

hmaarrfk commented 1 year ago


Away from my computer at the moment, but at some point I found mixed results with this approach. My hunch was that it was due to memory aliasing.

Looking into things in the past, I found that x86 has some interesting instructions that skip the cache (non-temporal stores). I didn't know whether pythran could be optimized for that, and I wanted to avoid speedups due to that kind of optimization.

hmaarrfk commented 1 year ago

Looking into this now that I have a second, it seems that there is no action item on my part. Keep me posted.

serge-sans-paille commented 1 year ago

Pythran is not doing anything specific related to cache. I can get a good speedup if I remove the runtime check for aliasing between the lhs and rhs when assigning between slices, but I currently fail to do so in an elegant way.
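
To make the aliasing concrete, a small illustration (not pythran internals): in the half-copy version, the destination and source of the first interleaving assignment genuinely overlap, which is exactly what that runtime check guards against:

import numpy as np

uv = np.arange(16, dtype=np.uint8)
u_size = uv.shape[0] // 2
# The odd-index destination and the second-half source share memory, so a naive
# element-by-element copy could read values that were already overwritten.
print(np.shares_memory(uv[1::2], uv[u_size:]))  # True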

hmaarrfk commented 1 year ago

As for the memory copy, I'm mostly talking about my results from:

https://github.com/awreece/memory-bandwidth-demo

./memory_profiler 
           read_memory_rep_lodsq: 12.19 GiB/s
                read_memory_loop: 19.69 GiB/s
                 read_memory_sse: 18.63 GiB/s
                 read_memory_avx: 19.02 GiB/s
        read_memory_prefetch_avx: 16.90 GiB/s
               write_memory_loop: 13.23 GiB/s
          write_memory_rep_stosq: 39.86 GiB/s
                write_memory_sse: 13.97 GiB/s
    write_memory_nontemporal_sse: 43.03 GiB/s
                write_memory_avx: 13.26 GiB/s
    write_memory_nontemporal_avx: 47.87 GiB/s
             write_memory_memset: 39.80 GiB/s
       read_memory_rep_lodsq_omp: 49.37 GiB/s
            read_memory_loop_omp: 52.74 GiB/s
             read_memory_sse_omp: 54.41 GiB/s
             read_memory_avx_omp: 52.21 GiB/s
    read_memory_prefetch_avx_omp: 47.91 GiB/s
           write_memory_loop_omp: 31.18 GiB/s
      write_memory_rep_stosq_omp: 55.99 GiB/s
            write_memory_sse_omp: 22.40 GiB/s
write_memory_nontemporal_sse_omp: 56.80 GiB/s
            write_memory_avx_omp: 31.22 GiB/s
write_memory_nontemporal_avx_omp: 54.89 GiB/s
         write_memory_memset_omp: 54.15 GiB/s

I feel like this is a "future improvement", maybe.

I'm not a fan of using threading for speedups, but the non-temporal AVX results show an impressive speedup.