pymc-devs / pytensor

PyTensor allows you to define, optimize, and efficiently evaluate mathematical expressions involving multi-dimensional arrays.
https://pytensor.readthedocs.io

pytensor and blas problems on macOS 15 Sequoia with Apple Silicon #1005

Closed danieltomasz closed 3 weeks ago

danieltomasz commented 2 months ago

Describe the issue:

Since updating to macOS 15, I have had a problem using Apple's implementation of BLAS. Installing pytensor from miniconda3-3.12-24.7.1-0 via conda create -n voxel-bayes-3.12 -c conda-forge pytensor seems to install OpenBLAS instead of Accelerate.

~/.pyenv/versions/miniconda3-3.12-24.7.1-0/bin/conda create -n voxel-bayes-3.12   -c conda-forge  pytensor
Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12

  added / updated specs:
    - pytensor

The following NEW packages will be INSTALLED:

  accelerate         conda-forge/noarch::accelerate-0.34.2-pyhd8ed1ab_0 
  blas               conda-forge/osx-arm64::blas-2.124-openblas 
  blas-devel         conda-forge/osx-arm64::blas-devel-3.9.0-24_osxarm64_openblas 
  brotli-python      conda-forge/osx-arm64::brotli-python-1.1.0-py312hde4cb15_2 
  bzip2              conda-forge/osx-arm64::bzip2-1.0.8-h99b78c6_7 
  ca-certificates    conda-forge/osx-arm64::ca-certificates-2024.8.30-hf0a4a13_0 
  cctools_osx-arm64  conda-forge/osx-arm64::cctools_osx-arm64-1010.6-h4208deb_1 
  certifi            conda-forge/noarch::certifi-2024.8.30-pyhd8ed1ab_0 
  cffi               conda-forge/osx-arm64::cffi-1.17.1-py312h0fad829_0 
  charset-normalizer conda-forge/noarch::charset-normalizer-3.3.2-pyhd8ed1ab_0 
  clang              conda-forge/osx-arm64::clang-17.0.6-default_h360f5da_7 
  clang-17           conda-forge/osx-arm64::clang-17-17.0.6-default_h146c034_7 
  clang_impl_osx-ar~ conda-forge/osx-arm64::clang_impl_osx-arm64-17.0.6-he47c785_19 
  clang_osx-arm64    conda-forge/osx-arm64::clang_osx-arm64-17.0.6-h54d7cd3_19 
  clangxx            conda-forge/osx-arm64::clangxx-17.0.6-default_h360f5da_7 
  clangxx_impl_osx-~ conda-forge/osx-arm64::clangxx_impl_osx-arm64-17.0.6-h50f59cd_19 
  clangxx_osx-arm64  conda-forge/osx-arm64::clangxx_osx-arm64-17.0.6-h54d7cd3_19 
  colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_0 
  compiler-rt        conda-forge/osx-arm64::compiler-rt-17.0.6-h856b3c1_2 
  compiler-rt_osx-a~ conda-forge/noarch::compiler-rt_osx-arm64-17.0.6-h832e737_2 
  cons               conda-forge/noarch::cons-0.4.6-pyhd8ed1ab_0 
  etuples            conda-forge/noarch::etuples-0.3.9-pyhd8ed1ab_0 
  filelock           conda-forge/noarch::filelock-3.16.1-pyhd8ed1ab_0 
  fsspec             conda-forge/noarch::fsspec-2024.9.0-pyhff2d567_0 
  gmp                conda-forge/osx-arm64::gmp-6.3.0-h7bae524_2 
  gmpy2              conda-forge/osx-arm64::gmpy2-2.1.5-py312h87fada9_2 
  h2                 conda-forge/noarch::h2-4.1.0-pyhd8ed1ab_0 
  hpack              conda-forge/noarch::hpack-4.0.0-pyh9f0ad1d_0 
  huggingface_hub    conda-forge/noarch::huggingface_hub-0.25.1-pyhd8ed1ab_0 
  hyperframe         conda-forge/noarch::hyperframe-6.0.1-pyhd8ed1ab_0 
  icu                conda-forge/osx-arm64::icu-75.1-hfee45f7_0 
  idna               conda-forge/noarch::idna-3.10-pyhd8ed1ab_0 
  jinja2             conda-forge/noarch::jinja2-3.1.4-pyhd8ed1ab_0 
  ld64_osx-arm64     conda-forge/osx-arm64::ld64_osx-arm64-951.9-hc81425b_1 
  libabseil          conda-forge/osx-arm64::libabseil-20240116.2-cxx17_h00cdb27_1 
  libblas            conda-forge/osx-arm64::libblas-3.9.0-24_osxarm64_openblas 
  libcblas           conda-forge/osx-arm64::libcblas-3.9.0-24_osxarm64_openblas 
  libclang-cpp17     conda-forge/osx-arm64::libclang-cpp17-17.0.6-default_h146c034_7 
  libcxx             conda-forge/osx-arm64::libcxx-19.1.0-ha82da77_0 
  libcxx-devel       conda-forge/osx-arm64::libcxx-devel-17.0.6-h86353a2_6 
  libexpat           conda-forge/osx-arm64::libexpat-2.6.3-hf9b8971_0 
  libffi             conda-forge/osx-arm64::libffi-3.4.2-h3422bc3_5 
  libgfortran        conda-forge/osx-arm64::libgfortran-5.0.0-13_2_0_hd922786_3 
  libgfortran5       conda-forge/osx-arm64::libgfortran5-13.2.0-hf226fd6_3 
  libiconv           conda-forge/osx-arm64::libiconv-1.17-h0d3ecfb_2 
  liblapack          conda-forge/osx-arm64::liblapack-3.9.0-24_osxarm64_openblas 
  liblapacke         conda-forge/osx-arm64::liblapacke-3.9.0-24_osxarm64_openblas 
  libllvm17          conda-forge/osx-arm64::libllvm17-17.0.6-h5090b49_2 
  libopenblas        conda-forge/osx-arm64::libopenblas-0.3.27-openmp_h517c56d_1 
  libprotobuf        conda-forge/osx-arm64::libprotobuf-4.25.3-hc39d83c_1 
  libsqlite          conda-forge/osx-arm64::libsqlite-3.46.1-hc14010f_0 
  libtorch           conda-forge/osx-arm64::libtorch-2.4.0-cpu_generic_h4365fe2_1 
  libuv              conda-forge/osx-arm64::libuv-1.49.0-hd74edd7_0 
  libxml2            conda-forge/osx-arm64::libxml2-2.12.7-h01dff8b_4 
  libzlib            conda-forge/osx-arm64::libzlib-1.3.1-hfb2fe0b_1 
  llvm-openmp        conda-forge/osx-arm64::llvm-openmp-18.1.8-hde57baf_1 
  llvm-tools         conda-forge/osx-arm64::llvm-tools-17.0.6-h5090b49_2 
  logical-unificati~ conda-forge/noarch::logical-unification-0.4.6-pyhd8ed1ab_0 
  macosx_deployment~ conda-forge/noarch::macosx_deployment_target_osx-arm64-11.0-h6553868_1 
  markupsafe         conda-forge/osx-arm64::markupsafe-2.1.5-py312h024a12e_1 
  minikanren         conda-forge/noarch::minikanren-1.0.3-pyhd8ed1ab_0 
  mpc                conda-forge/osx-arm64::mpc-1.3.1-h8f1351a_1 
  mpfr               conda-forge/osx-arm64::mpfr-4.2.1-hb693164_3 
  mpmath             conda-forge/noarch::mpmath-1.3.0-pyhd8ed1ab_0 
  multipledispatch   conda-forge/noarch::multipledispatch-0.6.0-pyhd8ed1ab_1 
  ncurses            conda-forge/osx-arm64::ncurses-6.5-h7bae524_1 
  networkx           conda-forge/noarch::networkx-3.3-pyhd8ed1ab_1 
  nomkl              conda-forge/noarch::nomkl-1.0-h5ca1d4c_0 
  numpy              conda-forge/osx-arm64::numpy-1.26.4-py312h8442bc7_0 
  openblas           conda-forge/osx-arm64::openblas-0.3.27-openmp_h560b219_1 
  openssl            conda-forge/osx-arm64::openssl-3.3.2-h8359307_0 
  packaging          conda-forge/noarch::packaging-24.1-pyhd8ed1ab_0 
  pip                conda-forge/noarch::pip-24.2-pyh8b19718_1 
  psutil             conda-forge/osx-arm64::psutil-6.0.0-py312h024a12e_1 
  pycparser          conda-forge/noarch::pycparser-2.22-pyhd8ed1ab_0 
  pysocks            conda-forge/noarch::pysocks-1.7.1-pyha2e5f31_6 
  pytensor           conda-forge/osx-arm64::pytensor-2.25.4-py312h3f593ad_0 
  pytensor-base      conda-forge/osx-arm64::pytensor-base-2.25.4-py312h02baea5_0 
  python             conda-forge/osx-arm64::python-3.12.6-h739c21a_1_cpython 
  python_abi         conda-forge/osx-arm64::python_abi-3.12-5_cp312 
  pytorch            conda-forge/osx-arm64::pytorch-2.4.0-cpu_generic_py312h6bd8f41_1 
  pyyaml             conda-forge/osx-arm64::pyyaml-6.0.2-py312h024a12e_1 
  readline           conda-forge/osx-arm64::readline-8.2-h92ec313_1 
  requests           conda-forge/noarch::requests-2.32.3-pyhd8ed1ab_0 
  safetensors        conda-forge/osx-arm64::safetensors-0.4.5-py312he431725_0 
  scipy              conda-forge/osx-arm64::scipy-1.14.1-py312heb3a901_0 
  setuptools         conda-forge/noarch::setuptools-75.1.0-pyhd8ed1ab_0 
  sigtool            conda-forge/osx-arm64::sigtool-0.1.3-h44b9a77_0 
  six                conda-forge/noarch::six-1.16.0-pyh6c4a22f_0 
  sleef              conda-forge/osx-arm64::sleef-3.7-h7783ee8_0 
  sympy              conda-forge/noarch::sympy-1.13.3-pypyh2585a3b_103 
  tapi               conda-forge/osx-arm64::tapi-1300.6.5-h03f4b80_0 
  tk                 conda-forge/osx-arm64::tk-8.6.13-h5083fa2_1 
  toolz              conda-forge/noarch::toolz-0.12.1-pyhd8ed1ab_0 
  tqdm               conda-forge/noarch::tqdm-4.66.5-pyhd8ed1ab_0 
  typing-extensions  conda-forge/noarch::typing-extensions-4.12.2-hd8ed1ab_0 
  typing_extensions  conda-forge/noarch::typing_extensions-4.12.2-pyha770c72_0 
  tzdata             conda-forge/noarch::tzdata-2024a-h8827d51_1 
  urllib3            conda-forge/noarch::urllib3-2.2.3-pyhd8ed1ab_0 
  wheel              conda-forge/noarch::wheel-0.44.0-pyhd8ed1ab_0 
  xz                 conda-forge/osx-arm64::xz-5.2.6-h57fd34a_0 
  yaml               conda-forge/osx-arm64::yaml-0.2.5-h3422bc3_2 
  zstandard          conda-forge/osx-arm64::zstandard-0.23.0-py312h15fbf35_1 
  zstd               conda-forge/osx-arm64::zstd-1.5.6-hb46c0d2_0 

Proceed ([y]/n)? y
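
The BLAS variant that conda selected can be read from the build strings in the plan above. Below is a minimal sketch, assuming the `name  channel/subdir::name-version-build` line layout shown in this log and assuming the variant is the last underscore-separated token of the build string. The helper name `blas_variants` is hypothetical. Note that the `accelerate` entry in the plan appears to be the noarch Python package of that name (it arrives together with pytorch, huggingface_hub and safetensors), not a BLAS build, which is why only the `libblas`-family entries are inspected:

```python
# Lines copied from the package plan in this report.
PLAN = """
  accelerate         conda-forge/noarch::accelerate-0.34.2-pyhd8ed1ab_0
  blas               conda-forge/osx-arm64::blas-2.124-openblas
  libblas            conda-forge/osx-arm64::libblas-3.9.0-24_osxarm64_openblas
  libcblas           conda-forge/osx-arm64::libcblas-3.9.0-24_osxarm64_openblas
  liblapack          conda-forge/osx-arm64::liblapack-3.9.0-24_osxarm64_openblas
  numpy              conda-forge/osx-arm64::numpy-1.26.4-py312h8442bc7_0
"""

# Packages whose build string names the BLAS implementation (an assumption).
BLAS_PACKAGES = {"blas", "libblas", "libcblas", "liblapack"}


def blas_variants(plan: str) -> dict[str, str]:
    """Map each BLAS-family package in a conda plan to its build variant."""
    variants = {}
    for line in plan.splitlines():
        parts = line.split()
        if len(parts) != 2 or "::" not in parts[1]:
            continue  # skip blank lines and anything that is not a plan entry
        name, spec = parts
        if name not in BLAS_PACKAGES:
            continue
        # "libblas-3.9.0-24_osxarm64_openblas" -> build "24_osxarm64_openblas"
        build = spec.split("::", 1)[1].rsplit("-", 2)[-1]
        variants[name] = build.rsplit("_", 1)[-1]
    return variants


if __name__ == "__main__":
    for name, variant in blas_variants(PLAN).items():
        print(f"{name}: {variant}")
```

Running the same function on the plan produced with the "libblas=*=*accelerate" pin would be expected to report `accelerate` for each of these packages instead.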

Running this check:

python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

        Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)

        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.

        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s

Some PyTensor flags:
    blas__ldflags= -L/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib -llapack -lblas -lcblas -lm -Wl,-rpath,/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib
    compiledir= /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.6 | packaged by conda-forge | (main, Sep 22 2024, 14:07:06) [Clang 17.0.6 ]
    sys.prefix= /Users/daniel/.pyenv/versions/voxel-bayes-3.12
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include
    lib directory: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib
    name: blas
    openblas configuration: unknown
    pc file directory: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib/pkgconfig
    version: 3.9.0
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4569863840
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    args: -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem,
      /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -mmacosx-version-min=11.0
    commands: arm64-apple-darwin20.0.0-clang
    linker: ld64
    linker args: -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib,
      -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib,
      -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -mmacosx-version-min=11.0
    name: clang
    version: 16.0.6
  c++:
    args: -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++,
      -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -mmacosx-version-min=11.0
    commands: arm64-apple-darwin20.0.0-clang++
    linker: ld64
    linker args: -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib,
      -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib,
      -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++,
      -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -mmacosx-version-min=11.0
    name: clang
    version: 16.0.6
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.8
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  cross-compiled: true
  host:
    cpu: arm64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/bin/python
  version: '3.12'
SIMD Extensions:
  baseline:
  - NEON
  - NEON_FP16
  - NEON_VFPV4
  - ASIMD
  found:
  - ASIMDHP
  not found:
  - ASIMDFHM

Numpy dot module: numpy
Numpy location: /Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 31.56s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.

Running the same command in an env with pip-installed pytensor results in this:

Some PyTensor flags:
    blas__ldflags= 
    compiledir= /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.6 (main, Sep 28 2024, 17:45:34) [Clang 15.0.0 (clang-1500.3.9.4)]
    sys.prefix= /Users/daniel/.pyenv/versions/3.12.6/envs/zotero-3.12.6
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
/Users/daniel/.pyenv/versions/3.12.6/envs/zotero-3.12.6/lib/python3.12/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "cc",
      "args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.8",
      "commands": "cython"
    },
    "c++": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "c++",
      "args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "build": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    }
  },
  "Build Dependencies": {
    "blas": {
      "name": "openblas64",
      "found": true,
      "version": "0.3.23.dev",
      "detection method": "pkgconfig",
      "include directory": "/opt/arm64-builds/include",
      "lib directory": "/opt/arm64-builds/lib",
      "openblas configuration": "USE_64BITINT=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS= NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=3",
      "pc file directory": "/usr/local/lib/pkgconfig"
    },
    "lapack": {
      "name": "dep4335021056",
      "found": true,
      "version": "1.26.4",
      "detection method": "internal",
      "include directory": "unknown",
      "lib directory": "unknown",
      "openblas configuration": "unknown",
      "pc file directory": "unknown"
    }
  },
  "Python Information": {
    "path": "/private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cibw-run-q69bfk1p/cp312-macosx_arm64/build/venv/bin/python",
    "version": "3.12"
  },
  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }
}
Numpy dot module: numpy
Numpy location: /Users/daniel/.pyenv/versions/3.12.6/envs/zotero-3.12.6/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 45.75s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
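
The two timings are easier to compare with other machines when expressed as throughput. Below is a minimal sketch, assuming the usual 2*M*N*K operation count for one gemm call (an assumption about how to count, not something check_blas.py reports) and using the figures printed above: 10 calls on 5000x5000 matrices, 31.56s in the conda environment and 45.75s in the pip environment. The function name `gemm_gflops` is hypothetical:

```python
def gemm_gflops(seconds: float, n: int = 5000, calls: int = 10) -> float:
    """Approximate throughput for `calls` square n x n gemm calls."""
    # One multiply and one add per term of each inner product.
    flops = 2 * n**3 * calls
    return flops / seconds / 1e9


if __name__ == "__main__":
    for label, seconds in [("conda (31.56s)", 31.56), ("pip (45.75s)", 45.75)]:
        print(f"{label}: {gemm_gflops(seconds):.1f} GFLOP/s")
```

Under these assumptions the conda run works out to roughly 79 GFLOP/s and the pip run to roughly 55 GFLOP/s. The reference table printed by the script uses 2000x2000 matrices, so its times are not directly comparable without this kind of normalization.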

When I specify Accelerate the old way, via "libblas==accelerate" when creating the conda environment, running the check fails. I copied the output here: https://discourse.pymc.io/t/pytensor-support-to-apple-accelerate-blas-with-conda-forge-on-macos-15/15131/2

Reproducible code example:

from `python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")`

Error message:

No response

PyTensor version information:

conda-forge/osx-arm64::pytensor-2.25.4-py312h3f593ad_0

Context for the issue:

No response

maresb commented 2 months ago

Thanks a lot @danieltomasz for the very high quality report. @lucianopaz, do you have any thoughts regarding the BLAS selection mechanism?

danieltomasz commented 2 months ago

One thing I learned that might be useful: numpy 2.2 installed via pip uses Accelerate, while numpy 2.2 installed via the same conda comes with OpenBLAS (I checked this via numpy.show_config()). I installed it in a separate env just to check, because pytensor doesn't yet support numpy >= 2.0.
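
This check can also be scripted. Below is a minimal sketch, assuming the "Build Dependencies" layout that numpy 1.26.4 printed earlier in this thread. The helper `blas_name` is hypothetical, and the sample dict is abbreviated from the pip environment's output. On recent NumPy versions, `numpy.show_config(mode="dicts")` should return a dict of this shape, but treat that as an assumption to verify against the installed version:

```python
# Abbreviated from the pip environment's numpy config output in this thread.
SAMPLE_CONFIG = {
    "Build Dependencies": {
        "blas": {"name": "openblas64", "found": True, "version": "0.3.23.dev"},
        "lapack": {"name": "dep4335021056", "found": True, "version": "1.26.4"},
    }
}


def blas_name(config: dict) -> str:
    """Return the BLAS library name recorded in a NumPy build config dict."""
    blas = config.get("Build Dependencies", {}).get("blas", {})
    if not blas.get("found", False):
        return "not found"
    return blas.get("name", "unknown")


if __name__ == "__main__":
    # With NumPy installed, one could pass numpy.show_config(mode="dicts")
    # instead of the sample (assumption: that mode exists in your version).
    print(blas_name(SAMPLE_CONFIG))
```

Note that this only reports what NumPy was built against. PyTensor links BLAS separately through its blas__ldflags setting, so the two can differ.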

maresb commented 2 months ago

That is indeed very interesting, thanks @danieltomasz.

The Conda dependency chain is:

conda-forge/osx-arm64/pytensor-2.25.4-py312h3f593ad_0.conda -> accelerate, blas
conda-forge/osx-arm64/blas-2.124-openblas.conda -> blas-devel 3.9.0
blas-devel 3.9.0 -> openblas 0.3.27.*

One way to get more flexibility to help debug this is to instead use the pytensor-base package on conda-forge. That should allow us to specify accelerate without installing openblas. But you'll need to install your own C compilers as well.

@danieltomasz, does this give you something to experiment with? I don't have a Mac myself, so unfortunately I can't directly debug this.
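
To make the chain concrete, here is a minimal sketch that walks it as a small graph. The edges are transcribed by hand from the dependency chain quoted in this comment, as I read it, rather than queried from conda. Package names are simplified to drop versions and build strings, and `find_chain` is a hypothetical helper:

```python
from collections import deque

# Edges transcribed from the dependency chain in this comment (versions dropped).
DEPENDS_ON = {
    "pytensor": ["accelerate", "blas"],
    "blas": ["blas-devel"],
    "blas-devel": ["openblas"],
}


def find_chain(graph: dict[str, list[str]], start: str, target: str) -> list[str]:
    """Breadth-first search for the shortest dependency path, or [] if none."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for dep in graph.get(path[-1], []):
            if dep not in seen:
                seen.add(dep)
                queue.append(path + [dep])
    return []


if __name__ == "__main__":
    print(" -> ".join(find_chain(DEPENDS_ON, "pytensor", "openblas")))
```

In this hand-written graph the only route from pytensor to openblas goes through the blas metapackage, which is consistent with the suggestion to try pytensor-base instead.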

danieltomasz commented 2 months ago

When I force "libblas=*=*accelerate":

~/.pyenv/versions/miniconda3-3.12-24.7.1-0/bin/conda create -n voxel-bayes-3.12  -c conda-forge  pytensor "libblas=*=*accelerate" 
Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12

  added / updated specs:
    - libblas[build=*accelerate]
    - pytensor

The following NEW packages will be INSTALLED:

  accelerate         conda-forge/noarch::accelerate-0.34.2-pyhd8ed1ab_0 
  blas               conda-forge/osx-arm64::blas-2.124-accelerate 
  blas-devel         conda-forge/osx-arm64::blas-devel-3.9.0-24_osxarm64_accelerate 
  brotli-python      conda-forge/osx-arm64::brotli-python-1.1.0-py312hde4cb15_2 
  bzip2              conda-forge/osx-arm64::bzip2-1.0.8-h99b78c6_7 
  ca-certificates    conda-forge/osx-arm64::ca-certificates-2024.8.30-hf0a4a13_0 
  cctools_osx-arm64  conda-forge/osx-arm64::cctools_osx-arm64-1010.6-h4208deb_1 
  certifi            conda-forge/noarch::certifi-2024.8.30-pyhd8ed1ab_0 
  cffi               conda-forge/osx-arm64::cffi-1.17.1-py312h0fad829_0 
  charset-normalizer conda-forge/noarch::charset-normalizer-3.3.2-pyhd8ed1ab_0 
  clang              conda-forge/osx-arm64::clang-17.0.6-default_h360f5da_7 
  clang-17           conda-forge/osx-arm64::clang-17-17.0.6-default_h146c034_7 
  clang_impl_osx-ar~ conda-forge/osx-arm64::clang_impl_osx-arm64-17.0.6-he47c785_19 
  clang_osx-arm64    conda-forge/osx-arm64::clang_osx-arm64-17.0.6-h54d7cd3_19 
  clangxx            conda-forge/osx-arm64::clangxx-17.0.6-default_h360f5da_7 
  clangxx_impl_osx-~ conda-forge/osx-arm64::clangxx_impl_osx-arm64-17.0.6-h50f59cd_19 
  clangxx_osx-arm64  conda-forge/osx-arm64::clangxx_osx-arm64-17.0.6-h54d7cd3_19 
  colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_0 
  compiler-rt        conda-forge/osx-arm64::compiler-rt-17.0.6-h856b3c1_2 
  compiler-rt_osx-a~ conda-forge/noarch::compiler-rt_osx-arm64-17.0.6-h832e737_2 
  cons               conda-forge/noarch::cons-0.4.6-pyhd8ed1ab_0 
  etuples            conda-forge/noarch::etuples-0.3.9-pyhd8ed1ab_0 
  filelock           conda-forge/noarch::filelock-3.16.1-pyhd8ed1ab_0 
  fsspec             conda-forge/noarch::fsspec-2024.9.0-pyhff2d567_0 
  gmp                conda-forge/osx-arm64::gmp-6.3.0-h7bae524_2 
  gmpy2              conda-forge/osx-arm64::gmpy2-2.1.5-py312h87fada9_2 
  h2                 conda-forge/noarch::h2-4.1.0-pyhd8ed1ab_0 
  hpack              conda-forge/noarch::hpack-4.0.0-pyh9f0ad1d_0 
  huggingface_hub    conda-forge/noarch::huggingface_hub-0.25.1-pyhd8ed1ab_0 
  hyperframe         conda-forge/noarch::hyperframe-6.0.1-pyhd8ed1ab_0 
  icu                conda-forge/osx-arm64::icu-75.1-hfee45f7_0 
  idna               conda-forge/noarch::idna-3.10-pyhd8ed1ab_0 
  jinja2             conda-forge/noarch::jinja2-3.1.4-pyhd8ed1ab_0 
  ld64_osx-arm64     conda-forge/osx-arm64::ld64_osx-arm64-951.9-hc81425b_1 
  libabseil          conda-forge/osx-arm64::libabseil-20240116.2-cxx17_h00cdb27_1 
  libblas            conda-forge/osx-arm64::libblas-3.9.0-24_osxarm64_accelerate 
  libcblas           conda-forge/osx-arm64::libcblas-3.9.0-24_osxarm64_accelerate 
  libclang-cpp17     conda-forge/osx-arm64::libclang-cpp17-17.0.6-default_h146c034_7 
  libcxx             conda-forge/osx-arm64::libcxx-19.1.0-ha82da77_0 
  libcxx-devel       conda-forge/osx-arm64::libcxx-devel-17.0.6-h86353a2_6 
  libexpat           conda-forge/osx-arm64::libexpat-2.6.3-hf9b8971_0 
  libffi             conda-forge/osx-arm64::libffi-3.4.2-h3422bc3_5 
  libgfortran        conda-forge/osx-arm64::libgfortran-5.0.0-13_2_0_hd922786_3 
  libgfortran5       conda-forge/osx-arm64::libgfortran5-13.2.0-hf226fd6_3 
  libiconv           conda-forge/osx-arm64::libiconv-1.17-h0d3ecfb_2 
  liblapack          conda-forge/osx-arm64::liblapack-3.9.0-24_osxarm64_accelerate 
  liblapacke         conda-forge/osx-arm64::liblapacke-3.9.0-24_osxarm64_accelerate 
  libllvm17          conda-forge/osx-arm64::libllvm17-17.0.6-h5090b49_2 
  libprotobuf        conda-forge/osx-arm64::libprotobuf-4.25.3-hc39d83c_1 
  libsqlite          conda-forge/osx-arm64::libsqlite-3.46.1-hc14010f_0 
  libtorch           conda-forge/osx-arm64::libtorch-2.4.0-cpu_generic_h4365fe2_1 
  libuv              conda-forge/osx-arm64::libuv-1.49.0-hd74edd7_0 
  libxml2            conda-forge/osx-arm64::libxml2-2.12.7-h01dff8b_4 
  libzlib            conda-forge/osx-arm64::libzlib-1.3.1-hfb2fe0b_1 
  llvm-openmp        conda-forge/osx-arm64::llvm-openmp-18.1.8-hde57baf_1 
  llvm-tools         conda-forge/osx-arm64::llvm-tools-17.0.6-h5090b49_2 
  logical-unificati~ conda-forge/noarch::logical-unification-0.4.6-pyhd8ed1ab_0 
  macosx_deployment~ conda-forge/noarch::macosx_deployment_target_osx-arm64-11.0-h6553868_1 
  markupsafe         conda-forge/osx-arm64::markupsafe-2.1.5-py312h024a12e_1 
  minikanren         conda-forge/noarch::minikanren-1.0.3-pyhd8ed1ab_0 
  mpc                conda-forge/osx-arm64::mpc-1.3.1-h8f1351a_1 
  mpfr               conda-forge/osx-arm64::mpfr-4.2.1-hb693164_3 
  mpmath             conda-forge/noarch::mpmath-1.3.0-pyhd8ed1ab_0 
  multipledispatch   conda-forge/noarch::multipledispatch-0.6.0-pyhd8ed1ab_1 
  ncurses            conda-forge/osx-arm64::ncurses-6.5-h7bae524_1 
  networkx           conda-forge/noarch::networkx-3.3-pyhd8ed1ab_1 
  nomkl              conda-forge/noarch::nomkl-1.0-h5ca1d4c_0 
  numpy              conda-forge/osx-arm64::numpy-1.26.4-py312h8442bc7_0 
  openssl            conda-forge/osx-arm64::openssl-3.3.2-h8359307_0 
  packaging          conda-forge/noarch::packaging-24.1-pyhd8ed1ab_0 
  pip                conda-forge/noarch::pip-24.2-pyh8b19718_1 
  psutil             conda-forge/osx-arm64::psutil-6.0.0-py312h024a12e_1 
  pycparser          conda-forge/noarch::pycparser-2.22-pyhd8ed1ab_0 
  pysocks            conda-forge/noarch::pysocks-1.7.1-pyha2e5f31_6 
  pytensor           conda-forge/osx-arm64::pytensor-2.25.4-py312h3f593ad_0 
  pytensor-base      conda-forge/osx-arm64::pytensor-base-2.25.4-py312h02baea5_0 
  python             conda-forge/osx-arm64::python-3.12.6-h739c21a_1_cpython 
  python_abi         conda-forge/osx-arm64::python_abi-3.12-5_cp312 
  pytorch            conda-forge/osx-arm64::pytorch-2.4.0-cpu_generic_py312h6bd8f41_1 
  pyyaml             conda-forge/osx-arm64::pyyaml-6.0.2-py312h024a12e_1 
  readline           conda-forge/osx-arm64::readline-8.2-h92ec313_1 
  requests           conda-forge/noarch::requests-2.32.3-pyhd8ed1ab_0 
  safetensors        conda-forge/osx-arm64::safetensors-0.4.5-py312he431725_0 
  scipy              conda-forge/osx-arm64::scipy-1.14.1-py312heb3a901_0 
  setuptools         conda-forge/noarch::setuptools-75.1.0-pyhd8ed1ab_0 
  sigtool            conda-forge/osx-arm64::sigtool-0.1.3-h44b9a77_0 
  six                conda-forge/noarch::six-1.16.0-pyh6c4a22f_0 
  sleef              conda-forge/osx-arm64::sleef-3.7-h7783ee8_0 
  sympy              conda-forge/noarch::sympy-1.13.3-pypyh2585a3b_103 
  tapi               conda-forge/osx-arm64::tapi-1300.6.5-h03f4b80_0 
  tk                 conda-forge/osx-arm64::tk-8.6.13-h5083fa2_1 
  toolz              conda-forge/noarch::toolz-0.12.1-pyhd8ed1ab_0 
  tqdm               conda-forge/noarch::tqdm-4.66.5-pyhd8ed1ab_0 
  typing-extensions  conda-forge/noarch::typing-extensions-4.12.2-hd8ed1ab_0 
  typing_extensions  conda-forge/noarch::typing_extensions-4.12.2-pyha770c72_0 
  tzdata             conda-forge/noarch::tzdata-2024a-h8827d51_1 
  urllib3            conda-forge/noarch::urllib3-2.2.3-pyhd8ed1ab_0 
  wheel              conda-forge/noarch::wheel-0.44.0-pyhd8ed1ab_0 
  xz                 conda-forge/osx-arm64::xz-5.2.6-h57fd34a_0 
  yaml               conda-forge/osx-arm64::yaml-0.2.5-h3422bc3_2 
  zstandard          conda-forge/osx-arm64::zstandard-0.23.0-py312h15fbf35_1 
  zstd               conda-forge/osx-arm64::zstd-1.5.6-hb46c0d2_0 

It installs successfully, but gives me errors (segmentation faults):

from `python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")`
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/vm.py", line 1227, in make_all
    node.op.make_thunk(node, storage_map, compute_map, [], impl=impl)
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/op.py", line 119, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/op.py", line 84, in make_c_thunk
    outputs = cl.make_thunk(
              ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1182, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
                                                             ^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1103, in __compile__
    thunk, module = self.cthunk_factory(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1627, in cthunk_factory
    module = cache.module_from_key(key=key, lnk=self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/cmodule.py", line 1255, in module_from_key
    module = lnk.compile_cmodule(location)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1528, in compile_cmodule
    module = c_compiler.compile_str(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/cmodule.py", line 2654, in compile_str
    raise CompileError(
pytensor.link.c.exceptions.CompileError: Compilation failed (return status=1):
/Users/daniel/.pyenv/versions/voxel-bayes-3.12/bin/clang++ -dynamiclib -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -fPIC -undefined dynamic_lookup -I/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/numpy/core/include -I/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include/python3.12 -I/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/c_code -L/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib -fvisibility=hidden -o /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64/tmp4ndb3uui/mbe23404cc39ec1a668b1ae18701f267b8ee61fabc03b6968263aa4f888d9dec6.so /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64/tmp4ndb3uui/mod.cpp
clang++: error: unable to execute command: Segmentation fault: 11
clang++: error: linker command failed due to signal (use -v to see invocation)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/misc/check_blas.py", line 274, in <module>
    t, impl = execute(
              ^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/misc/check_blas.py", line 57, in execute
    f = pytensor.function([], updates=[(c, 0.4 * c + 0.8 * dot(a, b))])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/compile/function/__init__.py", line 318, in function
    fn = pfunc(
         ^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/compile/function/pfunc.py", line 465, in pfunc
    return orig_function(
           ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/compile/function/types.py", line 1762, in orig_function
    fn = m.create(defaults)
         ^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/compile/function/types.py", line 1654, in create
    _fn, _i, _o = self.linker.make_thunk(
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/basic.py", line 245, in make_thunk
    return self.make_all(
           ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/vm.py", line 1236, in make_all
    raise_with_op(fgraph, node)
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/utils.py", line 524, in raise_with_op
    raise exc_value.with_traceback(exc_trace)
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/vm.py", line 1227, in make_all
    node.op.make_thunk(node, storage_map, compute_map, [], impl=impl)
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/op.py", line 119, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/op.py", line 84, in make_c_thunk
    outputs = cl.make_thunk(
              ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1182, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
                                                             ^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1103, in __compile__
    thunk, module = self.cthunk_factory(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1627, in cthunk_factory
    module = cache.module_from_key(key=key, lnk=self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/cmodule.py", line 1255, in module_from_key
    module = lnk.compile_cmodule(location)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1528, in compile_cmodule
    module = c_compiler.compile_str(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/cmodule.py", line 2654, in compile_str
    raise CompileError(
pytensor.link.c.exceptions.CompileError: Compilation failed (return status=1):
/Users/daniel/.pyenv/versions/voxel-bayes-3.12/bin/clang++ -dynamiclib -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -fPIC -undefined dynamic_lookup -I/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/numpy/core/include -I/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include/python3.12 -I/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/c_code -L/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib -fvisibility=hidden -o /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64/tmp4ndb3uui/mbe23404cc39ec1a668b1ae18701f267b8ee61fabc03b6968263aa4f888d9dec6.so /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64/tmp4ndb3uui/mod.cpp
clang++: error: unable to execute command: Segmentation fault: 11
clang++: error: linker command failed due to signal (use -v to see invocation)

Apply node that caused the error: Gemm{inplace}(<Matrix(float64, shape=(?, ?))>, 0.8, <Matrix(float64, shape=(?, ?))>, <Matrix(float64, shape=(?, ?))>, 0.4)
Toposort index: 0
Inputs types: [TensorType(float64, shape=(None, None)), TensorType(float64, shape=()), TensorType(float64, shape=(None, None)), TensorType(float64, shape=(None, None)), TensorType(float64, shape=())]

HINT: Use a linker other than the C linker to print the inputs' shapes and strides.
HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.
zsh: command not found: from  

Some of those errors are discussed here: https://discourse.pymc.io/t/environment-not-working-anymore-on-macos/14210

maresb commented 2 months ago

@danieltomasz, could you please try using the pytensor-base package instead of pytensor?

danieltomasz commented 2 months ago
/.pyenv/versions/miniconda3-3.12-24.7.1-0/bin/conda create -n voxel-bayes-3.12  -c conda-forge pytensor-base
Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12

  added / updated specs:
    - pytensor-base

The following NEW packages will be INSTALLED:

  bzip2              conda-forge/osx-arm64::bzip2-1.0.8-h99b78c6_7 
  ca-certificates    conda-forge/osx-arm64::ca-certificates-2024.8.30-hf0a4a13_0 
  cons               conda-forge/noarch::cons-0.4.6-pyhd8ed1ab_0 
  etuples            conda-forge/noarch::etuples-0.3.9-pyhd8ed1ab_0 
  filelock           conda-forge/noarch::filelock-3.16.1-pyhd8ed1ab_0 
  libblas            conda-forge/osx-arm64::libblas-3.9.0-24_osxarm64_openblas 
  libcblas           conda-forge/osx-arm64::libcblas-3.9.0-24_osxarm64_openblas 
  libcxx             conda-forge/osx-arm64::libcxx-19.1.0-ha82da77_0 
  libexpat           conda-forge/osx-arm64::libexpat-2.6.3-hf9b8971_0 
  libffi             conda-forge/osx-arm64::libffi-3.4.2-h3422bc3_5 
  libgfortran        conda-forge/osx-arm64::libgfortran-5.0.0-13_2_0_hd922786_3 
  libgfortran5       conda-forge/osx-arm64::libgfortran5-13.2.0-hf226fd6_3 
  liblapack          conda-forge/osx-arm64::liblapack-3.9.0-24_osxarm64_openblas 
  libopenblas        conda-forge/osx-arm64::libopenblas-0.3.27-openmp_h517c56d_1 
  libsqlite          conda-forge/osx-arm64::libsqlite-3.46.1-hc14010f_0 
  libzlib            conda-forge/osx-arm64::libzlib-1.3.1-hfb2fe0b_1 
  llvm-openmp        conda-forge/osx-arm64::llvm-openmp-18.1.8-hde57baf_1 
  logical-unificati~ conda-forge/noarch::logical-unification-0.4.6-pyhd8ed1ab_0 
  minikanren         conda-forge/noarch::minikanren-1.0.3-pyhd8ed1ab_0 
  multipledispatch   conda-forge/noarch::multipledispatch-0.6.0-pyhd8ed1ab_1 
  ncurses            conda-forge/osx-arm64::ncurses-6.5-h7bae524_1 
  numpy              conda-forge/osx-arm64::numpy-1.26.4-py312h8442bc7_0 
  openssl            conda-forge/osx-arm64::openssl-3.3.2-h8359307_0 
  pip                conda-forge/noarch::pip-24.2-pyh8b19718_1 
  pytensor-base      conda-forge/osx-arm64::pytensor-base-2.25.4-py312h02baea5_0 
  python             conda-forge/osx-arm64::python-3.12.6-h739c21a_1_cpython 
  python_abi         conda-forge/osx-arm64::python_abi-3.12-5_cp312 
  readline           conda-forge/osx-arm64::readline-8.2-h92ec313_1 
  scipy              conda-forge/osx-arm64::scipy-1.14.1-py312heb3a901_0 
  setuptools         conda-forge/noarch::setuptools-75.1.0-pyhd8ed1ab_0 
  six                conda-forge/noarch::six-1.16.0-pyh6c4a22f_0 
  tk                 conda-forge/osx-arm64::tk-8.6.13-h5083fa2_1 
  toolz              conda-forge/noarch::toolz-0.12.1-pyhd8ed1ab_0 
  tzdata             conda-forge/noarch::tzdata-2024a-h8827d51_1 
  wheel              conda-forge/noarch::wheel-0.44.0-pyhd8ed1ab_0 
  xz                 conda-forge/osx-arm64::xz-5.2.6-h57fd34a_0 
maresb commented 2 months ago

Ok, so that's installing numpy with openblas. And what happens now if you try and force accelerate?
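
One way to confirm which BLAS variant the solver picked is to read the trailing tag of the build string in the package plan. The sketch below is only an illustration: the helper name `blas_variant` and the list of variant tags are assumptions, and the sample lines are copied from the plan above.

```python
from typing import Optional

# Assumption: an illustrative, non-exhaustive list of BLAS variant tags that
# can appear at the end of a conda-forge build string.
BLAS_VARIANTS = ("openblas", "accelerate", "mkl", "netlib", "blis")


def blas_variant(plan_line: str) -> Optional[str]:
    """Return the BLAS variant tag found in one package-plan line, if any."""
    parts = plan_line.split()
    if len(parts) < 2:
        return None
    spec = parts[1]                   # channel/platform::name-version-build
    build = spec.rsplit("-", 1)[-1]   # e.g. 24_osxarm64_openblas
    tag = build.rsplit("_", 1)[-1]    # e.g. openblas
    return tag if tag in BLAS_VARIANTS else None


# Lines copied from the package plan in this thread.
plan = [
    "  libblas            conda-forge/osx-arm64::libblas-3.9.0-24_osxarm64_openblas ",
    "  libcblas           conda-forge/osx-arm64::libcblas-3.9.0-24_osxarm64_openblas ",
    "  numpy              conda-forge/osx-arm64::numpy-1.26.4-py312h8442bc7_0 ",
]

for line in plan:
    print(line.split()[0], "->", blas_variant(line))
```

Packages whose build string carries no variant tag (such as numpy here) yield `None`, so only the BLAS metapackages are reported.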

danieltomasz commented 2 months ago

@maresb, as I wrote here, it installs fine, but testing it gives a segmentation fault: https://github.com/pymc-devs/pytensor/issues/1005#issuecomment-2380841854

danieltomasz commented 2 months ago

Also, this is the output from numpy even when I force accelerate in conda via ~/.pyenv/versions/miniconda3-3.12-24.7.1-0/bin/conda create -n voxel-bayes-3.12 -c conda-forge pytensor-base "libblas=*=*accelerate":

>>> import numpy as np
>>> np.show_config()
/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "clang",
      "linker": "ld64",
      "version": "16.0.6",
      "commands": "arm64-apple-darwin20.0.0-clang",
      "args": "-ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0",
      "linker args": "-Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.8",
      "commands": "cython"
    },
    "c++": {
      "name": "clang",
      "linker": "ld64",
      "version": "16.0.6",
      "commands": "arm64-apple-darwin20.0.0-clang++",
      "args": "-ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0",
      "linker args": "-Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "arm64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "build": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "cross-compiled": true
  },
  "Build Dependencies": {
    "blas": {
      "name": "blas",
      "found": true,
      "version": "3.9.0",
      "detection method": "pkgconfig",
      "include directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include",
      "lib directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib",
      "openblas configuration": "unknown",
      "pc file directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib/pkgconfig"
    },
    "lapack": {
      "name": "dep4569863840",
      "found": true,
      "version": "1.26.4",
      "detection method": "internal",
      "include directory": "unknown",
      "lib directory": "unknown",
      "openblas configuration": "unknown",
      "pc file directory": "unknown"
    }
  },
  "Python Information": {
    "path": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/bin/python",
    "version": "3.12"
  },
  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }
}
danieltomasz commented 2 months ago

When installing only numpy with forced accelerate

Python 3.12.6 | packaged by conda-forge | (main, Sep 22 2024, 14:07:06) [Clang 17.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> print(np.__version__)
2.1.1
>>> np.show_config()
/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "clang",
      "linker": "ld64",
      "version": "17.0.6",
      "commands": "arm64-apple-darwin20.0.0-clang",
      "args": "-ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1725411805471/work=/usr/local/src/conda/numpy-2.1.1, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0, -mmacosx-version-min=11.0",
      "linker args": "-Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1725411805471/work=/usr/local/src/conda/numpy-2.1.1, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0, -mmacosx-version-min=11.0"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.11",
      "commands": "cython"
    },
    "c++": {
      "name": "clang",
      "linker": "ld64",
      "version": "17.0.6",
      "commands": "arm64-apple-darwin20.0.0-clang++",
      "args": "-ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1725411805471/work=/usr/local/src/conda/numpy-2.1.1, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0, -mmacosx-version-min=11.0",
      "linker args": "-Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1725411805471/work=/usr/local/src/conda/numpy-2.1.1, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0, -mmacosx-version-min=11.0"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "arm64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "build": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "cross-compiled": true
  },
  "Build Dependencies": {
    "blas": {
      "name": "blas",
      "found": true,
      "version": "3.9.0",
      "detection method": "pkgconfig",
      "include directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include",
      "lib directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib",
      "openblas configuration": "unknown",
      "pc file directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib/pkgconfig"
    },
    "lapack": {
      "name": "lapack",
      "found": true,
      "version": "3.9.0",
      "detection method": "pkgconfig",
      "include directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include",
      "lib directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib",
      "openblas configuration": "unknown",
      "pc file directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib/pkgconfig"
    }
  },
  "Python Information": {
    "path": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/bin/python",
    "version": "3.12"
  },
  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }
}
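
Both dumps above report a BLAS dependency named only `blas`, detected via pkgconfig, so the dump by itself does not name the implementation behind it. A minimal sketch for pulling just those two fields out of such a dictionary follows. The helper `describe_blas` is hypothetical and the sample dictionary is trimmed from the output above. With NumPy 1.25 or newer, I believe the live dictionary can be obtained with `np.show_config(mode="dicts")`, but that call is left out so the sketch runs without NumPy.

```python
def describe_blas(config: dict) -> str:
    """Summarize the BLAS entry of a NumPy build-config dictionary."""
    blas = config.get("Build Dependencies", {}).get("blas", {})
    name = blas.get("name", "unknown")
    method = blas.get("detection method", "unknown")
    return f"{name} (detected via {method})"


# Trimmed copy of the "Build Dependencies" section shown above.
conda_numpy_config = {
    "Build Dependencies": {
        "blas": {
            "name": "blas",
            "found": True,
            "version": "3.9.0",
            "detection method": "pkgconfig",
        }
    }
}

print(describe_blas(conda_numpy_config))
```

A generic name like `blas` here means the concrete library has to be checked some other way, for example from the installed `libblas` build string.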

I need to leave, but I might try something out of the box, like checking whether installing pytensor via "pixi" pulls accelerate (there might be something particular to my setup in how conda is linking, and trying a different package manager might help). Maybe someone with Apple Silicon can replicate this in the meantime.

maresb commented 2 months ago

Thanks so much for all the diagnosis @danieltomasz!

For when you find some more time, I wonder if lower versions of NumPy might work? For example <2?

danieltomasz commented 2 months ago

Unfortunately, neither pixi, nor changing the Python version to 3.11, nor asking for a lower version of numpy provides the accelerate libraries (it is openblas by default). When I installed numpy via pip, it installed numpy 2.2 with accelerate, but adding pytensor to this environment downgrades numpy to one that is using openblas64.

maresb commented 2 months ago

> but adding pytensor to this environment downgrades numpy to one that is using openblas64

Thanks @danieltomasz for getting back to me!

Are you able to find some earlier conda-forge version of numpy that works with accelerate on your system?

danieltomasz commented 2 months ago

Hi @maresb, I think conda and numpy worked fine earlier (the latest numpy version <2 is from February). I cannot pinpoint the exact moment, but it was probably the update to MacOS 15 about 2 weeks ago that changed things (I also think conda might recently have changed the clang compiler it uses with the Python it ships, but I am not sure about this).

What could be worth checking:

1) Whether someone with Apple Silicon who is still on MacOS 14 can install pytensor with accelerate
2) Whether other people on MacOS 15 and Apple Silicon can reproduce this behaviour

danieltomasz commented 2 months ago

It is just my intuition, but forcing blas to accelerate installs fine and then produces an error when running, due to problems with compilers on MacOS 15, whereas it works with openblas.

Also, there were updates to accelerate in MacOS 15 (https://developer.apple.com/documentation/accelerate/blas/), and these discussions might be relevant: https://github.com/conda-forge/blas-feedstock/issues/103 and https://github.com/conda-forge/numpy-feedstock/issues/253

danieltomasz commented 1 month ago

Quick update: when I install via pip

pip install -U --no-binary :all: numpy pytensor

numpy seems to use accelerate, but pytensor fails to do so. The result of running

python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

is below

WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

        Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)

        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.

        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s

Some PyTensor flags:
    blas__ldflags=
    compiledir= /Users/daniel/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.7 (main, Oct 11 2024, 01:24:59) [Clang 16.0.0 (clang-1600.0.26.3)]
    sys.prefix= /Users/daniel/.pyenv/versions/3.12.7
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
Build Dependencies:
  blas:
    detection method: system
    found: true
    include directory: unknown
    lib directory: unknown
    name: accelerate
    openblas configuration: unknown
    pc file directory: unknown
    version: unknown
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4409437856
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    commands: cc
    linker: ld64
    name: clang
    version: 16.0.0
  c++:
    commands: c++
    linker: ld64
    name: clang
    version: 16.0.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.11
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  host:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/daniel/.pyenv/versions/3.12.7/bin/python3.12
  version: '3.12'
SIMD Extensions:
  baseline:
  - NEON
  - NEON_FP16
  - NEON_VFPV4
  - ASIMD
  found:
  - ASIMDHP
  not found:
  - ASIMDFHM

Numpy dot module: numpy
Numpy location: /Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 13.11s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.

I am now on MacOS 15.1
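
To put the reported time in perspective, it can be converted to an approximate throughput. This is a back-of-the-envelope sketch: the function name `gemm_gflops` is made up here, and the operation count uses the usual 2·M·N·K estimate for a dense matrix product, ignoring the cheaper scaling terms.

```python
def gemm_gflops(m: int, n: int, k: int, calls: int, seconds: float) -> float:
    """Approximate throughput in GFLOP/s for repeated gemm calls."""
    # An (m, k) @ (k, n) product needs about 2*m*n*k floating-point
    # operations; the alpha/beta scaling adds only O(m*n) and is ignored.
    flops = 2.0 * m * n * k * calls
    return flops / seconds / 1e9


# Numbers taken from the check_blas.py output above:
# 10 calls, (5000, 5000) x (5000, 5000), 13.11 s in total.
rate = gemm_gflops(5000, 5000, 5000, calls=10, seconds=13.11)
print(f"{rate:.1f} GFLOP/s")  # prints "190.7 GFLOP/s"
```

Re-running the same calculation after switching the BLAS backend gives a quick way to compare setups on the same machine.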

lucianopaz commented 1 month ago

@danieltomasz, the phrase Numpy config: (used when the PyTensor flag "blas__ldflags" is empty) is from old theano and we haven't updated it. Currently, numpy's config information is a bit deprecated in light of the newer build chain that they use. For that reason, we had to rely on something different. To get a better picture of what's going on, please check out the branch from this PR and run the following:

import logging

logger = logging.getLogger("pytensor.link.c.cmodule")
logger.setLevel(logging.DEBUG)

import pytensor

After the last import, you should see all of the detailed logs from cmodule. I would like to ask you to paste all the output you get here.

I would like to see what errors pytensor runs into when it tries to determine the default_blas_flags. You'll see that pytensor first tries to link against MKL (which will obviously fail on M* chips), and it should log some information about not finding the libraries. The important thing to me is what happens when it tries to find blas and cblas. Both of these should be available from the Accelerate framework that macOS provides, via clang++'s search directories.

danieltomasz commented 1 month ago

The environment wasn't completely clean, but I uninstalled pytensor and numpy and then installed them again via

pip install --no-binary :all: numpy git+https://github.com/pymc-devs/pytensor.git@b314ca67e841b6fc0aac5ea7b5bcc11700565b1e

Output from pytensor

DEBUG (pytensor.link.c.cmodule): Will search for BLAS libraries in the following directories:
/Library/Developer/CommandLineTools/usr/lib/clang/16
/Users/daniel/.pyenv/versions/3.12.7/lib
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with intel threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with GNU OpenMP threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking Lapack + blas
DEBUG (pytensor.link.c.cmodule): Required file 'lapack' not found
DEBUG (pytensor.link.c.cmodule): Required file lapack not found
DEBUG (pytensor.link.c.cmodule): Checking blas alone
DEBUG (pytensor.link.c.cmodule): Required file 'blas' not found
DEBUG (pytensor.link.c.cmodule): Required file blas not found
DEBUG (pytensor.link.c.cmodule): Checking openblas
DEBUG (pytensor.link.c.cmodule): Required file 'openblas' not found
DEBUG (pytensor.link.c.cmodule): Required file openblas not found
DEBUG (pytensor.link.c.cmodule): Failed to identify blas ldflags. Will leave them empty.
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

And in the same session

>>> import numpy as np
>>> np.show_config()
Build Dependencies:
  blas:
    detection method: system
    found: true
    include directory: unknown
    lib directory: unknown
    name: accelerate
    openblas configuration: unknown
    pc file directory: unknown
    version: unknown
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4409437856
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    commands: cc
    linker: ld64
    name: clang
    version: 16.0.0
  c++:
    commands: c++
    linker: ld64
    name: clang
    version: 16.0.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.11
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  host:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/daniel/.pyenv/versions/3.12.7/bin/python3.12
  version: '3.12'
SIMD Extensions:
  baseline:
  - NEON
  - NEON_FP16
  - NEON_VFPV4
  - ASIMD
  found:
  - ASIMDHP
  not found:
  - ASIMDFHM

lucianopaz commented 1 month ago

Thanks @danieltomasz, the logs say that we couldn’t find a blas library in the search directories. I can think of a couple of dumb causes, but I’ll have to ask you to run a couple of other tests.

  1. Can you check what path to an executable you get as pytensor.config.cxx? Is it the system clang or is it the conda clang?
  2. Can you try to run that cxx executable in a terminal as cxx -print-search-dirs? What directories do you get in the libraries entry? Is the conda env lib path included?
  3. Can you verify if there is any file that has the name blas in the conda env lib directory? If there is, what’s the file name extension?
  4. Can you try to run pytensor.link.c.cmodule.try_blas_flag(["-framework", "Accelerate"]) and see if you get something?
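
Steps 2 and 3 can also be scripted. The following is a minimal sketch using only the Python standard library. The helper names (`parse_library_dirs`, `find_blas_files`, `compiler_library_dirs`) are made up for illustration and are not part of pytensor, and the sketch assumes the compiler prints a `libraries: =dir1:dir2` line, as clang and gcc do on macOS and Linux.

```python
import subprocess
from pathlib import Path


def parse_library_dirs(search_dirs_output):
    """Return the directories listed on the `libraries: =...` line."""
    for line in search_dirs_output.splitlines():
        if line.startswith("libraries:"):
            # Everything after the first "=" is a ":"-separated directory list.
            _, _, value = line.partition("=")
            return [d for d in value.strip().split(":") if d]
    return []


def find_blas_files(directories):
    """Return files whose name contains 'blas' in the directories that exist."""
    hits = []
    for directory in directories:
        path = Path(directory)
        if path.is_dir():
            hits.extend(sorted(p for p in path.iterdir() if "blas" in p.name.lower()))
    return hits


def compiler_library_dirs(cxx="clang++"):
    """Run `<cxx> -print-search-dirs` and parse its library directories."""
    result = subprocess.run(
        [cxx, "-print-search-dirs"], capture_output=True, text=True, check=True
    )
    return parse_library_dirs(result.stdout)


# Example input in the format clang prints (paths are illustrative).
sample = """programs: =/Library/Developer/CommandLineTools/usr/bin
libraries: =/Library/Developer/CommandLineTools/usr/lib/clang/16
"""
print(parse_library_dirs(sample))
```

To inspect a real compiler, pass the path from pytensor.config.cxx to `compiler_library_dirs` and feed the result to `find_blas_files`.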

danieltomasz commented 1 month ago

Hi @lucianopaz, thanks for all the comments! In the case above I installed Python via pyenv (CPython 3.12.7). The result of 1 points to the pyenv shim, /Users/daniel/.pyenv/shims/clang++

❯ /Users/daniel/.pyenv/shims/clang++ --version
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
❯ /Users/daniel/.pyenv/shims/clang++ -print-search-dirs
programs: =/Library/Developer/CommandLineTools/usr/bin
libraries: =/Library/Developer/CommandLineTools/usr/lib/clang/16

Regarding 2 and 3: earlier in this thread I tried installing pytensor via miniconda (also managed via pyenv). It either installed openblas or, when I tried to force accelerate via

~/.pyenv/versions/miniconda3-3.12-24.7.1-0/bin/conda create -n voxel-bayes-3.12  -c conda-forge pytensor-base  "libblas=*=*accelerate" 

accelerate was installed, but the following error happens: https://github.com/pymc-devs/pytensor/issues/1005#issuecomment-2380841854. This also happens with the newer version of miniconda. I checked whether my setup might be the reason, but I was getting similar errors with a pixi conda install.

Regarding 4, in the pyenv installed cpython:

>>> pytensor.link.c.cmodule.try_blas_flag(["-framework", "Accelerate"])
'-framework Accelerate'
>>>

It would be great if someone else on an Apple processor could confirm whether this is peculiar to my setup or something more general. I started to have this problem after updating to MacOS 15; MacOS 15 ships accelerate with blas 3.11, and I wonder if this might be the problem.

lucianopaz commented 1 month ago

That last thing that you tried means that we could add those flags as a check and Mac would link to Accelerate. I'll open a small patch PR so that you can try it out.
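
As a rough illustration of what such a check involves, here is a standard-library sketch. It is not the code in pytensor or in the PR, and `link_command` and `flags_link` are hypothetical names. The idea is to compile a tiny C program that calls a CBLAS routine with the candidate flags and see whether it links and runs.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Tiny C program that references a CBLAS symbol, so linking only succeeds
# when the flags pull in a BLAS implementation.
TEST_SOURCE = """
extern double cblas_ddot(int n, const double *x, int incx,
                         const double *y, int incy);
int main(void) {
    double x[2] = {1.0, 2.0};
    double y[2] = {3.0, 4.0};
    return cblas_ddot(2, x, 1, y, 1) == 11.0 ? 0 : 1;
}
"""


def link_command(compiler, source, output, flags):
    """Build the compiler invocation; the candidate flags go after the source file."""
    return [compiler, str(source), "-o", str(output), *flags]


def flags_link(flags, compiler="cc"):
    """Return True if the test program compiles, links, and runs with `flags`."""
    if shutil.which(compiler) is None:
        return False  # no such compiler on PATH
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "blas_check.c"
        executable = Path(tmp) / "blas_check"
        source.write_text(TEST_SOURCE)
        build = subprocess.run(
            link_command(compiler, source, executable, flags), capture_output=True
        )
        if build.returncode != 0:
            return False
        return subprocess.run([str(executable)], capture_output=True).returncode == 0


print(link_command("cc", "blas_check.c", "blas_check", ["-framework", "Accelerate"]))
```

On a Mac, `flags_link(["-framework", "Accelerate"])` would be the call corresponding to the check discussed here; on other systems a candidate such as `["-lblas"]` plays the same role.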

lucianopaz commented 1 month ago

@danieltomasz, try this PR out. It should set blas__ldflags to the Accelerate framework.

danieltomasz commented 1 month ago

@lucianopaz seems promising

Python 3.12.7 (main, Oct 31 2024, 00:25:36) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> logger = logging.getLogger("pytensor.link.c.cmodule")
>>> logger.setLevel(logging.DEBUG)
>>> import pytensor
DEBUG (pytensor.link.c.cmodule): Will search for BLAS libraries in the following directories:
/Library/Developer/CommandLineTools/usr/lib/clang/16
/Users/daniel/.pyenv/versions/3.12.7/lib
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with intel threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with GNU OpenMP threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking Accelerate framework
INFO (pytensor.link.c.cmodule): g++ -march=native selected lines: ['"/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple arm64-apple-macosx15.0.0 -Wundef-prefix=TARGET_OS_ -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -Werror=implicit-function-declaration -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mframe-pointer=non-leaf -fno-strict-return -ffp-contract=on -fno-rounding-math -funwind-tables=1 -fobjc-msgsend-selector-stubs -target-sdk-version=15.1 -fvisibility-inlines-hidden-static-local-var -fno-modulemap-allow-subdirectory-search -target-cpu apple-m1 -target-feature +neon -target-feature +v8.5a -target-feature +zcm -target-feature +zcz -target-abi darwinpcs -debugger-tuning=lldb -target-linker-version 1115.7.3 -v -fcoverage-compilation-dir=/Users/daniel -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/16 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/usr/lib/clang/16/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -internal-externc-isystem /Library/Developer/CommandLineTools/usr/include -Wno-reorder-init-list -Wno-implicit-int-float-conversion -Wno-c99-designator -Wno-final-dtor-non-final-class -Wno-extra-semi-stmt -Wno-misleading-indentation -Wno-quoted-include-in-framework-header -Wno-implicit-fallthrough -Wno-enum-enum-conversion -Wno-enum-float-conversion -Wno-elaborated-enum-base -Wno-reserved-identifier -Wno-gnu-folding-constant -fdebug-compilation-dir=/Users/daniel -ferror-limit 19 -stack-protector 1 -fstack-check -mdarwin-stkchk-strong-link -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fmax-type-align=16 -fcommon 
-clang-vendor-feature=+disableNonDependentMemberExprInCurrentInstantiation -fno-odr-hash-protocols -clang-vendor-feature=+enableAggressiveVLAFolding -clang-vendor-feature=+revert09abecef7bbf -clang-vendor-feature=+thisNoAlignAttr -clang-vendor-feature=+thisNoNullAttr -clang-vendor-feature=+disableAtImportPrivateFrameworkInImplementationError -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ default lines: ['"/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple arm64-apple-macosx15.0.0 -Wundef-prefix=TARGET_OS_ -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -Werror=implicit-function-declaration -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mframe-pointer=non-leaf -fno-strict-return -ffp-contract=on -fno-rounding-math -funwind-tables=1 -fobjc-msgsend-selector-stubs -target-sdk-version=15.1 -fvisibility-inlines-hidden-static-local-var -fno-modulemap-allow-subdirectory-search -target-cpu apple-m1 -target-feature +v8.5a -target-feature +aes -target-feature +crc -target-feature +dotprod -target-feature +fp-armv8 -target-feature +fp16fml -target-feature +lse -target-feature +ras -target-feature +rcpc -target-feature +rdm -target-feature +sha2 -target-feature +sha3 -target-feature +neon -target-feature +zcm -target-feature +zcz -target-feature +fullfp16 -target-abi darwinpcs -debugger-tuning=lldb -target-linker-version 1115.7.3 -v -fcoverage-compilation-dir=/Users/daniel -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/16 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/usr/lib/clang/16/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -internal-externc-isystem /Library/Developer/CommandLineTools/usr/include -Wno-reorder-init-list -Wno-implicit-int-float-conversion -Wno-c99-designator -Wno-final-dtor-non-final-class -Wno-extra-semi-stmt -Wno-misleading-indentation -Wno-quoted-include-in-framework-header -Wno-implicit-fallthrough -Wno-enum-enum-conversion -Wno-enum-float-conversion -Wno-elaborated-enum-base -Wno-reserved-identifier -Wno-gnu-folding-constant 
-fdebug-compilation-dir=/Users/daniel -ferror-limit 19 -stack-protector 1 -fstack-check -mdarwin-stkchk-strong-link -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fmax-type-align=16 -fcommon -clang-vendor-feature=+disableNonDependentMemberExprInCurrentInstantiation -fno-odr-hash-protocols -clang-vendor-feature=+enableAggressiveVLAFolding -clang-vendor-feature=+revert09abecef7bbf -clang-vendor-feature=+thisNoAlignAttr -clang-vendor-feature=+thisNoNullAttr -clang-vendor-feature=+disableAtImportPrivateFrameworkInImplementationError -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ -march=native equivalent flags: ['-march=apple-m1']

danieltomasz commented 1 month ago

But with the above flags, running

python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

falls back to the error I posted above

       Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)

        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.

        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s

Some PyTensor flags:
    blas__ldflags= -framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib
    compiledir= /Users/daniel/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.7 (main, Oct 31 2024, 00:25:36) [Clang 16.0.0 (clang-1600.0.26.4)]
    sys.prefix= /Users/daniel/.pyenv/versions/3.12.7
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "cc",
      "args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.8",
      "commands": "cython"
    },
    "c++": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "c++",
      "args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "build": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    }
  },
  "Build Dependencies": {
    "blas": {
      "name": "openblas64",
      "found": true,
      "version": "0.3.23.dev",
      "detection method": "pkgconfig",
      "include directory": "/opt/arm64-builds/include",
      "lib directory": "/opt/arm64-builds/lib",
      "openblas configuration": "USE_64BITINT=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS= NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=3",
      "pc file directory": "/usr/local/lib/pkgconfig"
    },
    "lapack": {
      "name": "dep4335021056",
      "found": true,
      "version": "1.26.4",
      "detection method": "internal",
      "include directory": "unknown",
      "lib directory": "unknown",
      "openblas configuration": "unknown",
      "pc file directory": "unknown"
    }
  },
  "Python Information": {
    "path": "/private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cibw-run-q69bfk1p/cp312-macosx_arm64/build/venv/bin/python",
    "version": "3.12"
  },
  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }
}
Numpy dot module: numpy
Numpy location: /Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4
Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 428, in _ldflags
    assert t0 == "-"
           ^^^^^^^^^
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/vm.py", line 1227, in make_all
    node.op.make_thunk(node, storage_map, compute_map, [], impl=impl)
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/op.py", line 119, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/op.py", line 84, in make_c_thunk
    outputs = cl.make_thunk(
              ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1182, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
                                                             ^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1103, in __compile__
    thunk, module = self.cthunk_factory(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1614, in cthunk_factory
    key = self.cmodule_key()
          ^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1266, in cmodule_key
    compile_args=self.compile_args(),
                 ^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 947, in compile_args
    ret += x.c_compile_args(c_compiler=c_compiler)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 496, in c_compile_args
    return ldflags(libs=False, flags=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 359, in ldflags
    return _ldflags(
           ^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 430, in _ldflags
    raise ValueError(f'invalid token "{t}" in ldflags_str: "{ldflags_str}"')
ValueError: invalid token "Accelerate" in ldflags_str: "-framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/misc/check_blas.py", line 274, in <module>
    t, impl = execute(
              ^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/misc/check_blas.py", line 57, in execute
    f = pytensor.function([], updates=[(c, 0.4 * c + 0.8 * dot(a, b))])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/compile/function/__init__.py", line 318, in function
    fn = pfunc(
         ^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/compile/function/pfunc.py", line 465, in pfunc
    return orig_function(
           ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/compile/function/types.py", line 1757, in orig_function
    fn = m.create(defaults)
         ^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/compile/function/types.py", line 1649, in create
    _fn, _i, _o = self.linker.make_thunk(
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/basic.py", line 245, in make_thunk
    return self.make_all(
           ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/vm.py", line 1236, in make_all
    raise_with_op(fgraph, node)
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/utils.py", line 524, in raise_with_op
    raise exc_value.with_traceback(exc_trace)
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/vm.py", line 1227, in make_all
    node.op.make_thunk(node, storage_map, compute_map, [], impl=impl)
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/op.py", line 119, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/op.py", line 84, in make_c_thunk
    outputs = cl.make_thunk(
              ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1182, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
                                                             ^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1103, in __compile__
    thunk, module = self.cthunk_factory(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1614, in cthunk_factory
    key = self.cmodule_key()
          ^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1266, in cmodule_key
    compile_args=self.compile_args(),
                 ^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 947, in compile_args
    ret += x.c_compile_args(c_compiler=c_compiler)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 496, in c_compile_args
    return ldflags(libs=False, flags=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 359, in ldflags
    return _ldflags(
           ^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 430, in _ldflags
    raise ValueError(f'invalid token "{t}" in ldflags_str: "{ldflags_str}"')
ValueError: invalid token "Accelerate" in ldflags_str: "-framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib"
Apply node that caused the error: Gemm{inplace}(<Matrix(float64, shape=(?, ?))>, 0.8, <Matrix(float64, shape=(?, ?))>, <Matrix(float64, shape=(?, ?))>, 0.4)
Toposort index: 0
Inputs types: [TensorType(float64, shape=(None, None)), TensorType(float64, shape=()), TensorType(float64, shape=(None, None)), TensorType(float64, shape=(None, None)), TensorType(float64, shape=())]

HINT: Use a linker other than the C linker to print the inputs' shapes and strides.
HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.

Results of

PYTENSOR_FLAGS='optimizer=None,exception_verbosity=high'  python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")
We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 11.61s on ERROR, unable to tell if PyTensor used the cpu:
[dot(<Matrix(float64, shape=(?, ?))>, <Matrix(float64, shape=(?, ?))>), ExpandDims{axes=[0, 1]}(0.8), Mul(ExpandDims{axes=[0, 1]}.0, dot.0), ExpandDims{axes=[0, 1]}(0.4), Mul(ExpandDims{axes=[0, 1]}.0, <Matrix(float64, shape=(?, ?))>), Add(Mul.0, Mul.0)].

lucianopaz commented 1 month ago

Awesome @danieltomasz! I can reproduce that problem locally now. The latest commit to the PR I had mentioned before should have fixed it. Let me know if it works for you. If it does, I'll try to set up a test on Mac ARM in our CI matrix so that this can be verified.
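
For readers following the traceback, the failure can be reproduced in isolation. The sketch below is not pytensor's `_ldflags` and not the fix from the PR. It only mimics the check visible in the traceback (every whitespace-separated token must start with "-") and shows one possible way to handle flags that take a separate argument. The names `strict_tokens`, `grouped_tokens`, and `FLAGS_WITH_ARGUMENT` are hypothetical.

```python
def strict_tokens(ldflags_str):
    """Mimic the failing check: every token must start with '-'."""
    tokens = ldflags_str.split()
    for t in tokens:
        if not t.startswith("-"):
            raise ValueError(f'invalid token "{t}" in ldflags_str: "{ldflags_str}"')
    return tokens


# Flags whose value is passed as the next token (assumed list, for illustration only).
FLAGS_WITH_ARGUMENT = {"-framework", "-rpath"}


def grouped_tokens(ldflags_str):
    """Group each argument-taking flag with its value instead of rejecting the value."""
    tokens = ldflags_str.split()
    groups = []
    i = 0
    while i < len(tokens):
        t = tokens[i]
        if t in FLAGS_WITH_ARGUMENT:
            if i + 1 >= len(tokens):
                raise ValueError(f'flag "{t}" is missing its argument')
            groups.append((t, tokens[i + 1]))
            i += 2
        elif t.startswith("-"):
            groups.append((t,))
            i += 1
        else:
            raise ValueError(f'invalid token "{t}" in ldflags_str: "{ldflags_str}"')
    return groups


flags = "-framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib"
print(grouped_tokens(flags))
```

With the strict version, the same string raises the ValueError shown in the traceback, because "Accelerate" is the first token that does not start with "-".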

danieltomasz commented 1 month ago

Thanks @lucianopaz, everything seems to work ok now with cpython and pip install!

Python 3.12.7 (main, Oct 31 2024, 00:49:16) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> logger = logging.getLogger("pytensor.link.c.cmodule")
>>> logger.setLevel(logging.DEBUG)
>>> import pytensor
DEBUG (pytensor.link.c.cmodule): Will search for BLAS libraries in the following directories:
/Library/Developer/CommandLineTools/usr/lib/clang/16
/Users/daniel/.pyenv/versions/3.12.7/lib
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with intel threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with GNU OpenMP threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking Accelerate framework
INFO (pytensor.link.c.cmodule): g++ -march=native selected lines: ['"/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple arm64-apple-macosx15.0.0 -Wundef-prefix=TARGET_OS_ -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -Werror=implicit-function-declaration -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mframe-pointer=non-leaf -fno-strict-return -ffp-contract=on -fno-rounding-math -funwind-tables=1 -fobjc-msgsend-selector-stubs -target-sdk-version=15.1 -fvisibility-inlines-hidden-static-local-var -fno-modulemap-allow-subdirectory-search -target-cpu apple-m1 -target-feature +neon -target-feature +v8.5a -target-feature +zcm -target-feature +zcz -target-abi darwinpcs -debugger-tuning=lldb -target-linker-version 1115.7.3 -v -fcoverage-compilation-dir=/Users/daniel/blogspot-downloader -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/16 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/usr/lib/clang/16/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -internal-externc-isystem /Library/Developer/CommandLineTools/usr/include -Wno-reorder-init-list -Wno-implicit-int-float-conversion -Wno-c99-designator -Wno-final-dtor-non-final-class -Wno-extra-semi-stmt -Wno-misleading-indentation -Wno-quoted-include-in-framework-header -Wno-implicit-fallthrough -Wno-enum-enum-conversion -Wno-enum-float-conversion -Wno-elaborated-enum-base -Wno-reserved-identifier -Wno-gnu-folding-constant -fdebug-compilation-dir=/Users/daniel/blogspot-downloader -ferror-limit 19 -stack-protector 1 -fstack-check -mdarwin-stkchk-strong-link -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 
-fmax-type-align=16 -fcommon -clang-vendor-feature=+disableNonDependentMemberExprInCurrentInstantiation -fno-odr-hash-protocols -clang-vendor-feature=+enableAggressiveVLAFolding -clang-vendor-feature=+revert09abecef7bbf -clang-vendor-feature=+thisNoAlignAttr -clang-vendor-feature=+thisNoNullAttr -clang-vendor-feature=+disableAtImportPrivateFrameworkInImplementationError -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ default lines: ['"/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple arm64-apple-macosx15.0.0 -Wundef-prefix=TARGET_OS_ -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -Werror=implicit-function-declaration -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mframe-pointer=non-leaf -fno-strict-return -ffp-contract=on -fno-rounding-math -funwind-tables=1 -fobjc-msgsend-selector-stubs -target-sdk-version=15.1 -fvisibility-inlines-hidden-static-local-var -fno-modulemap-allow-subdirectory-search -target-cpu apple-m1 -target-feature +v8.5a -target-feature +aes -target-feature +crc -target-feature +dotprod -target-feature +fp-armv8 -target-feature +fp16fml -target-feature +lse -target-feature +ras -target-feature +rcpc -target-feature +rdm -target-feature +sha2 -target-feature +sha3 -target-feature +neon -target-feature +zcm -target-feature +zcz -target-feature +fullfp16 -target-abi darwinpcs -debugger-tuning=lldb -target-linker-version 1115.7.3 -v -fcoverage-compilation-dir=/Users/daniel/blogspot-downloader -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/16 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/usr/lib/clang/16/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -internal-externc-isystem /Library/Developer/CommandLineTools/usr/include -Wno-reorder-init-list -Wno-implicit-int-float-conversion -Wno-c99-designator -Wno-final-dtor-non-final-class -Wno-extra-semi-stmt -Wno-misleading-indentation -Wno-quoted-include-in-framework-header -Wno-implicit-fallthrough -Wno-enum-enum-conversion -Wno-enum-float-conversion -Wno-elaborated-enum-base -Wno-reserved-identifier 
-Wno-gnu-folding-constant -fdebug-compilation-dir=/Users/daniel/blogspot-downloader -ferror-limit 19 -stack-protector 1 -fstack-check -mdarwin-stkchk-strong-link -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fmax-type-align=16 -fcommon -clang-vendor-feature=+disableNonDependentMemberExprInCurrentInstantiation -fno-odr-hash-protocols -clang-vendor-feature=+enableAggressiveVLAFolding -clang-vendor-feature=+revert09abecef7bbf -clang-vendor-feature=+thisNoAlignAttr -clang-vendor-feature=+thisNoNullAttr -clang-vendor-feature=+disableAtImportPrivateFrameworkInImplementationError -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ -march=native equivalent flags: ['-march=apple-m1']

and

 python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

        Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)

        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.

        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s

Some PyTensor flags:
    blas__ldflags= -framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib
    compiledir= /Users/daniel/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.7 (main, Oct 31 2024, 00:49:16) [Clang 16.0.0 (clang-1600.0.26.4)]
    sys.prefix= /Users/daniel/.pyenv/versions/3.12.7
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
Build Dependencies:
  blas:
    detection method: system
    found: true
    include directory: unknown
    lib directory: unknown
    name: accelerate
    openblas configuration: unknown
    pc file directory: unknown
    version: unknown
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4405705904
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    args: -I/opt/homebrew/opt/openblas/include
    commands: gcc
    linker: ld64
    linker args: -L/opt/homebrew/opt/openblas/lib, -I/opt/homebrew/opt/openblas/include
    name: clang
    version: 16.0.0
  c++:
    commands: c++
    linker: ld64
    linker args: -L/opt/homebrew/opt/openblas/lib
    name: clang
    version: 16.0.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.11
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  host:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/daniel/.pyenv/versions/3.12.7/bin/python3.12
  version: '3.12'
SIMD Extensions:
  baseline:
  - NEON
  - NEON_FP16
  - NEON_VFPV4
  - ASIMD
  found:
  - ASIMDHP
  not found:
  - ASIMDFHM

Numpy dot module: numpy
Numpy location: /Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 16.22s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
lucianopaz commented 3 weeks ago

@danieltomasz, this should now be fixed with #1056. If you want, you can try to run from the current pytensor main branch and check if it works. I had to do a bunch of extra changes to ensure compilation actually used blas symbols.

danieltomasz commented 3 weeks ago

nice, after running

 python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

flags are different now

blas__ldflags= -framework Accelerate -Wl,-rpath,/Users/daniel/.pyenv/versions/3.12.7/lib

And the running time is shorter (down to around 10-13s from 14-16s)

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 10.02s on CPU (with direct PyTensor binding to blas).
lucianopaz commented 3 weeks ago

Yes, I changed the flags to align them with the other BLAS flag specs that we use. The execution time should be shorter because it's actually linking to Accelerate now. Before, it was failing to do so because of things that were happening downstream.
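
For reference, the new flag spelling can be sketched as below. This is only an illustration based on the `blas__ldflags` value printed above; `accelerate_ldflags` is a hypothetical helper, not a PyTensor function, and the library directory is the one from the reported output.

```python
def accelerate_ldflags(lib_dir: str) -> list[str]:
    """Build linker flags for Apple's Accelerate framework (hypothetical helper)."""
    # "-Wl,<arg>,<arg>" tells the compiler driver to hand the comma-separated
    # arguments to the linker, so "-rpath <lib_dir>" reaches ld itself.
    return ["-framework", "Accelerate", f"-Wl,-rpath,{lib_dir}"]


flags = accelerate_ldflags("/Users/daniel/.pyenv/versions/3.12.7/lib")
print(" ".join(flags))
```

The printed string has the same shape as the `blas__ldflags` value reported after the fix.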

Edderic commented 3 weeks ago

This is great news! Is there a way to have my conda environment use this version of PyTensor? Alternatively, when will the next release be out, so that this code can be installed normally via conda?

ricardoV94 commented 3 weeks ago

@Edderic just did: https://github.com/pymc-devs/pytensor/releases/tag/rel-2.26.0

However, if you are using PyTensor through PyMC, PyMC will also need to bump its PyTensor dependency because of major changes.

aurimas-ww commented 3 weeks ago

I've been following this thread as I recently got an M1 Mac, too, and I'm still not getting Accelerate to work with 2.26.0 :/

Here's the output with logging enabled in a fresh conda environment created with conda create -n pt -c conda-forge pytensor

Python 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 15:57:01) [Clang 17.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> logger = logging.getLogger("pytensor.link.c.cmodule")
>>> logger.setLevel(logging.DEBUG)
>>> import pytensor
DEBUG (pytensor.link.c.cmodule): Will search for BLAS libraries in the following directories:
/Users/aurimas.racas/micromamba/envs/pt/lib/clang/18
/Users/aurimas.racas/micromamba/envs/pt/lib
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with intel threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with GNU OpenMP threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking Accelerate framework
INFO (pytensor.link.c.cmodule): g++ -march=native selected lines: ['"/Users/aurimas.racas/micromamba/envs/pt/bin/clang-18" -cc1 -triple arm64-apple-macosx11.0.0 -Wundef-prefix=TARGET_OS_ -Werror=undef-prefix -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -ffp-contract=on -fno-rounding-math -funwind-tables=1 -target-sdk-version=15.1 -fcompatibility-qualified-id-block-type-checking -fvisibility-inlines-hidden-static-local-var -fbuiltin-headers-in-system-modules -fdefine-target-os-macros -target-cpu apple-m1 -target-feature +zcm -target-feature +zcz -target-feature +v8.5a -target-feature +crc -target-feature +dotprod -target-feature +complxnum -target-feature +fp-armv8 -target-feature +jsconv -target-feature +lse -target-feature +pauth -target-feature +ras -target-feature +rcpc -target-feature +rdm -target-feature +neon -target-abi darwinpcs -debugger-tuning=lldb -fdebug-compilation-dir=/Users/aurimas.racas -target-linker-version 711 -v -fcoverage-compilation-dir=/Users/aurimas.racas -resource-dir /Users/aurimas.racas/micromamba/envs/pt/lib/clang/18 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Users/aurimas.racas/micromamba/envs/pt/lib/clang/18/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -ferror-limit 19 -stack-protector 1 -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fmax-type-align=16 -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ default lines: ['"/Users/aurimas.racas/micromamba/envs/pt/bin/clang-18" -cc1 -triple arm64-apple-macosx11.0.0 -Wundef-prefix=TARGET_OS_ -Werror=undef-prefix -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -ffp-contract=on -fno-rounding-math -funwind-tables=1 -target-sdk-version=15.1 -fcompatibility-qualified-id-block-type-checking -fvisibility-inlines-hidden-static-local-var -fbuiltin-headers-in-system-modules -fdefine-target-os-macros -target-cpu apple-m1 -target-feature +zcm -target-feature +zcz -target-feature +v8.5a -target-feature +aes -target-feature +crc -target-feature +dotprod -target-feature +complxnum -target-feature +fp-armv8 -target-feature +fullfp16 -target-feature +fp16fml -target-feature +jsconv -target-feature +lse -target-feature +pauth -target-feature +ras -target-feature +rcpc -target-feature +rdm -target-feature +sha2 -target-feature +sha3 -target-feature +neon -target-abi darwinpcs -debugger-tuning=lldb -fdebug-compilation-dir=/Users/aurimas.racas -target-linker-version 711 -v -fcoverage-compilation-dir=/Users/aurimas.racas -resource-dir /Users/aurimas.racas/micromamba/envs/pt/lib/clang/18 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Users/aurimas.racas/micromamba/envs/pt/lib/clang/18/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -ferror-limit 19 -stack-protector 1 -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fmax-type-align=16 -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ -march=native equivalent flags: ['-march=apple-m1']
DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-framework', 'Accelerate', '-Wl,-rpath,/Users/aurimas.racas/micromamba/envs/pt/lib']
failed with error message b"clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]\ndyld[16338]: Symbol not found: __ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv\n  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> /Library/Developer/CommandLineTools/usr/bin/ld\n  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> /Users/aurimas.racas/micromamba/envs/pt/lib/libtapi.dylib\nclang++: error: unable to execute command: Abort trap: 6\nclang++: error: linker command failed due to signal (use -v to see invocation)\n"
DEBUG (pytensor.link.c.cmodule): Accelerate framework flag failed 
DEBUG (pytensor.link.c.cmodule): Checking Lapack + blas
DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-L/Users/aurimas.racas/micromamba/envs/pt/lib', '-llapack', '-lblas', '-lcblas', '-lm', '-Wl,-rpath,/Users/aurimas.racas/micromamba/envs/pt/lib']
failed with error message b"clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]\ndyld[16342]: Symbol not found: __ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv\n  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> /Library/Developer/CommandLineTools/usr/bin/ld\n  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> /Users/aurimas.racas/micromamba/envs/pt/lib/libtapi.dylib\nclang++: error: unable to execute command: Abort trap: 6\nclang++: error: linker command failed due to signal (use -v to see invocation)\n"
DEBUG (pytensor.link.c.cmodule): Supplied flags '' failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags ['-L/Users/aurimas.racas/micromamba/envs/pt/lib', '-llapack', '-lblas', '-lcblas', '-lm', '-Wl,-rpath,/Users/aurimas.racas/micromamba/envs/pt/lib'] failed to compile
DEBUG (pytensor.link.c.cmodule): Checking blas alone
DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-L/Users/aurimas.racas/micromamba/envs/pt/lib', '-lblas', '-lcblas', '-Wl,-rpath,/Users/aurimas.racas/micromamba/envs/pt/lib']
failed with error message b"clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]\ndyld[16346]: Symbol not found: __ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv\n  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> /Library/Developer/CommandLineTools/usr/bin/ld\n  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> /Users/aurimas.racas/micromamba/envs/pt/lib/libtapi.dylib\nclang++: error: unable to execute command: Abort trap: 6\nclang++: error: linker command failed due to signal (use -v to see invocation)\n"
DEBUG (pytensor.link.c.cmodule): Supplied flags '' failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags ['-L/Users/aurimas.racas/micromamba/envs/pt/lib', '-lblas', '-lcblas', '-Wl,-rpath,/Users/aurimas.racas/micromamba/envs/pt/lib'] failed to compile
DEBUG (pytensor.link.c.cmodule): Checking openblas
DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-L/Users/aurimas.racas/micromamba/envs/pt/lib', '-lopenblas', '-lgfortran', '-lgomp', '-lm', '-fopenmp', '-Wl,-rpath,/Users/aurimas.racas/micromamba/envs/pt/lib']
failed with error message b"clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]\ndyld[16352]: Symbol not found: __ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv\n  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> /Library/Developer/CommandLineTools/usr/bin/ld\n  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> /Users/aurimas.racas/micromamba/envs/pt/lib/libtapi.dylib\nclang++: error: unable to execute command: Abort trap: 6\nclang++: error: linker command failed due to signal (use -v to see invocation)\n"
DEBUG (pytensor.link.c.cmodule): Supplied flags '' failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags ['-L/Users/aurimas.racas/micromamba/envs/pt/lib', '-lopenblas', '-lgfortran', '-lgomp', '-lm', '-fopenmp', '-Wl,-rpath,/Users/aurimas.racas/micromamba/envs/pt/lib'] failed to compile
DEBUG (pytensor.link.c.cmodule): Failed to identify blas ldflags. Will leave them empty.
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
>>> 

Replicating some of the other tests that @lucianopaz asked above:

Can you check what path to an executable you get as pytensor.config.cxx? Is it the system clang or is it the conda clang?

It seems to be the conda one:

>>> pytensor.config.cxx
'/Users/aurimas.racas/micromamba/envs/pt/bin/clang++'

Can you try to run that cxx executable in a terminal as cxx -print-search-dirs? What directories do you get in the libraries entry? Is the conda env lib path included?

Yes it is.

> /Users/aurimas.racas/micromamba/envs/pt/bin/clang++ -print-search-dirs
programs: =/Users/aurimas.racas/micromamba/envs/pt/bin
libraries: =/Users/aurimas.racas/micromamba/envs/pt/lib/clang/18

Can you verify if there is any file that has the name blas in the conda env lib directory? If there is, what’s the file name extension?

➜  ~ ls /Users/aurimas.racas/micromamba/envs/pt/lib | grep blas    
libblas.3.dylib
libblas.dylib
libcblas.3.dylib
libcblas.dylib
libopenblas.0.dylib
libopenblas.a
libopenblas.dylib
libopenblas_armv8p-r0.3.28.dylib
libopenblas_vortexp-r0.3.28.a
libopenblas_vortexp-r0.3.28.dylib
libopenblasp-r0.3.28.dylib

Can you try to run pytensor.link.c.cmodule.try_blas_flag(["-framework", "Accelerate"]) and see if you get something?

>>> pytensor.link.c.cmodule.try_blas_flag(["-framework", "Accelerate"]) 
DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-framework', 'Accelerate']
failed with error message b"clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]\ndyld[17129]: Symbol not found: __ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv\n  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> /Library/Developer/CommandLineTools/usr/bin/ld\n  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> /Users/aurimas.racas/micromamba/envs/pt/lib/libtapi.dylib\nclang++: error: unable to execute command: Abort trap: 6\nclang++: error: linker command failed due to signal (use -v to see invocation)\n"
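
To make the error message above easier to read, here is a minimal sketch that pulls the missing symbol and the two binaries out of a dyld message. `parse_dyld_error` is a hypothetical helper written for this thread, not part of PyTensor, and the message is copied from the output above with the compiler warning lines omitted.

```python
import re


def parse_dyld_error(stderr: bytes) -> dict:
    """Extract the missing symbol and the two binaries named in a dyld error."""
    text = stderr.decode(errors="replace")
    patterns = {
        "symbol": r"Symbol not found: (\S+)",
        # The optional "<UUID> " prefix before each path is skipped.
        "referenced_from": r"Referenced from: (?:<[^>]+> )?(\S+)",
        "expected_in": r"Expected in:\s+(?:<[^>]+> )?(\S+)",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        result[key] = match.group(1) if match else None
    return result


message = (
    b"dyld[17129]: Symbol not found: "
    b"__ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv\n"
    b"  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> "
    b"/Library/Developer/CommandLineTools/usr/bin/ld\n"
    b"  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> "
    b"/Users/aurimas.racas/micromamba/envs/pt/lib/libtapi.dylib\n"
)
info = parse_dyld_error(message)
print("linker:", info["referenced_from"])
print("library:", info["expected_in"])
```

Read this way, the message says that the Command Line Tools `ld` looked for the symbol in the `libtapi.dylib` that lives in the conda environment.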

If I try the same commands in an environment with pytensor=2.25.5:

Python 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 15:57:01) [Clang 17.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> logger = logging.getLogger("pytensor.link.c.cmodule")
>>> logger.setLevel(logging.DEBUG)
>>> import pytensor
DEBUG (pytensor.link.c.cmodule): Will search for BLAS libraries in the following directories:
/Users/aurimas.racas/micromamba/envs/pt225/lib/clang/17
/Users/aurimas.racas/micromamba/envs/pt225/lib
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with intel threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with GNU OpenMP threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking Lapack + blas
INFO (pytensor.link.c.cmodule): g++ -march=native selected lines: ['"/Users/aurimas.racas/micromamba/envs/pt225/bin/clang-17" -cc1 -triple arm64-apple-macosx11.0.0 -Wundef-prefix=TARGET_OS_ -Werror=undef-prefix -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -ffp-contract=on -fno-rounding-math -funwind-tables=1 -target-sdk-version=15.1 -fcompatibility-qualified-id-block-type-checking -fvisibility-inlines-hidden-static-local-var -target-cpu apple-m1 -target-feature +neon -target-feature +v8.5a -target-feature +zcm -target-feature +zcz -target-abi darwinpcs -debugger-tuning=lldb -target-linker-version 711 -v -fcoverage-compilation-dir=/Users/aurimas.racas -resource-dir /Users/aurimas.racas/micromamba/envs/pt225/lib/clang/17 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Users/aurimas.racas/micromamba/envs/pt225/lib/clang/17/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -fdebug-compilation-dir=/Users/aurimas.racas -ferror-limit 19 -stack-protector 1 -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fmax-type-align=16 -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ default lines: ['"/Users/aurimas.racas/micromamba/envs/pt225/bin/clang-17" -cc1 -triple arm64-apple-macosx11.0.0 -Wundef-prefix=TARGET_OS_ -Werror=undef-prefix -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -ffp-contract=on -fno-rounding-math -funwind-tables=1 -target-sdk-version=15.1 -fcompatibility-qualified-id-block-type-checking -fvisibility-inlines-hidden-static-local-var -target-cpu apple-m1 -target-feature +v8.5a -target-feature +aes -target-feature +crc -target-feature +dotprod -target-feature +fp-armv8 -target-feature +fp16fml -target-feature +lse -target-feature +ras -target-feature +rcpc -target-feature +rdm -target-feature +sha2 -target-feature +sha3 -target-feature +neon -target-feature +zcm -target-feature +zcz -target-feature +fullfp16 -target-abi darwinpcs -debugger-tuning=lldb -target-linker-version 711 -v -fcoverage-compilation-dir=/Users/aurimas.racas -resource-dir /Users/aurimas.racas/micromamba/envs/pt225/lib/clang/17 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Users/aurimas.racas/micromamba/envs/pt225/lib/clang/17/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -fdebug-compilation-dir=/Users/aurimas.racas -ferror-limit 19 -stack-protector 1 -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fmax-type-align=16 -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ -march=native equivalent flags: ['-march=apple-m1']
DEBUG (pytensor.link.c.cmodule): Supplied flags  failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags '' failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags ['-L/Users/aurimas.racas/micromamba/envs/pt225/lib', '-llapack', '-lblas', '-lcblas', '-lm', '-Wl,-rpath,/Users/aurimas.racas/micromamba/envs/pt225/lib'] failed to compile
DEBUG (pytensor.link.c.cmodule): Checking blas alone
DEBUG (pytensor.link.c.cmodule): Supplied flags  failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags '' failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags ['-L/Users/aurimas.racas/micromamba/envs/pt225/lib', '-lblas', '-lcblas', '-Wl,-rpath,/Users/aurimas.racas/micromamba/envs/pt225/lib'] failed to compile
DEBUG (pytensor.link.c.cmodule): Checking openblas
DEBUG (pytensor.link.c.cmodule): Supplied flags  failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags '' failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags ['-L/Users/aurimas.racas/micromamba/envs/pt225/lib', '-lopenblas', '-lgfortran', '-lgomp', '-lm', '-fopenmp', '-Wl,-rpath,/Users/aurimas.racas/micromamba/envs/pt225/lib'] failed to compile
DEBUG (pytensor.link.c.cmodule): Failed to identify blas ldflags. Will leave them empty.
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
>>> 
>>> pytensor.config.cxx
'/Users/aurimas.racas/micromamba/envs/pt225/bin/clang++'
➜  ~ /Users/aurimas.racas/micromamba/envs/pt225/bin/clang++ -print-search-dirs
programs: =/Users/aurimas.racas/micromamba/envs/pt225/bin
libraries: =/Users/aurimas.racas/micromamba/envs/pt225/lib/clang/17
➜  ~ ls /Users/aurimas.racas/micromamba/envs/pt225/lib | grep blas            
libblas.3.dylib
libblas.dylib
libcblas.3.dylib
libcblas.dylib
libopenblas.0.dylib
libopenblas.a
libopenblas.dylib
libopenblas_armv8p-r0.3.28.dylib
libopenblas_vortexp-r0.3.28.a
libopenblas_vortexp-r0.3.28.dylib
libopenblasp-r0.3.28.dylib
>>> pytensor.link.c.cmodule.try_blas_flag(["-framework", "Accelerate"]) 
''

From what I can see, pytensor=2.26.0 has clang18 installed in the environment, and pytensor=2.25.5 has clang17. Perhaps that's the issue?
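
A quick way to compare the compiler versions in two environments could look like the sketch below. It assumes the compiler prints a line containing "clang version N" when called with `--version`; `parse_clang_major` and `compiler_major` are hypothetical helpers, and the sample strings in the usage lines are illustrative rather than captured output.

```python
import re
import shutil
import subprocess


def parse_clang_major(version_output: str):
    """Return the major version from `clang --version` text, or None if absent."""
    match = re.search(r"clang version (\d+)", version_output)
    return int(match.group(1)) if match else None


def compiler_major(cxx: str):
    """Run `<cxx> --version` and return its major version, or None if not found."""
    path = shutil.which(cxx)
    if path is None:
        return None
    out = subprocess.run([path, "--version"], capture_output=True, text=True).stdout
    return parse_clang_major(out)


# Illustrative inputs only; pass pytensor.config.cxx to compiler_major() for a real check.
print(parse_clang_major("clang version 18.1.8"))
print(parse_clang_major("Apple clang version 16.0.0 (clang-1600.0.26.4)"))
```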

areding commented 3 weeks ago

This didn't work for me on a fresh Miniforge install on MacOS 15.1. My pytensor logging output is below.

I created my environment using:

mamba create -c conda-forge -c nodefaults -n pymc_macos15 pytensor

The pytensor import logs:

DEBUG (pytensor.link.c.cmodule): Will search for BLAS libraries in the following directories:
/Users/aaron/miniforge3/envs/pytensor_test/lib/clang/18
/Users/aaron/miniforge3/envs/pytensor_test/lib
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with intel threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with GNU OpenMP threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking Accelerate framework
INFO (pytensor.link.c.cmodule): g++ -march=native selected lines: ['"/Users/aaron/miniforge3/envs/pytensor_test/bin/clang-18" -cc1 -triple arm64-apple-macosx11.0.0 -Wundef-prefix=TARGET_OS_ -Werror=undef-prefix -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -ffp-contract=on -fno-rounding-math -funwind-tables=1 -target-sdk-version=15.1 -fcompatibility-qualified-id-block-type-checking -fvisibility-inlines-hidden-static-local-var -fbuiltin-headers-in-system-modules -fdefine-target-os-macros -target-cpu apple-m1 -target-feature +zcm -target-feature +zcz -target-feature +v8.5a -target-feature +crc -target-feature +dotprod -target-feature +complxnum -target-feature +fp-armv8 -target-feature +jsconv -target-feature +lse -target-feature +pauth -target-feature +ras -target-feature +rcpc -target-feature +rdm -target-feature +neon -target-abi darwinpcs -debugger-tuning=lldb -fdebug-compilation-dir=/Users/aaron -target-linker-version 711 -v -fcoverage-compilation-dir=/Users/aaron -resource-dir /Users/aaron/miniforge3/envs/pytensor_test/lib/clang/18 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Users/aaron/miniforge3/envs/pytensor_test/lib/clang/18/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -ferror-limit 19 -stack-protector 1 -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fmax-type-align=16 -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ default lines: ['"/Users/aaron/miniforge3/envs/pytensor_test/bin/clang-18" -cc1 -triple arm64-apple-macosx11.0.0 -Wundef-prefix=TARGET_OS_ -Werror=undef-prefix -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -ffp-contract=on -fno-rounding-math -funwind-tables=1 -target-sdk-version=15.1 -fcompatibility-qualified-id-block-type-checking -fvisibility-inlines-hidden-static-local-var -fbuiltin-headers-in-system-modules -fdefine-target-os-macros -target-cpu apple-m1 -target-feature +zcm -target-feature +zcz -target-feature +v8.5a -target-feature +aes -target-feature +crc -target-feature +dotprod -target-feature +complxnum -target-feature +fp-armv8 -target-feature +fullfp16 -target-feature +fp16fml -target-feature +jsconv -target-feature +lse -target-feature +pauth -target-feature +ras -target-feature +rcpc -target-feature +rdm -target-feature +sha2 -target-feature +sha3 -target-feature +neon -target-abi darwinpcs -debugger-tuning=lldb -fdebug-compilation-dir=/Users/aaron -target-linker-version 711 -v -fcoverage-compilation-dir=/Users/aaron -resource-dir /Users/aaron/miniforge3/envs/pytensor_test/lib/clang/18 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Users/aaron/miniforge3/envs/pytensor_test/lib/clang/18/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -ferror-limit 19 -stack-protector 1 -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fmax-type-align=16 -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ -march=native equivalent flags: ['-march=apple-m1']
DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-framework', 'Accelerate', '-Wl,-rpath,/Users/aaron/miniforge3/envs/pytensor_test/lib']
failed with error message b"clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]\ndyld[62467]: Symbol not found: __ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv\n  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> /Library/Developer/CommandLineTools/usr/bin/ld\n  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> /Users/aaron/miniforge3/envs/pytensor_test/lib/libtapi.dylib\nclang++: error: unable to execute command: Abort trap: 6\nclang++: error: linker command failed due to signal (use -v to see invocation)\n"
DEBUG (pytensor.link.c.cmodule): Accelerate framework flag failed
DEBUG (pytensor.link.c.cmodule): Checking Lapack + blas
DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-L/Users/aaron/miniforge3/envs/pytensor_test/lib', '-llapack', '-lblas', '-lcblas', '-lm', '-Wl,-rpath,/Users/aaron/miniforge3/envs/pytensor_test/lib']
failed with error message b"clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]\ndyld[62470]: Symbol not found: __ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv\n  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> /Library/Developer/CommandLineTools/usr/bin/ld\n  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> /Users/aaron/miniforge3/envs/pytensor_test/lib/libtapi.dylib\nclang++: error: unable to execute command: Abort trap: 6\nclang++: error: linker command failed due to signal (use -v to see invocation)\n"
DEBUG (pytensor.link.c.cmodule): Supplied flags '' failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags ['-L/Users/aaron/miniforge3/envs/pytensor_test/lib', '-llapack', '-lblas', '-lcblas', '-lm', '-Wl,-rpath,/Users/aaron/miniforge3/envs/pytensor_test/lib'] failed to compile
DEBUG (pytensor.link.c.cmodule): Checking blas alone
DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-L/Users/aaron/miniforge3/envs/pytensor_test/lib', '-lblas', '-lcblas', '-Wl,-rpath,/Users/aaron/miniforge3/envs/pytensor_test/lib']
failed with error message b"clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]\ndyld[62473]: Symbol not found: __ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv\n  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> /Library/Developer/CommandLineTools/usr/bin/ld\n  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> /Users/aaron/miniforge3/envs/pytensor_test/lib/libtapi.dylib\nclang++: error: unable to execute command: Abort trap: 6\nclang++: error: linker command failed due to signal (use -v to see invocation)\n"
DEBUG (pytensor.link.c.cmodule): Supplied flags '' failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags ['-L/Users/aaron/miniforge3/envs/pytensor_test/lib', '-lblas', '-lcblas', '-Wl,-rpath,/Users/aaron/miniforge3/envs/pytensor_test/lib'] failed to compile
DEBUG (pytensor.link.c.cmodule): Checking openblas
DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-L/Users/aaron/miniforge3/envs/pytensor_test/lib', '-lopenblas', '-lgfortran', '-lgomp', '-lm', '-fopenmp', '-Wl,-rpath,/Users/aaron/miniforge3/envs/pytensor_test/lib']
failed with error message b"clang++: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]\ndyld[62476]: Symbol not found: __ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv\n  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> /Library/Developer/CommandLineTools/usr/bin/ld\n  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> /Users/aaron/miniforge3/envs/pytensor_test/lib/libtapi.dylib\nclang++: error: unable to execute command: Abort trap: 6\nclang++: error: linker command failed due to signal (use -v to see invocation)\n"
DEBUG (pytensor.link.c.cmodule): Supplied flags '' failed to compile
DEBUG (pytensor.link.c.cmodule): Supplied flags ['-L/Users/aaron/miniforge3/envs/pytensor_test/lib', '-lopenblas', '-lgfortran', '-lgomp', '-lm', '-fopenmp', '-Wl,-rpath,/Users/aaron/miniforge3/envs/pytensor_test/lib'] failed to compile
DEBUG (pytensor.link.c.cmodule): Failed to identify blas ldflags. Will leave them empty.
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

Running

python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

outputs:

WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

...

Some PyTensor flags:
    blas__ldflags=
    compiledir= /Users/aaron/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 15:57:01) [Clang 17.0.6 ]
    sys.prefix= /Users/aaron/miniforge3/envs/pymc_macos15
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None
Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /Users/aaron/miniforge3/envs/pymc_macos15/include
    lib directory: /Users/aaron/miniforge3/envs/pymc_macos15/lib
    name: blas
    openblas configuration: unknown
    pc file directory: /Users/aaron/miniforge3/envs/pymc_macos15/lib/pkgconfig
    version: 3.9.0
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4569863840
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    args: -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem,
      /Users/aaron/miniforge3/envs/pymc_macos15/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/aaron/miniforge3/envs/pymc_macos15=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/aaron/miniforge3/envs/pymc_macos15/include,
      -mmacosx-version-min=11.0
    commands: arm64-apple-darwin20.0.0-clang
    linker: ld64
    linker args: -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/aaron/miniforge3/envs/pymc_macos15/lib,
      -L/Users/aaron/miniforge3/envs/pymc_macos15/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong,
      -O2, -pipe, -isystem, /Users/aaron/miniforge3/envs/pymc_macos15/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/aaron/miniforge3/envs/pymc_macos15=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/aaron/miniforge3/envs/pymc_macos15/include,
      -mmacosx-version-min=11.0
    name: clang
    version: 16.0.6
  c++:
    args: -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++,
      -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/aaron/miniforge3/envs/pymc_macos15/include,
      -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/aaron/miniforge3/envs/pymc_macos15=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/aaron/miniforge3/envs/pymc_macos15/include,
      -mmacosx-version-min=11.0
    commands: arm64-apple-darwin20.0.0-clang++
    linker: ld64
    linker args: -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/aaron/miniforge3/envs/pymc_macos15/lib,
      -L/Users/aaron/miniforge3/envs/pymc_macos15/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong,
      -O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0,
      -isystem, /Users/aaron/miniforge3/envs/pymc_macos15/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/aaron/miniforge3/envs/pymc_macos15=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/aaron/miniforge3/envs/pymc_macos15/include,
      -mmacosx-version-min=11.0
    name: clang
    version: 16.0.6
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.8
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  cross-compiled: true
  host:
    cpu: arm64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/aaron/miniforge3/envs/pymc_macos15/bin/python
  version: '3.12'
SIMD Extensions:
  baseline:
  - NEON
  - NEON_FP16
  - NEON_VFPV4
  - ASIMD
  found:
  - ASIMDHP
  not found:
  - ASIMDFHM

Numpy dot module: numpy
Numpy location: /Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

You can find the C code in this temporary file: /var/folders/b8/7kxz8b7579n0gp9kb3wmcf880000gn/T/pytensor_compilation_error_dsrc8_jd
ERROR (pytensor.graph.rewriting.basic): Rewrite failure due to: constant_folding
ERROR (pytensor.graph.rewriting.basic): node: ExpandDims{axes=[0, 1]}(0.8)
ERROR (pytensor.graph.rewriting.basic): TRACEBACK:
ERROR (pytensor.graph.rewriting.basic): Traceback (most recent call last):
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/graph/rewriting/basic.py", line 1909, in process_node
    replacements = node_rewriter.transform(fgraph, node)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/graph/rewriting/basic.py", line 1081, in transform
    return self.fn(fgraph, node)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/tensor/rewriting/basic.py", line 1117, in constant_folding
    thunk = node.op.make_thunk(node, storage_map, compute_map, no_recycling=[])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/link/c/op.py", line 119, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/link/c/op.py", line 84, in make_c_thunk
    outputs = cl.make_thunk(
              ^^^^^^^^^^^^^^
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1182, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
                                                             ^^^^^^^^^^^^^^^^^
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1103, in __compile__
    thunk, module = self.cthunk_factory(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1627, in cthunk_factory
    module = cache.module_from_key(key=key, lnk=self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/link/c/cmodule.py", line 1255, in module_from_key
    module = lnk.compile_cmodule(location)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1528, in compile_cmodule
    module = c_compiler.compile_str(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/link/c/cmodule.py", line 2677, in compile_str
    raise CompileError(
pytensor.link.c.exceptions.CompileError: Compilation failed (return status=1):
/Users/aaron/miniforge3/envs/pymc_macos15/bin/clang++ -dynamiclib -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -fPIC -undefined dynamic_lookup -I/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/numpy/core/include -I/Users/aaron/miniforge3/envs/pymc_macos15/include/python3.12 -I/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/pytensor/link/c/c_code -L/Users/aaron/miniforge3/envs/pymc_macos15/lib -fvisibility=hidden -o /Users/aaron/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64/tmpkj3rjfi4/mb782a9925f26f74c46a75d98e1484e89ff6c5c482e4b63d738d2bb93e667f8f6.so /Users/aaron/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64/tmpkj3rjfi4/mod.cpp
dyld[54048]: Symbol not found: __ZNK4tapi2v119LinkerInterfaceFile28getPlatformsAndMinDeploymentEv
  Referenced from: <16BFD524-5ED0-3DE1-B23F-84EE5092744F> /Library/Developer/CommandLineTools/usr/bin/ld
  Expected in:     <15C501C6-0EF4-3E32-9C14-04EC4CD23D35> /Users/aaron/miniforge3/envs/pymc_macos15/lib/libtapi.dylib
clang++: error: unable to execute command: Abort trap: 6
clang++: error: linker command failed due to signal (use -v to see invocation)

I then ran:

mamba install "libblas=*=*accelerate"

Same result when trying the above. I tried a few combinations and orders of this, including installing pytensor-base first.

Finally I gave up and created a fresh environment with only python=3.12, then used pip to install pytensor. Now the check_blas.py output is:

Some PyTensor flags:
    blas__ldflags= -framework Accelerate -Wl,-rpath,/Users/aaron/miniforge3/envs/pymc_macos15/lib
    compiledir= /Users/aaron/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 15:57:01) [Clang 17.0.6 ]
    sys.prefix= /Users/aaron/miniforge3/envs/pymc_macos15
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
/Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "cc",
      "args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.8",
      "commands": "cython"
    },
    "c++": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "c++",
      "args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "build": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    }
  },
  "Build Dependencies": {
    "blas": {
      "name": "openblas64",
      "found": true,
      "version": "0.3.23.dev",
      "detection method": "pkgconfig",
      "include directory": "/opt/arm64-builds/include",
      "lib directory": "/opt/arm64-builds/lib",
      "openblas configuration": "USE_64BITINT=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS= NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=3",
      "pc file directory": "/usr/local/lib/pkgconfig"
    },
    "lapack": {
      "name": "dep4335021056",
      "found": true,
      "version": "1.26.4",
      "detection method": "internal",
      "include directory": "unknown",
      "lib directory": "unknown",
      "openblas configuration": "unknown",
      "pc file directory": "unknown"
    }
  },
  "Python Information": {
    "path": "/private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cibw-run-q69bfk1p/cp312-macosx_arm64/build/venv/bin/python",
    "version": "3.12"
  },
  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }
}
Numpy dot module: numpy
Numpy location: /Users/aaron/miniforge3/envs/pymc_macos15/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 3.68s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
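
The two `check_blas.py` reports above differ mainly in the `blas__ldflags` line: it is empty in the failing conda environment and set to the Accelerate flags in the pip one. A minimal sketch for pulling that line out of a saved report follows. The helper name `blas_ldflags` and the sample strings are made up for illustration and are not part of PyTensor.

```python
import re


def blas_ldflags(report: str) -> str:
    """Return the value of the blas__ldflags line in a check_blas.py report.

    Returns an empty string when the line is empty or missing.
    """
    match = re.search(r"^\s*blas__ldflags=[ \t]*(.*)$", report, flags=re.MULTILINE)
    return match.group(1).strip() if match else ""


# Shortened, illustrative excerpts in the layout printed by check_blas.py.
failing = "Some PyTensor flags:\n    blas__ldflags=\n    floatX= float64\n"
working = "Some PyTensor flags:\n    blas__ldflags= -framework Accelerate\n    floatX= float64\n"

for name, report in (("failing", failing), ("working", working)):
    flags = blas_ldflags(report)
    print(name, "->", flags if flags else "empty (NumPy C-API fallback)")
```

In this thread an empty value goes together with the "Using NumPy C-API based implementation for BLAS functions" warning.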
lucianopaz commented 3 weeks ago

Thanks @aurimas-ww and @areding. I also started running into a similar problem in a different environment. I think I know the cause: it’s related to the linker that Apple shipped with Xcode 15.

Both of your logs say that a symbol wasn’t found in the tapi library, but the following lines show that the linker itself was killed by a signal while going through that dylib. On my machine I managed to see a segfault with signal 11. Googling around, I found this thread that might have the solution we need. The explanatory post says that Xcode 15 introduced a new linker, which seems to dislike the layout of some dylibs and dies. Luckily, we can ask Xcode’s linker to behave like the old one by passing an extra flag to the compiler. I haven’t gotten around to implementing this in pytensor yet, but you could try setting the configuration option for extra compile flags to -ld64 or -Wl,-ld64 (I’m not sure yet whether it’s a linker flag or a compiler flag). Maybe next week I’ll be able to sit down and test this out properly.
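
One way to try the suggestion without patching PyTensor is sketched below. It assumes that the "configuration flag for the extra compile flags" is the `gcc__cxxflags` option and that it is set through the `PYTENSOR_FLAGS` environment variable. The helper `add_cxxflag` is hypothetical, and whether `-ld64` is accepted in this form is exactly the open question in the comment above.

```python
import os


def add_cxxflag(pytensor_flags: str, cxxflag: str) -> str:
    """Return a PYTENSOR_FLAGS string with gcc__cxxflags=<cxxflag> appended.

    PYTENSOR_FLAGS is a comma-separated list of key=value pairs, so a flag
    that itself contains a comma (such as -Wl,-ld64) is rejected here and
    would need another route, for example a .pytensorrc file.
    """
    if "," in cxxflag:
        raise ValueError("flag contains a comma; it cannot be passed via PYTENSOR_FLAGS")
    entries = [entry for entry in pytensor_flags.split(",") if entry]
    entries.append(f"gcc__cxxflags={cxxflag}")
    return ",".join(entries)


# This must run before `import pytensor`, because the variable is read at import time.
os.environ["PYTENSOR_FLAGS"] = add_cxxflag(os.environ.get("PYTENSOR_FLAGS", ""), "-ld64")
print(os.environ["PYTENSOR_FLAGS"])
```

The sketch does not check whether `gcc__cxxflags` is already present in the existing value, so a previous setting would end up listed twice.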

aurimas-ww commented 3 weeks ago

Not sure if that's the right way to do it, but adding these flags (either just ld64 or both) to try_blas_flag gives a different error message:

>>> pytensor.link.c.cmodule.try_blas_flag(["-Wl", "-ld64", "-framework", "Accelerate"]) 
DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-Wl', '-ld64', '-framework', 'Accelerate']
failed with error message b"dyld[23414]: Library not loaded: @rpath/libc++.1.dylib\n  Referenced from: <DE0B8C7D-A117-3BDE-8DE9-252F4D6C9054> /private/var/folders/hs/rg3ptf7571n92r08b54rz7g40000gq/T/try_blas_y1v18bkk\n  Reason: no LC_RPATH's found\n"
lucianopaz commented 3 weeks ago

> Not sure if that's the right way to do it, but adding these flags (either just ld64 or both) to try_blas_flag gives a different error message:
>
> >>> pytensor.link.c.cmodule.try_blas_flag(["-Wl", "-ld64", "-framework", "Accelerate"]) 
> DEBUG (pytensor.link.c.cmodule): try_blas_flags of flags: ['-Wl', '-ld64', '-framework', 'Accelerate']
> failed with error message b"dyld[23414]: Library not loaded: @rpath/libc++.1.dylib\n  Referenced from: <DE0B8C7D-A117-3BDE-8DE9-252F4D6C9054> /private/var/folders/hs/rg3ptf7571n92r08b54rz7g40000gq/T/try_blas_y1v18bkk\n  Reason: no LC_RPATH's found\n"

That’s because you left the rpath out of the compilation flags. You don’t need to drop the existing flags; just add the new one to them.
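
A minimal sketch of what keeping the rpath looks like: start from the flag list that PyTensor logged for the Accelerate check earlier in the thread and put the extra flag in front of it. The helper `accelerate_flags` is hypothetical, and `-ld64` is the candidate flag from the comment above, not a confirmed fix.

```python
def accelerate_flags(lib_dir: str, extra: tuple[str, ...] = ("-ld64",)) -> list[str]:
    """Build flags for the Accelerate check, keeping the rpath entry.

    `extra` holds the additional flag(s) under test; the remaining entries
    mirror the list PyTensor logged for its own Accelerate attempt.
    """
    return [*extra, "-framework", "Accelerate", f"-Wl,-rpath,{lib_dir}"]


flags = accelerate_flags("/Users/aaron/miniforge3/envs/pytensor_test/lib")
print(flags)

# With PyTensor installed, the list could be tried with the same call used above:
#   import pytensor.link.c.cmodule
#   pytensor.link.c.cmodule.try_blas_flag(flags)
```

The library directory is the one from the logs above and should be replaced with the `lib` directory of the active environment.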

lucianopaz commented 2 weeks ago

@aurimas-ww and @areding, we just merged #1083. Could you please install pytensor from the current state of the main branch and check whether your issues go away? We can't confirm the fix from the CI runs alone, because we haven't found a snippet that fails systematically, but the people we've asked so far told us that their setups are working now.

areding commented 2 weeks ago

That did it. Thank you!

Some PyTensor flags:
    blas__ldflags= -framework Accelerate -Wl,-rpath,/Users/aaron/miniforge3/envs/pytensor_test/lib
    compiledir= /Users/aaron/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 15:57:01) [Clang 17.0.6 ]
    sys.prefix= /Users/aaron/miniforge3/envs/pytensor_test
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
/Users/aaron/miniforge3/envs/pytensor_test/lib/python3.12/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "cc",
      "args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.8",
      "commands": "cython"
    },
    "c++": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "c++",
      "args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "build": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    }
  },
  "Build Dependencies": {
    "blas": {
      "name": "openblas64",
      "found": true,
      "version": "0.3.23.dev",
      "detection method": "pkgconfig",
      "include directory": "/opt/arm64-builds/include",
      "lib directory": "/opt/arm64-builds/lib",
      "openblas configuration": "USE_64BITINT=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS= NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=3",
      "pc file directory": "/usr/local/lib/pkgconfig"
    },
    "lapack": {
      "name": "dep4335021056",
      "found": true,
      "version": "1.26.4",
      "detection method": "internal",
      "include directory": "unknown",
      "lib directory": "unknown",
      "openblas configuration": "unknown",
      "pc file directory": "unknown"
    }
  },
  "Python Information": {
    "path": "/private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cibw-run-q69bfk1p/cp312-macosx_arm64/build/venv/bin/python",
    "version": "3.12"
  },
  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }
}
Numpy dot module: numpy
Numpy location: /Users/aaron/miniforge3/envs/pytensor_test/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 3.65s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
aurimas-ww commented 2 weeks ago

Same here - I can confirm it works on my machine, too. Thanks!!

scroobiustrip commented 2 weeks ago

Hey, can you guys post the steps you used to get a successful installation? I'm able to replicate each of the steps outlined by @aurimas-ww, but when pip installing pytensor main branch from github, I still get the same results 😞.

aurimas-ww commented 2 weeks ago

@scroobiustrip - I just tried it in a fresh virtual environment with a simple (uv) pip install, and it worked:

mkdir foo
cd foo
uv venv
source .venv/bin/activate
uv pip install git+https://github.com/pymc-devs/pytensor.git
lucianopaz commented 2 weeks ago

@ricardoV94, we should make a release with the patch. Just a bug fix release. That way everyone can just use conda or pip as usual

ricardoV94 commented 2 weeks ago

> @ricardoV94, we should make a release with the patch. Just a bug fix release. That way everyone can just use conda or pip as usual

@lucianopaz already did and also the latest PyMC is linked to it

lucianopaz commented 2 weeks ago

> > @ricardoV94, we should make a release with the patch. Just a bug fix release. That way everyone can just use conda or pip as usual
>
> @lucianopaz already did and also the latest PyMC is linked to it

I don't see #1083 commits included in the latest release. Those were the ones that made the last errors go away.

ricardoV94 commented 2 weeks ago

@lucianopaz my bad. I'll release now. Since it's not a major release, it should automatically be compatible with PyMC.

ricardoV94 commented 2 weeks ago

@lucianopaz can you edit the PR title to be more informative than "LD64"?

lucianopaz commented 2 weeks ago

> @lucianopaz can you edit the PR title to be more informative than "LD64"?

Done

ricardoV94 commented 2 weeks ago

Patch is in https://github.com/pymc-devs/pytensor/releases/tag/rel-2.26.3

Should be available in the common channels soon
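
Once the release reaches a given channel, a quick check like the sketch below can tell whether the installed package is new enough. It assumes that the `rel-2.26.3` tag corresponds to package version 2.26.3. The helpers `release_tuple` and `has_patch` are illustrative names, and pre-release suffixes are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(ver: str) -> tuple[int, ...]:
    """Leading numeric components of a version string, e.g. '2.26.3' -> (2, 26, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", ver)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def has_patch(ver: str, minimum: tuple[int, ...] = (2, 26, 3)) -> bool:
    """True if `ver` is at least the release that carries the patch (assumed 2.26.3)."""
    return release_tuple(ver) >= minimum


try:
    installed = version("pytensor")
    status = "includes the patch" if has_patch(installed) else "predates the patch"
    print(f"pytensor {installed} {status}")
except PackageNotFoundError:
    print("pytensor is not installed in this environment")
```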