microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Build/Install issues on x86_64 Linux #1400

Open dushankw opened 2 weeks ago

dushankw commented 2 weeks ago

Describe the bug

Presidio is failing to build/install against Python 3.11 (officially supported per docs) and 3.12 on x86_64 Linux.

I tried both spaCy and Stanza as per https://microsoft.github.io/presidio/installation/ and always hit the following error, which looks like a version incompatibility between numpy and something else (probably a compiled library lower down in the import graph).

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

I have replicated the same issue in a clean container using the official upstream Python image.

Thank you for looking into it :pray:

To Reproduce

  1. Create the following Dockerfile

    $ cat Dockerfile 
    FROM docker.io/library/python:3.11.9
    RUN pip install presidio_analyzer && pip install presidio_anonymizer && python -m spacy download en_core_web_lg
  2. Build it

    $ podman build .
    STEP 1/2: FROM docker.io/library/python:3.11.9
    STEP 2/2: RUN pip install presidio_analyzer && pip install presidio_anonymizer && python -m spacy download en_core_web_lg
    Collecting presidio_analyzer
    Downloading presidio_analyzer-2.2.354-py3-none-any.whl.metadata (2.6 kB)
    Collecting spacy<4.0.0,>=3.4.4 (from presidio_analyzer)
    Downloading spacy-3.7.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
    <CUT_FOR_BREVITY>
  3. Observe the error towards the end of the build (NOTE: the warning about running as root and the venv is noise as this is in a container)

    Installing collected packages: pycryptodome, presidio_anonymizer
    Successfully installed presidio_anonymizer-2.2.354 pycryptodome-3.20.0
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
    Traceback (most recent call last):
    File "<frozen runpy>", line 189, in _run_module_as_main
    File "<frozen runpy>", line 148, in _get_module_details
    File "<frozen runpy>", line 112, in _get_module_details
    File "/usr/local/lib/python3.11/site-packages/spacy/__init__.py", line 6, in <module>
    from .errors import setup_default_warnings
    File "/usr/local/lib/python3.11/site-packages/spacy/errors.py", line 3, in <module>
    from .compat import Literal
    File "/usr/local/lib/python3.11/site-packages/spacy/compat.py", line 39, in <module>
    from thinc.api import Optimizer  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/thinc/api.py", line 1, in <module>
    from .backends import (
    File "/usr/local/lib/python3.11/site-packages/thinc/backends/__init__.py", line 17, in <module>
    from .cupy_ops import CupyOps
    File "/usr/local/lib/python3.11/site-packages/thinc/backends/cupy_ops.py", line 16, in <module>
    from .numpy_ops import NumpyOps
    File "thinc/backends/numpy_ops.pyx", line 1, in init thinc.backends.numpy_ops
    ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
    Error: building at STEP "RUN pip install presidio_analyzer && pip install presidio_anonymizer && python -m spacy download en_core_web_lg": while running runtime: exit status 1

Note: Using Stanza instead of spaCy, the container builds successfully (the libraries install), but the same error appears as soon as the library is used, e.g.:

from presidio_analyzer import AnalyzerEngine

This import triggers the same error.
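A minimal sketch of how this failure could be detected and reported in one line instead of a raw traceback. The helper name `check_binary_compat` and its message format are hypothetical and not part of Presidio or spaCy; the only assumption is that the failing import raises `ValueError` with the "binary incompatibility" text quoted above.

```python
import importlib


def check_binary_compat(module_name):
    """Try to import a module and report a numpy binary mismatch if one occurs.

    Hypothetical helper for illustration; not part of Presidio or spaCy.
    Returns None on success, or a short diagnostic string on failure.
    """
    try:
        importlib.import_module(module_name)
    except ValueError as exc:
        # The error in this issue is a ValueError raised while a compiled
        # extension initialises, so match on its message text.
        if "binary incompatibility" in str(exc):
            return f"{module_name}: compiled extension does not match installed numpy ({exc})"
        # Any other ValueError is unrelated and is re-raised unchanged.
        raise
    except ImportError as exc:
        return f"{module_name}: not installed ({exc})"
    return None
```

Calling `check_binary_compat("presidio_analyzer")` in an affected environment should return the diagnostic string rather than raising, which makes the check usable as a container build step.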

Expected behavior

Able to install the library, import it and run the demo code (https://microsoft.github.io/presidio/getting_started/).

Screenshots

N/A

Additional context

Looking at the official Docker image, it seems Python 3.9 is being used:

$ podman run --rm -it mcr.microsoft.com/presidio-analyzer bash
root@d9fed78f0a52:/usr/bin/presidio-analyzer# python -V
Python 3.9.19

Building against this exact version of Python yields the same error.

codingbandit commented 2 weeks ago

We began seeing this issue in the past day or two as well. Following.

omri374 commented 2 weeks ago

Thanks for posting. This looks like an issue between spaCy and numpy. Consider explicitly running pip install numpy as well.

JosephCatrambone commented 2 weeks ago

A new version of numpy got released two days ago. I had some luck pinning numpy==1.26.4.
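The pin above could be backed by a small guard that fails early with a readable result. This is a sketch only: the names `major_version` and `numpy_is_pre_2` are hypothetical, and treating numpy 2.x as the breaking release is an assumption inferred from the timing mentioned in this thread, not something confirmed here.

```python
from importlib import metadata


def major_version(version):
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version.split(".")[0])


def numpy_is_pre_2(version=None):
    """Return True if the given (or installed) numpy version is below 2.0.

    Hypothetical helper for illustration. If no version is passed and numpy
    is not installed, importlib.metadata.PackageNotFoundError is raised.
    """
    if version is None:
        # Read the installed version from package metadata without importing
        # numpy itself, so the check cannot trigger the import-time error.
        version = metadata.version("numpy")
    return major_version(version) < 2
```

With the pin in place, `numpy_is_pre_2("1.26.4")` is true, while a 2.x version string makes it false.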

omri374 commented 2 weeks ago

Root cause: https://github.com/explosion/thinc/issues/939