tensorflow / models

Models and examples built with TensorFlow

Dockerfile installs two tensorflow versions when building #9911

Open eirikaso opened 3 years ago

eirikaso commented 3 years ago

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/238922e98dd0e8254b5c0921b241a1f5a151782f/research/object_detection/dockerfiles/tf2/Dockerfile

2. Describe the bug

Some of the Python requirements installed from "/models/research/object_detection/packages/tf2/setup.py" pull in the latest tensorflow version (2.4.1). Since the Dockerfile builds from the tensorflow/tensorflow:2.2.0-gpu image, I now have tensorflow:2.2.0-gpu AND tensorflow:2.4.1 installed.
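One way to confirm the duplicate from inside the container is to list the core TensorFlow distributions that pip knows about. The following is only a sketch: the helper names are made up for illustration, and on the Python 3.6 images used in this thread `importlib.metadata` is not in the standard library, so the `importlib_metadata` backport is assumed as a fallback.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    # Assumption: the importlib_metadata backport has been pip-installed
    # (needed on the Python 3.6 images discussed in this thread).
    import importlib_metadata as metadata

CORE_NAMES = {"tensorflow", "tensorflow-gpu"}


def tensorflow_core_versions(dists):
    """Map each core TensorFlow distribution name to its version.

    `dists` is an iterable of (name, version) pairs.
    """
    found = {}
    for name, version in dists:
        normalized = name.lower().replace("_", "-")
        if normalized in CORE_NAMES:
            found[normalized] = version
    return found


def has_version_conflict(found):
    """True when both core packages are present with different versions."""
    return len(found) == 2 and len(set(found.values())) > 1


def installed_pairs():
    """Yield (name, version) for every installed distribution."""
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken metadata
            yield name, dist.version


if __name__ == "__main__":
    found = tensorflow_core_versions(installed_pairs())
    print(found, "conflict:", has_version_conflict(found))
```

If both `tensorflow` and `tensorflow-gpu` show up with different versions, the environment is in the state described in this report.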

When I try to train on a network I get the following output:

python object_detection/model_main_tf2.py --pipeline_config_path=${PIPELINE_CONFIG_PATH} --model_dir=${MODEL_DIR} --alsologtostderr

Traceback (most recent call last):
  File "object_detection/model_main_tf2.py", line 31, in <module>
    import tensorflow.compat.v2 as tf
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/__init__.py", line 436, in <module>
    _ll.load_library(_main_dir)
  File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 153, in load_library
    py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.6/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZN10tensorflow8OpKernel11TraceStringEPNS_15OpKernelContextEb
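The traceback mixes two install locations: the Python code comes from the user site under /home/tensorflow/.local, while the kernel library it tries to load lives under /usr/local/lib/python3.6/dist-packages. A small check like the one below shows which copy Python would pick up without actually importing it. The helper name `module_origin` is hypothetical; it relies only on the standard `importlib.util.find_spec`.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None.

    For a top-level name, find_spec only locates the module on sys.path;
    it does not execute the module's code.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print("tensorflow resolves to:", module_origin("tensorflow"))
```

A result under `.local/lib/.../site-packages` would suggest that the pip-installed copy shadows the one shipped in the base image.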

3. Steps to reproduce

Clone the "models" repository and install using docker. Follow installation instructions here: https://github.com/tensorflow/models/blob/238922e98dd0e8254b5c0921b241a1f5a151782f/research/object_detection/g3doc/tf2.md

Start training on a network: https://github.com/tensorflow/models/blob/238922e98dd0e8254b5c0921b241a1f5a151782f/research/object_detection/g3doc/tf2_training_and_evaluation.md


 PIPELINE_CONFIG_PATH={path to pipeline config file}
 MODEL_DIR={path to model directory}
 python object_detection/model_main_tf2.py \
     --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
     --model_dir=${MODEL_DIR} \
     --alsologtostderr

4. Expected behavior

I want to use the tensorflow:2.2.0-gpu image, or any tensorflow:X.X.X-gpu image, with the tensorflow version it ships. I do not want another tensorflow version to get installed during the build, as this screws up the environment.

5. Additional context

After the Python requirements from the https://github.com/tensorflow/models/blob/238922e98dd0e8254b5c0921b241a1f5a151782f/research/object_detection/packages/tf2/setup.py file are installed while building the Dockerfile, I get the following output, which indicates that all tensorflow-related packages have been updated to version 2.4 as well.

Successfully built object-detection avro-python3 crcmod dill future docopt pycocotools kaggle py-cpuinfo python-slugify seqeval promise
Installing collected packages: pyparsing, pytz, packaging, numpy, googleapis-common-protos, google-auth, wheel, threadpoolctl, text-unidecode, python-dateutil, pillow, kiwisolver, joblib, httplib2, grpcio, google-crc32c, google-api-core, cycler, uritemplate, typing-extensions, typeguard, tqdm, tensorflow-metadata, tensorflow-estimator, scikit-learn, python-slugify, proto-plus, promise, pbr, matplotlib, importlib-resources, google-resumable-media, google-cloud-core, google-auth-httplib2, future, flatbuffers, docopt, dm-tree, dill, dataclasses, Cython, attrs, tf-slim, tensorflow-model-optimization, tensorflow-hub, tensorflow-datasets, tensorflow-addons, tensorflow, seqeval, sentencepiece, pyyaml, pymongo, pydot, pycocotools, pyarrow, py-cpuinfo, psutil, pandas, opencv-python-headless, opencv-python, oauth2client, mock, kaggle, hdfs, google-cloud-bigquery, google-api-python-client, gin-config, fastavro, crcmod, avro-python3, tf-models-official, lvis, contextlib2, apache-beam, object-detection
Successfully installed Cython-0.29.23 apache-beam-2.28.0 attrs-20.3.0 avro-python3-1.9.2.1 contextlib2-0.6.0.post1 crcmod-1.7 cycler-0.10.0 dataclasses-0.8 dill-0.3.1.1 dm-tree-0.1.6 docopt-0.6.2 fastavro-1.3.5 flatbuffers-1.12 future-0.18.2 gin-config-0.4.0 google-api-core-1.26.3 google-api-python-client-2.2.0 google-auth-1.28.1 google-auth-httplib2-0.1.0 google-cloud-bigquery-2.13.1 google-cloud-core-1.6.0 google-crc32c-1.1.2 google-resumable-media-1.2.0 googleapis-common-protos-1.53.0 grpcio-1.32.0 hdfs-2.6.0 httplib2-0.17.4 importlib-resources-5.1.2 joblib-1.0.1 kaggle-1.5.12 kiwisolver-1.3.1 lvis-0.5.3 matplotlib-3.3.4 mock-2.0.0 numpy-1.19.5 oauth2client-4.1.3 object-detection-0.1 opencv-python-4.5.1.48 opencv-python-headless-4.5.1.48 packaging-20.9 pandas-1.1.5 pbr-5.5.1 pillow-8.2.0 promise-2.3 proto-plus-1.18.1 psutil-5.8.0 py-cpuinfo-8.0.0 pyarrow-2.0.0 pycocotools-2.0.2 pydot-1.4.2 pymongo-3.11.3 pyparsing-2.4.7 python-dateutil-2.8.1 python-slugify-4.0.1 pytz-2021.1 pyyaml-5.4.1 scikit-learn-0.24.1 sentencepiece-0.1.95 seqeval-1.2.2 tensorflow-2.4.1 tensorflow-addons-0.12.1 tensorflow-datasets-4.2.0 tensorflow-estimator-2.4.0 tensorflow-hub-0.12.0 tensorflow-metadata-0.29.0 tensorflow-model-optimization-0.5.0 text-unidecode-1.3 tf-models-official-2.4.0 tf-slim-1.1.0 threadpoolctl-2.1.0 tqdm-4.60.0 typeguard-2.12.0 typing-extensions-3.7.4.3 uritemplate-3.0.1 wheel-0.36.2

6. System information

eirikaso commented 3 years ago

I'm now able to run training in the docker container after switching to tensorflow:2.4.1-gpu as the Dockerfile base image. tensorflow 2.4.1 still gets installed again when the Python requirements are installed, though.

I also had to install opencv using apt, because the version already installed in the image by pip failed to import:

apt-get install -y python-opencv

Something is buggy but at least I'm able to train now

Output before installing opencv:

Traceback (most recent call last):
  File "object_detection/model_main_tf2.py", line 32, in <module>
    from object_detection import model_lib_v2
  File "/home/tensorflow/.local/lib/python3.6/site-packages/object_detection/model_lib_v2.py", line 29, in <module>
    from object_detection import eval_util
  File "/home/tensorflow/.local/lib/python3.6/site-packages/object_detection/eval_util.py", line 36, in <module>
    from object_detection.metrics import lvis_evaluation
  File "/home/tensorflow/.local/lib/python3.6/site-packages/object_detection/metrics/lvis_evaluation.py", line 23, in <module>
    from lvis import results as lvis_results
  File "/home/tensorflow/.local/lib/python3.6/site-packages/lvis/__init__.py", line 5, in <module>
    from lvis.vis import LVISVis
  File "/home/tensorflow/.local/lib/python3.6/site-packages/lvis/vis.py", line 1, in <module>
    import cv2
  File "/home/tensorflow/.local/lib/python3.6/site-packages/cv2/__init__.py", line 5, in <module>
    from .cv2 import *
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

eirikaso commented 3 years ago

It's not necessary to install python-opencv using apt after all. It's sufficient to install libgl1-mesa-glx:

apt-get install libgl1-mesa-glx

Training now runs on the GPU. Both tensorflow and tensorflow-gpu are installed in the container.
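The `ImportError` above comes from the dynamic loader failing to open `libGL.so.1`, so the fix can be verified without importing `cv2` at all. The snippet below is a minimal sketch of such a check using `ctypes` from the standard library; the function name is made up for illustration.

```python
import ctypes


def can_load_shared_library(soname):
    """Return True if the dynamic loader can open `soname`."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    print("libGL.so.1 loadable:", can_load_shared_library("libGL.so.1"))
```

Running it before and after installing libgl1-mesa-glx should show the result change from False to True if that package is indeed what provides the library in the image.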

vscv commented 3 years ago

Thanks @eirikaso, I have the same issue. In the od container, this removes the duplicate:

$ pip uninstall tensorflow==2.4.1

jdorri commented 3 years ago

Hey @eirikaso! I recently ran into the same issue - quite annoying as it unnecessarily increases the size of the container and lengthens the build time.

The problem is that the object detection package doesn't recognise tensorflow-gpu (the package installed in the base Docker image) as a valid version of tensorflow, so it attempts to install it. Seems like the official Docker image has a GPU specific version of tensorflow, a property I thought was only true with TF 1.0.

A workaround is to create a symbolic link that tricks pip into thinking tensorflow is already installed, by adding these lines to your Dockerfile:

WORKDIR /usr/local/lib/python3.6/dist-packages
RUN ln -s tensorflow_gpu-* tensorflow-$(ls -d1 tensorflow_gpu* | sed 's/tensorflow_gpu-\(.*\)/\1/')

Then, when you come to upgrading your base version of tensorflow, simply change the base image and rebuild the container without worrying about having two versions of tensorflow installed.
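For readers who find the `sed` expression hard to parse, here is roughly the same renaming logic written out in Python. This is an illustrative sketch, not part of the workaround itself: `alias_for` and `link_alias` are hypothetical names, and unlike the one-line `ln -s` it loops over every matching entry.

```python
import os
import re


def alias_for(entry):
    """Map 'tensorflow_gpu-<suffix>' to 'tensorflow-<suffix>'.

    Returns None for any other directory entry.
    """
    match = re.fullmatch(r"tensorflow_gpu-(.*)", entry)
    return None if match is None else "tensorflow-" + match.group(1)


def link_alias(dist_packages):
    """Create a 'tensorflow-*' symlink next to each 'tensorflow_gpu-*' entry."""
    created = []
    for entry in sorted(os.listdir(dist_packages)):
        alias = alias_for(entry)
        if alias is None:
            continue
        target = os.path.join(dist_packages, alias)
        if not os.path.lexists(target):
            # Relative link, like running `ln -s` inside the directory.
            os.symlink(entry, target)
            created.append(alias)
    return created


if __name__ == "__main__":
    # Path taken from the Dockerfile lines above.
    print(link_alias("/usr/local/lib/python3.6/dist-packages"))
```

For an entry named `tensorflow_gpu-2.4.1.dist-info`, the alias would be `tensorflow-2.4.1.dist-info`, which is the metadata directory pip looks for when checking whether tensorflow is installed.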

Here's my complete Dockerfile (works for me!):

## Custom Dockerfile for the Tensorflow Object Detection API ##

FROM tensorflow/tensorflow:2.4.1-gpu 

RUN python -c "import tensorflow as tf; print(f'Tensorflow version: {tf.__version__}')"

ARG DEBIAN_FRONTEND=noninteractive

# Install apt-get dependencies 
RUN apt-get update && apt-get install -y \
    python3-tk \
    libgl1-mesa-glx && rm -rf /var/lib/apt/lists/*

# Name the symlink with the suffix from tensorflow-gpu (see question 65098672: stackoverflow.com)
WORKDIR /usr/local/lib/python3.6/dist-packages
RUN ln -s tensorflow_gpu-* tensorflow-$(ls -d1 tensorflow_gpu* | sed 's/tensorflow_gpu-\(.*\)/\1/')

# Install protobuf 
RUN curl -L -O https://github.com/protocolbuffers/protobuf/releases/download/v3.11.4/protoc-3.11.4-linux-x86_64.zip && \
    unzip protoc-3.11.4-linux-x86_64.zip && \
    cp bin/protoc /usr/local/bin && \
    rm -r protoc-3.11.4-linux-x86_64.zip bin/

# Copy our local version of models into the image
WORKDIR /home/tf
COPY . /home/tf/models

# Compile the protocol buffers for Python
RUN (cd /home/tf/models/research/ && protoc object_detection/protos/*.proto --python_out=.)

# Install the Object Detection API
WORKDIR /home/tf/models/research/
RUN cp object_detection/packages/tf2/setup.py .

RUN python -m pip install --upgrade pip
RUN python -m pip install .

# Confirm tensorflow hasn't been reinstalled
RUN python -c "import tensorflow as tf; print(f'Tensorflow version: {tf.__version__}')"

# Add models to our python path
ENV PYTHONPATH="/home/tf/models:$PYTHONPATH"

tensorbuffer commented 3 years ago

This is very helpful. I hit the same error (undefined symbol: _ZN10tensorflow8OpKernel11TraceStringEPNS_15OpKernelContextEb) just by importing tensorflow inside docker, and I see that both tensorflow 2.6.0 and tensorflow-gpu 2.4.1 are installed (I changed the base image in the Dockerfile to 2.4.1). I wonder where it is told to install 2.6.0. Does it just pick the latest tensorflow release?

eirikaso commented 3 years ago

The latest version seems to be installed as a dependency when the different Python packages are installed from a requirements.txt file. Since versions are not specified in this file, I think one of the packages depends on tensorflow, so pip installs the latest release.
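To see which installed packages actually declare a dependency on tensorflow, the distribution metadata can be inspected directly. The sketch below assumes `importlib.metadata` (or the `importlib_metadata` backport on Python 3.6); the helper names are invented for illustration, and the name parsing is deliberately simple rather than a full requirement parser.

```python
import re

try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # assumed backport on older Pythons


def requirement_name(requirement):
    """Extract the normalized project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        return None
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def dependents_of(target, packages):
    """Return {package: [requirement strings]} for packages requiring `target`.

    `packages` is an iterable of (name, requires) pairs, where `requires`
    is a list of requirement strings or None.
    """
    hits = {}
    for name, requires in packages:
        matching = [r for r in requires or [] if requirement_name(r) == target]
        if matching:
            hits[name] = matching
    return hits


def installed_packages():
    """Yield (name, requires) for every installed distribution."""
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken metadata
            yield name, dist.requires


if __name__ == "__main__":
    hits = dependents_of("tensorflow", installed_packages())
    for name, reqs in sorted(hits.items()):
        print(name, "->", reqs)
```

Any entry printed without a version bound is a candidate for what makes pip fetch the newest release.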

Niccari commented 3 years ago

As for the workaround, I was able to align the tensorflow-gpu and tensorflow versions by rewriting the Dockerfile as shown below. The object_detection library should be versioned for each TensorFlow version as well as the tf_models_official library, IMO.

FROM tensorflow/tensorflow:2.5.0-gpu
ARG DEBIAN_FRONTEND=noninteractive

# Install apt dependencies
RUN apt-get update && apt-get install -y \
    git \
    gpg-agent \
    python3-cairocffi \
    protobuf-compiler \
    python3-pil \
    python3-lxml \
    python3-tk \
    wget

# Add new user to avoid running as root
RUN useradd -ms /bin/bash tensorflow
USER tensorflow
WORKDIR /home/tensorflow

# Clone Object Detection API
RUN git clone https://github.com/tensorflow/models/ /home/tensorflow/models/

# Workaround: If you use TF 2.2.x, uncomment the line below.
# WORKDIR /home/tensorflow/models/
# RUN git checkout 03a6d6c8e79b426231a4d5ba0cf45be9afc8bad5

# Workaround: If you use TF 2.3.x, uncomment the line below.
# WORKDIR /home/tensorflow/models/
# RUN git checkout cf82a72480a41a62b4bbe0f1378d319f0d6f5d5c

# Compile protobuf configs
RUN (cd /home/tensorflow/models/research/ && protoc object_detection/protos/*.proto --python_out=.)
WORKDIR /home/tensorflow/models/research/

RUN cp object_detection/packages/tf2/setup.py ./
ENV PATH="/home/tensorflow/.local/bin:${PATH}"

# Workaround (For Tensorflow < 2.5.1): Remove tf-models-official dependency from object_detection, will install it manually.
RUN sed -i -e 's/^.*tf-models-official.*$//g' ./setup.py

RUN python -m pip install -U pip

# Workaround: Lock tensorflow and corresponding tf-models-official versions.
RUN python -m pip install tensorflow==2.5.0 tensorflow-text==2.5.0 tf-models-official==2.5.0
RUN python -m pip install .

ENV TF_CPP_MIN_LOG_LEVEL 3

The changes are as follows.

  1. Manually install tensorflow and tf-models-official at the same version as tensorflow-gpu. Specify the tensorflow version explicitly so that later installs reference the already installed tensorflow library.
RUN python -m pip install -U pip
+ 
+ # Workaround: install tensorflow and corresponding tf-models-official versions.
+ RUN python -m pip install tensorflow==2.5.0 tensorflow-text==2.5.0 tf-models-official==2.5.0
RUN python -m pip install .

ENV TF_CPP_MIN_LOG_LEVEL 3

tf-models-official never pins the tensorflow version, and the tensorflow/tensorflow:x.x.x-gpu images do not include the tensorflow library by default. Therefore, if no tensorflow version is specified, the latest version will be installed.

  2. If tensorflow-gpu is < 2.5.1, remove tf-models-official from the dependencies of the object_detection library, because the latest object_detection specifies a dependency on tf-models-official >= 2.5.1.
ENV PATH="/home/tensorflow/.local/bin:${PATH}"

+ # Workaround (For tensorflow < 2.5.1): Remove tf-models-official dependency from object_detection, will install it manually.
+ RUN sed -i -e 's/^.*tf-models-official.*$//g' ./setup.py

RUN python -m pip install -U pip

  3. If tensorflow-gpu is < 2.4.0, clone an older object_detection package.
# Clone Object Detection API
RUN git clone https://github.com/tensorflow/models/ /home/tensorflow/models/
+ 
+ # Workaround: If you use TF 2.2.x, uncomment the line below.
+ # WORKDIR /home/tensorflow/models/
+ # RUN git checkout 03a6d6c8e79b426231a4d5ba0cf45be9afc8bad5
+ 
+ # Workaround: If you use TF 2.3.x, uncomment the line below.
+ # WORKDIR /home/tensorflow/models/
+ # RUN git checkout cf82a72480a41a62b4bbe0f1378d319f0d6f5d5c

# Compile protobuf configs
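The three changes above depend on the TensorFlow version of the base image, which can be summarised as a small lookup. This is a sketch only: `plan_for` is a hypothetical helper, the commit hashes are the ones quoted in the Dockerfile above, and it assumes that tensorflow-text and tf-models-official releases with the same version string exist, which this thread only confirms for 2.5.0.

```python
def parse_version(text):
    """Turn '2.4.1' into (2, 4, 1); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


# Commit hashes quoted in the Dockerfile above, keyed by (major, minor).
PINNED_COMMITS = {
    (2, 2): "03a6d6c8e79b426231a4d5ba0cf45be9afc8bad5",
    (2, 3): "cf82a72480a41a62b4bbe0f1378d319f0d6f5d5c",
}


def plan_for(tf_version):
    """Describe which workarounds apply for a given base-image version."""
    version = parse_version(tf_version)
    return {
        # Assumption: matching releases exist on PyPI for this version.
        "pip_pins": [
            f"tensorflow=={tf_version}",
            f"tensorflow-text=={tf_version}",
            f"tf-models-official=={tf_version}",
        ],
        # Strip tf-models-official from setup.py below 2.5.1.
        "strip_tf_models_official": version < (2, 5, 1),
        # Older object_detection checkout for TF 2.2.x and 2.3.x, else None.
        "checkout": PINNED_COMMITS.get(version[:2]),
    }


if __name__ == "__main__":
    print(plan_for("2.5.0"))
```

For the 2.5.0 base image used in the Dockerfile, this yields the three pins, the setup.py edit, and no older checkout.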

Installed library list (summarized, in TF 2.5.0)

$ pip list

...
object-detection              0.1
...
tensorboard                   2.5.0
tensorboard-data-server       0.6.1
tensorboard-plugin-wit        1.8.0
tensorflow                    2.5.0
tensorflow-addons             0.14.0
tensorflow-datasets           4.4.0
tensorflow-estimator          2.5.0rc0
tensorflow-hub                0.12.0
tensorflow-metadata           1.2.0
tensorflow-model-optimization 0.6.0
tensorflow-text               2.5.0
termcolor                     1.1.0
text-unidecode                1.3
tf-models-official            2.5.0
...

Confirmed TensorFlow versions

thmsgntz commented 2 years ago

Hello, @Niccari's solution works fine on my side! Thanks for your help. The last thing I need is to freeze the version of the cloned tensorflow/models repository, so that updates on the master branch cannot break the installation. That concerns this line:

RUN git clone https://github.com/tensorflow/models/ /home/tensorflow/models/

The best thing would be something like:

RUN git clone --branch v2.5.0 --depth 1 https://github.com/tensorflow/models/ /home/tensorflow/models/

BUT the research directory under models has been removed from the released versions (see the last commit, "Removing research/community models", on version 2.5.0 or 2.4.0). I tried to recover it from a previous commit on these branches, but that contains a really old version of the research directory, around 15 months old, and of course this old version of object detection does not support TF2 (see for example this commit).

Is there a way to achieve this?

Thanks a lot.