Exception in "make test-workflow" in "ocrd_all" environment

stefanCCS commented 1 year ago

After installing ocrd_all release v2023-06-14, I ran make test-workflow. This leads to the following error:

make -C core assets
make[1]: Entering directory '/home/gputest/ocrd_all/core'
git submodule sync --recursive repo/assets
if git submodule status --recursive repo/assets | grep -qv '^ '; then \
        git submodule update --init --recursive repo/assets && \
        touch repo/assets; \
fi
Submodule 'repo/assets' (https://github.com/OCR-D/assets) registered for path 'repo/assets'
Cloning into '/home/gputest/ocrd_all/core/repo/assets'...
Submodule path 'repo/assets': checked out 'bcfa982e81319513a13ae58ab2c216e014e52bd7'
rm -rf tests/assets
mkdir -p tests/assets
cp -r repo/assets/data/* tests/assets
make[1]: Leaving directory '/home/gputest/ocrd_all/core'
sem -q --will-cite --fg --id ocrd_all_git git submodule sync  core
Synchronizing submodule url for 'core'
if git submodule status  core | grep -qv '^ '; then \
        sem -q --will-cite --fg --id ocrd_all_git git submodule update --init   core && \
        touch core; fi
. /home/gputest/ocrd-3.8/bin/activate && cd core/tests/assets/SBB0000F29300010000/data/ && bash -x /home/gputest/ocrd_all/test-workflow.sh
+ set -e
+ ocrd resmgr download ocrd-sbb-binarize default-2021-03-09
2023-06-23 11:27:22.645 INFO ocrd.cli.resmgr - Downloading registered resource 'default-2021-03-09' (https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip)

2023-06-23 11:27:24.921 INFO ocrd.resource_manager._download_impl - Downloading https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip to download.tar.xx
2023-06-23 11:27:28.672 INFO ocrd.resource_manager.download - Extracting application/zip archive to /tmp/tmpc1mfe502/out
2023-06-23 11:27:29.452 INFO ocrd.resource_manager.download - Copying '.' from archive to /home/gputest/.local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09
2023-06-23 11:27:29.540 INFO ocrd.cli.resmgr - Installed resource https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip under /home/gputest/.local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09
2023-06-23 11:27:29.540 INFO ocrd.cli.resmgr - Use in parameters as 'default-2021-03-09'
+ ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default-2021-03-09
2023-06-23 11:27:38.592 INFO processor.SbbBinarize - INPUT FILE 0 / PHYS_0001
2023-06-23 11:27:39.011 INFO processor.SbbBinarize - Binarizing on 'page' level in page 'PHYS_0001'
2023-06-23 11:27:39.052 INFO processor.SbbBinarize.__init__ - Predicting with model /home/gputest/.local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09/saved_model_2021_03_09/ [1/1]
2023-06-23 11:27:40.975 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-sbb-binarize'
Traceback (most recent call last):
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/ocrd/processor/helpers.py", line 128, in run_processor
    processor.process()
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/sbb_binarize/ocrd_cli.py", line 113, in process
    bin_image = cv2pil(self.binarizer.run(image=pil2cv(page_image)))
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/sbb_binarize/sbb_binarize.py", line 244, in run
    res = self.predict(model, image)
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/sbb_binarize/sbb_binarize.py", line 157, in predict
    label_p_pred = model.predict(img_patch.reshape(1, img_patch.shape[0], img_patch.shape[1], img_patch.shape[2]),
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Graph execution error:

Detected at node 'model_2/conv1/Conv2D' defined at (most recent call last):
    File "/home/gputest/ocrd-3.8/bin/ocrd-sbb-binarize", line 8, in <module>
      sys.exit(cli())
    File "/home/gputest/ocrd-3.8/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
      return self.main(*args, **kwargs)
 ...

--> please clarify ...

bertsky commented 1 year ago

That's odd. It's probably caused by a broken installation of TensorFlow or its (implicit) dependencies (esp. libcudnn). I cannot reproduce it, though.

Could you please show the results of (in your active venv):

pip show torch
pip show tensorflow
pip show nvidia-cudnn-cu11
ldconfig -p | grep cudnn

(If you are using ocrd_all in a native installation, perhaps you need to run make fix-cuda, as is currently done in the Docker build. See the respective comments in the Makefile for an explanation.)

stefanCCS commented 1 year ago

Here are the results:

(ocrd-3.8) gputest@linuxgputest2:~/ocrd_all$ pip show torch
Name: torch
Version: 1.13.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/gputest/ocrd-3.8/lib/python3.8/site-packages
Requires: nvidia-cublas-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-runtime-cu11, nvidia-cudnn-cu11, typing-extensions
Required-by: kraken, ocrd-anybaseocr, ocrd-detectron2, ocrd-typegroups-classifier, pix2pixhd, pytorch-lightning, torchmetrics, torchvision
(ocrd-3.8) gputest@linuxgputest2:~/ocrd_all$ pip show tensorflow
Name: tensorflow
Version: 2.12.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /home/gputest/ocrd-3.8/lib/python3.8/site-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, jax, keras, libclang, numpy, opt-einsum, packaging, protobuf, setuptools, six, tensorboard, tensorflow-estimator, tensorflow-io-gcs-filesystem, termcolor, typing-extensions, wrapt
Required-by: calamari-ocr, eynollah, ocrd-anybaseocr, ocrd-calamari, sbb-binarization
(ocrd-3.8) gputest@linuxgputest2:~/ocrd_all$ pip show nvidia-cudnn-cu11
Name: nvidia-cudnn-cu11
Version: 8.5.0.96
Summary: cuDNN runtime libraries
Home-page: https://developer.nvidia.com/cuda-zone
Author: Nvidia CUDA Installer Team
Author-email: cuda_installer@nvidia.com
License: NVIDIA Proprietary Software
Location: /home/gputest/ocrd-3.8/lib/python3.8/site-packages
Requires: nvidia-cublas-cu11
Required-by: torch
(ocrd-3.8) gputest@linuxgputest2:~/ocrd_all$ ldconfig -p | grep cudnn
        libcudnn_ops_train.so.8 (libc6,x86-64) => /conda/lib/libcudnn_ops_train.so.8
        libcudnn_ops_train.so.8 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_ops_train.so.8
        libcudnn_ops_train.so (libc6,x86-64) => /conda/lib/libcudnn_ops_train.so
        libcudnn_ops_train.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_ops_train.so
        libcudnn_ops_infer.so.8 (libc6,x86-64) => /conda/lib/libcudnn_ops_infer.so.8
        libcudnn_ops_infer.so.8 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8
        libcudnn_ops_infer.so (libc6,x86-64) => /conda/lib/libcudnn_ops_infer.so
        libcudnn_ops_infer.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_ops_infer.so
        libcudnn_cnn_train.so.8 (libc6,x86-64) => /conda/lib/libcudnn_cnn_train.so.8
        libcudnn_cnn_train.so.8 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8
        libcudnn_cnn_train.so (libc6,x86-64) => /conda/lib/libcudnn_cnn_train.so
        libcudnn_cnn_train.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_cnn_train.so
        libcudnn_cnn_infer.so.8 (libc6,x86-64) => /conda/lib/libcudnn_cnn_infer.so.8
        libcudnn_cnn_infer.so.8 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8
        libcudnn_cnn_infer.so (libc6,x86-64) => /conda/lib/libcudnn_cnn_infer.so
        libcudnn_cnn_infer.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_cnn_infer.so
        libcudnn_adv_train.so.8 (libc6,x86-64) => /conda/lib/libcudnn_adv_train.so.8
        libcudnn_adv_train.so.8 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_adv_train.so.8
        libcudnn_adv_train.so (libc6,x86-64) => /conda/lib/libcudnn_adv_train.so
        libcudnn_adv_train.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_adv_train.so
        libcudnn_adv_infer.so.8 (libc6,x86-64) => /conda/lib/libcudnn_adv_infer.so.8
        libcudnn_adv_infer.so.8 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8
        libcudnn_adv_infer.so (libc6,x86-64) => /conda/lib/libcudnn_adv_infer.so
        libcudnn_adv_infer.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn_adv_infer.so
        libcudnn.so.8 (libc6,x86-64) => /conda/lib/libcudnn.so.8
        libcudnn.so.8 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn.so.8
        libcudnn.so (libc6,x86-64) => /conda/lib/libcudnn.so
        libcudnn.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudnn.so
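
The listing above shows every cuDNN library registered twice, once under /conda/lib and once under /lib/x86_64-linux-gnu. As a rough aid for reading such output, here is a minimal Python sketch that groups `ldconfig -p` entries by library name and reports names that occur in more than one directory. The function names `duplicate_libs` and `read_ldconfig` are made up for this illustration and are not part of ocrd_all; the sketch does not tell which copy the dynamic loader will actually pick.

```python
import re
import subprocess
from collections import defaultdict

# Matches one entry of `ldconfig -p`, e.g.
#   libcudnn.so.8 (libc6,x86-64) => /conda/lib/libcudnn.so.8
ENTRY = re.compile(r"^\s*(\S+)\s+\(([^)]*)\)\s+=>\s+(\S+)\s*$")


def duplicate_libs(ldconfig_output):
    """Map library names to their directories, keeping only names found in several directories."""
    dirs = defaultdict(list)
    for line in ldconfig_output.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # skip the header line and anything else that is not an entry
        name, _arch, path = match.groups()
        directory = path.rsplit("/", 1)[0]
        if directory not in dirs[name]:
            dirs[name].append(directory)
    return {name: found for name, found in dirs.items() if len(found) > 1}


def read_ldconfig(keyword="cudnn"):
    """Hypothetical helper: run `ldconfig -p` and keep only lines containing `keyword`."""
    out = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True, check=True).stdout
    return "\n".join(line for line in out.splitlines() if keyword in line)
```

Calling `duplicate_libs(read_ldconfig())` on the machine above should list each cuDNN library with both directories, assuming `ldconfig` is on the PATH.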

I have NOT run make fix-cuda so far. Should I do this now? (Or is there any other recommendation based on the output above?)

bertsky commented 1 year ago

I have NOT run make fix-cuda so far. Should I do this now? (Or is there any other recommendation based on the output above?)

Yes, please do. (TF 2.12 needs cudnn 8.6, but Torch 1.13 via ocrd_kraken pulled 8.5. We could also rerun the ocrd_detectron2 setup so we get Torch 2.0.1 and cudnn 8.6, but make fix-cuda is probably the easiest and safest ATM.)
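
To make the mismatch concrete, here is a small Python sketch that compares the installed nvidia-cudnn-cu11 version against the 8.6 mentioned above. Treating 8.6 as a lower bound is a simplifying assumption of this sketch (TensorFlow may be stricter than that), and the helper names `version_tuple`, `satisfies` and `installed_version` are invented for illustration. It only inspects the pip package in the active venv, not the system libraries that ldconfig reports.

```python
from importlib import metadata  # available from Python 3.8


def version_tuple(version):
    """Turn '8.5.0.96' into (8, 5, 0, 96); stop at the first non-numeric part."""
    parts = []
    for part in version.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    return tuple(parts)


def satisfies(installed, required):
    """True if `installed` is at least `required`, compared on as many parts as `required` has."""
    need = version_tuple(required)
    have = version_tuple(installed)[:len(need)]
    return have >= need


def installed_version(package):
    """Return the installed version of a pip package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    cudnn = installed_version("nvidia-cudnn-cu11")
    if cudnn is None:
        print("nvidia-cudnn-cu11 is not installed in this environment")
    else:
        # "8.6" is the requirement quoted in this thread for TF 2.12 (assumed lower bound)
        print("cudnn", cudnn, "ok" if satisfies(cudnn, "8.6") else "too old for TF 2.12")
```

With the version shown earlier in this thread (8.5.0.96), `satisfies` returns False, which matches the diagnosis above.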

stefanCCS commented 1 year ago

Looks good - make test-workflow has run through. I only see some ERRORs from KrakenSegment and EvaluateLines, like these:

2023-06-26 09:31:23.443 ERROR processor.KrakenSegment - Line 89 could not be assigned a region, creating a dummy region
2023-06-26 09:40:48.800 ERROR processor.EvaluateLines - Line 'region_0006_line_0001' contains too short word/glyph sequence (9<10)
2023-06-26 09:40:50.298 ERROR processor.EvaluateLines - line "region_0017_line_0002" in file "OCR-D-OCR4_PHYS_0001" is missing from input 2

--> I assume this is "just" something which may happen in EvaluateLines processing, depending on the content/image, so the software is OK. Correct?

bertsky commented 1 year ago

Yes, that happens all the time. The error is a result of Kraken's internal architecture. Hard to tell whether these are legitimate segmentation problems. But if test-workflow completes, you are fine.

stefanCCS commented 1 year ago

I will close this issue now. Please tell me if I should re-open it to track a documentation issue, so that the need for make fix-cuda gets documented somewhere.

qurator-spk / sbb_binarization

Exception in "make test-workflow" in "ocrd_all" environment #62