ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

NixOS packaging issues #202

Closed sjau closed 6 years ago

sjau commented 6 years ago

Hi there

I'm currently trying to write a package file for OCRmyPDF for NixOS. I think I'm already pretty far along, but now I'm stuck on an error that I have no idea how to fix, as it doesn't give any indication of where the problem actually occurs.

Anyway, I do get this error when it's trying to build OCRmyPDF:

building path(s) ‘/nix/store/kdpr7qaz85lrls5mwqyvgrfi5v811i5q-ORCmyPDF-5.4.3’
unpacking sources
unpacking source archive /nix/store/ajl9ibrhpbbrrccnyb7s7rl4ix8w7k48-source
source root is source
setting SOURCE_DATE_EPOCH to timestamp 315619200 of file source/tests/test_userunit.py
patching sources
configuring
building
Skipping external program tests because of --force
Traceback (most recent call last):
  File "nix_run_setup.py", line 8, in <module>
    exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
  File "setup.py", line 245, in <module>
    zip_safe=False)
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/core.py", line 108, in setup
    _setup_distribution = dist = klass(attrs)
  File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 338, in __init__
    _Distribution.__init__(self, attrs)
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/dist.py", line 281, in __init__
    self.finalize_options()
  File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 471, in finalize_options
    ep.load()(self, ep.name, value)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 188, in cffi_modules
    add_cffi_module(dist, cffi_module)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 49, in add_cffi_module
    execfile(build_file_name, mod_vars)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 22, in execfile
    src = f.read()
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
builder for ‘/nix/store/jsnfzz199dy49viv14l1is2i1d2r3lq9-ORCmyPDF-5.4.3.drv’ failed with exit code 1
cannot build derivation ‘/nix/store/niq3y1rw30sqx5gp5jwrd273hlv6xhb2-system-path.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/1l09hph7klh4a3p3rzlpn19qdzggbxdy-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’: 1 dependencies couldn't be built
error: build of ‘/nix/store/1l09hph7klh4a3p3rzlpn19qdzggbxdy-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’ failed

The current nix expression that I use to try to build it looks like:

{ lib, fetchFromGitHub, python3, callPackage, pytest, unpaper, ghostscript, tesseract, qpdf }:

with python3.pkgs;

let

  ruffus = callPackage ("/tankJL/opt/ruffus.nix") {};
  img2pdf = callPackage ("/tankJL/opt/img2pdf.nix") {};

in

buildPythonApplication rec {
  version = "5.4.3";
  name = "ORCmyPDF-${version}";

  src = fetchFromGitHub {
    owner = "jbarlow83";
    repo = "OCRmyPDF";
    rev = version;
    sha256 = "0vnn6g69vkqldbx76llmyz8h9ia7mkxcp290mxdsydy4bjjik6zf";
  };

  postPatch = ''
    substituteInPlace requirements.txt \
      --replace "ruffus == 2.6.3" "ruffus" \
      --replace "Pillow == 4.3.0" "Pillow" \
      --replace "reportlab == 3.4.0" "reportlab" \
      --replace "PyPDF2 == 1.26.0" "PyPDF2" \
      --replace "img2pdf == 0.2.4" "img2pdf" \
      --replace "cffi == 1.11.2" "cffi"
    substituteInPlace test_requirements.txt \
      --replace "pytest >= 3.0" "pytest"
    export SETUPTOOLS_SCM_PRETEND_VERSION="${version}"
  '';

  buildInputs = [ pytest pytest_xdist pytestcov setuptools_scm ];

  propagatedBuildInputs = [
    ruffus
    pillow
    reportlab
    pypdf2
    img2pdf
    cffi
    unpaper
    ghostscript
    tesseract
    qpdf
  ];

  meta = {
    homepage = https://github.com/jbarlow83/OCRmyPDF;
    description = "Adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.";
    license = lib.licenses.mit;
    maintainers = with lib.maintainers; [ hyper_ch ];
  };
}

I understand that there seems to be a problem with one of the files but I can't figure out where the problem actually occurs.

jbarlow83 commented 6 years ago

Never heard of NixOS but it looks interesting.

You'll need to set the locale to UTF-8, e.g. the environment variable LANG=C.utf-8

Probably for both set up and execution. Python 3 does not work well when the system locale is not Unicode aware.


sjau commented 6 years ago

Hi, thanks for the quick answer. I have now tried setting LANG as you suggested, but nothing has changed:

building path(s) ‘/nix/store/afxj0mmia4rnjhimksn8cx7hlizdnj3c-ORCmyPDF-5.4.3’
unpacking sources
unpacking source archive /nix/store/ajl9ibrhpbbrrccnyb7s7rl4ix8w7k48-source
source root is source
setting SOURCE_DATE_EPOCH to timestamp 315619200 of file source/tests/test_userunit.py
patching sources
-------------------------------------------

C.utf-8
-------------------------------------------
configuring
building
Skipping external program tests because of --force
Traceback (most recent call last):
  File "nix_run_setup.py", line 8, in <module>
    exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
  File "setup.py", line 245, in <module>
    zip_safe=False)
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/core.py", line 108, in setup
    _setup_distribution = dist = klass(attrs)
  File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 338, in __init__
    _Distribution.__init__(self, attrs)
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/dist.py", line 281, in __init__
    self.finalize_options()
  File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 471, in finalize_options
    ep.load()(self, ep.name, value)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 188, in cffi_modules
    add_cffi_module(dist, cffi_module)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 49, in add_cffi_module
    execfile(build_file_name, mod_vars)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 22, in execfile
    src = f.read()
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
builder for ‘/nix/store/6mvi91imkrlb2dbgqil0zx2h2m9avz0y-ORCmyPDF-5.4.3.drv’ failed with exit code 1
cannot build derivation ‘/nix/store/wncx82hszxhmpcxwgh3zm57d65dsf7iy-system-path.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/5f1np8np28rrx8bandg711pcbrs9470a-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’: 1 dependencies couldn't be built
error: build of ‘/nix/store/5f1np8np28rrx8bandg711pcbrs9470a-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’ failed

{ lib, fetchFromGitHub, python3, callPackage, pytest, unpaper, ghostscript, tesseract, qpdf }:

with python3.pkgs;

let

  ruffus = callPackage ("/tankJL/opt/ruffus.nix") {};
  img2pdf = callPackage ("/tankJL/opt/img2pdf.nix") {};

in

buildPythonApplication rec {
  version = "5.4.3";
  name = "ORCmyPDF-${version}";

  src = fetchFromGitHub {
    owner = "jbarlow83";
    repo = "OCRmyPDF";
    rev = version;
    sha256 = "0vnn6g69vkqldbx76llmyz8h9ia7mkxcp290mxdsydy4bjjik6zf";
  };

  postPatch = ''
    substituteInPlace requirements.txt \
      --replace "ruffus == 2.6.3" "ruffus" \
      --replace "Pillow == 4.3.0" "Pillow" \
      --replace "reportlab == 3.4.0" "reportlab" \
      --replace "PyPDF2 == 1.26.0" "PyPDF2" \
      --replace "img2pdf == 0.2.4" "img2pdf" \
      --replace "cffi == 1.11.2" "cffi"
    substituteInPlace test_requirements.txt \
      --replace "pytest >= 3.0" "pytest"
    export SETUPTOOLS_SCM_PRETEND_VERSION="${version}"
    echo "-------------------------------------------"
    echo $LANG
    export LANG="C.utf-8"
    echo $LANG
    echo "-------------------------------------------"
  '';

  buildInputs = [ pytest pytest_xdist pytestcov setuptools_scm ];

  propagatedBuildInputs = [
    ruffus
    pillow
    reportlab
    pypdf2
    img2pdf
    cffi
    unpaper
    ghostscript
    tesseract
    qpdf
  ];

  meta = {
    homepage = https://github.com/jbarlow83/OCRmyPDF;
    description = "Adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.";
    license = lib.licenses.mit;
    maintainers = with lib.maintainers; [ hyper_ch ];
  };
}

I also tried en_US.UTF-8 with the same result.

jbarlow83 commented 6 years ago

It's possible your package manager changes the LANG environment in a child process and then forks a different child to run Python. See if python3 -c 'import os; print(os.environ['LANG'])' actually reflects the change in locale.
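A minimal sketch of that check, written as a Python script rather than a shell one-liner so there are no nested quotes to escape. The helper name `lang_seen_by_child` is hypothetical, and `os.environ.get` is used so that an unset LANG is reported instead of raising a KeyError. It starts two separate child interpreters and shows that a LANG value handed to one child is not visible to the other:

```python
import os
import subprocess
import sys


def lang_seen_by_child(extra_env=None):
    """Start a fresh Python child process and return the LANG value it sees."""
    env = dict(os.environ)
    env.pop("LANG", None)  # start from a known state: LANG unset
    if extra_env:
        env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('LANG', '<unset>'))"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


# The first child is given LANG explicitly; the second child is started
# separately and does not inherit anything from the first one.
first = lang_seen_by_child({"LANG": "en_US.UTF-8"})
second = lang_seen_by_child()
print("child 1 sees LANG =", first)
print("child 2 sees LANG =", second)
```

If the build tool behaves like the second case, an `export LANG=...` in one phase would not reach the process that actually runs `setup.py`.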

Although maybe there's an explicit command to tell NixOS it needs a UTF-8 locale? I suspect other Py3 packages have dealt with this. The underlying issue is being fixed Python-wide for 3.7.

If you want to try hacking around it, it looks like it is choking on the utf-8 copyright symbol in ocrmypdf/lib/compile_leptonica.py, so you could replace that so that one script would execute and setup would likely complete. However ocrmypdf still needs a utf-8 locale to do its work, so it would probably fail the test suite if its execution environment is not truly utf-8.

jbarlow83 commented 6 years ago

When you do have it working please submit a PR to update the docs with installation instructions.

sjau commented 6 years ago

Hi

Thanks for the feedback. I have now tried this:

    echo "-------------------------------------------"
    echo $LANG
#    python3 -c 'import os; print(os.environ['LANG'])'
    python -c 'import sys; print(sys.getdefaultencoding())'
    export LANG="en_US.UTF-8"
#    python3 -c 'import os; print(os.environ['LANG'])'
    python -c 'import sys; print(sys.getdefaultencoding())'
    echo $LANG
    echo "-------------------------------------------"

Your command produced errors. I tried changing the quotes but couldn't get it to work, so I used the getdefaultencoding variant that I found on the internet instead. I don't know if there's a difference between them.

Anyway, here's the output:

-------------------------------------------

utf-8
utf-8
en_US.UTF-8
-------------------------------------------

Before explicitly setting the LANG environment variable, it doesn't contain any value, yet the getdefaultencoding variant already outputs utf-8. After setting LANG, getdefaultencoding still outputs utf-8 and LANG is en_US.UTF-8.
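The two checks are probably not equivalent. To my knowledge, `sys.getdefaultencoding()` is always utf-8 on Python 3 regardless of the locale, while `open()` without an explicit `encoding` argument falls back to `locale.getpreferredencoding(False)`, which would match the `encodings/ascii.py` frame in the traceback. A small sketch that prints these values side by side (the function name `describe_encodings` is only illustrative):

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the encoding-related settings this interpreter is running with."""
    return {
        # Raw environment variable; None when it is not set at all.
        "LANG": os.environ.get("LANG"),
        # Default for str/bytes conversions; fixed to utf-8 on Python 3.
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # What open() uses when no encoding argument is passed.
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
        # Used for file names and command-line arguments.
        "sys.getfilesystemencoding": sys.getfilesystemencoding(),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value!r}")
```

If the preferred encoding comes back as an ASCII variant inside the build environment, the locale is still not in effect there, whatever `getdefaultencoding` reports.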

Also, how did you figure out the problem likely resides in the compile_leptonica.py file?

sjau commented 6 years ago

Yes, the problem was the copyright sign. After I added the following to postPatch:

    substituteInPlace ocrmypdf/lib/compile_leptonica.py \
      --replace "©" "(c)"

the build fails with a different error:

running install tests
Checking for tesseract >= 3.04...
Found tesseract 3.05.00
Checking for gs >= 9.15...
Found gs 9.20
Checking for unpaper >= 6.1...
Found unpaper 6.1
Checking for qpdf >= 5.1.1...
Found qpdf 7.0.0
usage: nix_run_setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: nix_run_setup.py --help [cmd1 cmd2 ...]
   or: nix_run_setup.py --help-commands
   or: nix_run_setup.py cmd --help

error: invalid command 'pytest'
builder for ‘/nix/store/hy1fdjchpd25mggdk4pf6q0c0lvbyd6x-ORCmyPDF-5.4.3.drv’ failed with exit code 1
cannot build derivation ‘/nix/store/8kj6lik5cpx0archdqcn4m1zm6an7z6a-system-path.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/5zv7kr2mnza8bpdia7yj44qlxk91m6lx-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’: 1 dependencies couldn't be built

Let me see if I can get that to work :)

So, since I "fixed" it, I think this bug report can be closed. For me there doesn't seem to be a need to replace the copyright sign in your source.

jbarlow83 commented 6 years ago

error: invalid command 'pytest'

This can be fixed by installing pytest-runner prior to trying to run the command python3 setup.py test.

Strictly speaking pytest-runner should be added to setup_requires in setup.py. I will add that but you can do it explicitly too.

https://docs.pytest.org/en/latest/goodpractices.html#integrating-with-setuptools-python-setup-py-test-pytest-runner

Watch out for the import order here. Are you running the tests before or after install? It's probably better to install, mv ocrmypdf src in the directory that contains setup.py, and then run the test suite. This will ensure the installed version gets picked up instead of the uninstalled files in the local folder. (It turns out that it's a better practice to have a top level src/ folder to avoid this problem, at least for packages like mine that can't be tested before installation, but I don't want to change it until a major release since it will inconvenience all downstream maintainers.)
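One way to check which copy gets picked up is to ask the import system where it would load a package from. This is only a sketch: `module_origin` is a hypothetical helper, and `json` stands in for `ocrmypdf` so the snippet runs anywhere.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Replace "json" with "ocrmypdf" inside the build or test environment.
# A path under the source checkout means the uninstalled files win;
# a path under site-packages means the installed version is used.
print(module_origin("json"))
```

Running it from the directory that contains setup.py, before and after the `mv`, should show whether the local folder still shadows the installed package.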

It might be helpful to look at what the Debian maintainer did. Some of these problems are similar. https://git.spwhitton.name/ocrmypdf/tree/debian?h=debian


Also, how did you figure out the problem likely resides in the compile_leptonica.py file?

  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 188, in cffi_modules
    add_cffi_module(dist, cffi_module)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 49, in add_cffi_module
    execfile(build_file_name, mod_vars)
  File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 22, in execfile
    src = f.read()
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

The clues in the stack trace were: this has to do with building cffi modules and I only have one of those, leptonica. I looked up the source for cffi/setuptools_ext.py and this is where it executes the package's compile script.

It failed at position 2. What's at byte offset 2 in that file? A non-ASCII symbol.
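The same reasoning can be reproduced in a few lines. This is a sketch under assumptions: the sample header line below is made up for illustration and is not the real first line of compile_leptonica.py, and `first_non_ascii` is a hypothetical helper. The copyright sign U+00A9 encodes to the two bytes 0xc2 0xa9 in UTF-8, so a comment starting with `# ©` puts 0xc2 at offset 2:

```python
def first_non_ascii(data):
    """Return (offset, byte) of the first byte outside the ASCII range, or None."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset, byte
    return None


# Hypothetical file header: '#', ' ', then the two-byte UTF-8 form of the copyright sign.
sample = "# © 2017 Example Author\n".encode("utf-8")

print(first_non_ascii(sample))  # (2, 194), i.e. byte 0xc2 at offset 2

try:
    sample.decode("ascii")  # what reading the file does under an ASCII locale
except UnicodeDecodeError as exc:
    print(exc)
```

To scan a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the same function.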

sjau commented 6 years ago

Ok, adding pytest-runner helps. It wasn't listed here: https://github.com/jbarlow83/OCRmyPDF/blob/master/test_requirements.txt :)

As for the order: I don't know for sure. It seems the tests are run before installation - I get tons of AssertionErrors

```
env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
args = ('--force-ocr', '--pdf-renderer', 'tesseract')
p = , out = ''

    @pytest.helpers.register
    def check_ocrmypdf(input_file, output_file, *args, env=None):
        "Run ocrmypdf and confirmed that a valid file was created"
        p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env)
        #print(err) # ensure py.test collects the output, use -s to view
>       assert p.returncode == 0, "\n" + err + "\n"
E       AssertionError:
E       Traceback (most recent call last):
E         File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
E           "__main__", mod_spec)
E         File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
E           exec(code, run_globals)
E         File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in
E           _unicodefun._verify_python3_env()
E         File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env
E           'environment.' + extra)
E       RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment.
E
E       Additional information: on this system no suitable UTF-8
E       locales were discovered. This most likely requires resolving
E       by reconfiguring the locale system.
E
E
E       assert 1 == 0
E        +  where 1 = .returncode

tests/conftest.py:116: AssertionError
_____________________ test_tesseract_config_notfound[hocr] _____________________
[gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
renderer = 'hocr'
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
outdir = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_tesseract_config_notfound0')

    @pytest.mark.parametrize('renderer', RENDERERS)
    def test_tesseract_config_notfound(renderer, resources, outdir):
        cfg_file = outdir / 'nofile.cfg'
        p, out, err = run_ocrmypdf(
            resources / 'ccitt.pdf', outdir / 'out.pdf', '--pdf-renderer', renderer, '--tesseract-config', cfg_file)
>       assert "Can't open" in err, "No error message about missing config file"
E       AssertionError: No error message about missing config file
E       assert "Can't open" in 'Traceback (most recent call last):\n File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/r...o suitable UTF-8\nlocales were discovered. This most likely requires resolving\nby reconfiguring the locale system.\n'

tests/test_main.py:773: AssertionError
_____________________ test_pagesize_consistency[tesseract] _____________________
[gw3] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
renderer = 'tesseract'
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_pagesize_consistency_tess0/out.pdf'

    @pytest.mark.parametrize('renderer', RENDERERS)
    def test_pagesize_consistency(renderer, resources, outpdf):
        from math import isclose
        first_page_dimensions = pytest.helpers.first_page_dimensions
        infile = resources / 'linn.pdf'
        before_dims = first_page_dimensions(infile)
        check_ocrmypdf(
            infile, outpdf, '--pdf-renderer', renderer,
>           '--clean', '--deskew', '--remove-background', '--clean-final')

tests/test_main.py:843:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/linn.pdf')
output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_pagesize_consistency_tess0/out.pdf'
env = None
args = ('--pdf-renderer', 'tesseract', '--clean', '--deskew', '--remove-background', '--clean-final')
p = , out = ''

    @pytest.helpers.register
    def check_ocrmypdf(input_file, output_file, *args, env=None):
        "Run ocrmypdf and confirmed that a valid file was created"
        p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env)
        #print(err) # ensure py.test collects the output, use -s to view
>       assert p.returncode == 0, "\n" + err + "\n"
E       AssertionError:
E       Traceback (most recent call last):
E         File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
E           "__main__", mod_spec)
E         File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
E           exec(code, run_globals)
E         File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in
E           _unicodefun._verify_python3_env()
E         File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env
E           'environment.' + extra)
E       RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment.
E
E       Additional information: on this system no suitable UTF-8
E       locales were discovered. This most likely requires resolving
E       by reconfiguring the locale system.
E
E
E       assert 1 == 0
E        +  where 1 = .returncode

tests/conftest.py:116: AssertionError
_______________________________ test_user_words ________________________________
[gw0] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
outdir = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/test_user_words0')

    def test_user_words(resources, outdir):
        word_list = outdir / 'wordlist.txt'
        sidecar_before = outdir / 'sidecar_before.txt'
        sidecar_after = outdir / 'sidecar_after.txt'
        # Don't know how to make this test pass on various versions and platforms
        # so weaken to merely testing that the argument is accepted
        consistent = False
        if consistent:
            check_ocrmypdf(
                resources / 'crom.png', outdir / 'out.pdf',
                '--image-dpi', 150,
                '--sidecar', sidecar_before
                )
            assert 'cromulent' not in sidecar_before.open().read()
        with word_list.open('w') as f:
            f.write('cromulent\n') # a perfectly cromulent word
        check_ocrmypdf(
            resources / 'crom.png', outdir / 'out.pdf',
            '--image-dpi', 150,
            '--sidecar', sidecar_after,
>           '--user-words', word_list
            )

tests/test_main.py:817:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/crom.png')
output_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/test_user_words0/out.pdf')
env = None
args = ('--image-dpi', 150, '--sidecar', PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/...', PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/test_user_words0/wordlist.txt'))
p = , out = ''

    @pytest.helpers.register
    def check_ocrmypdf(input_file, output_file, *args, env=None):
        "Run ocrmypdf and confirmed that a valid file was created"
        p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env)
        #print(err) # ensure py.test collects the output, use -s to view
>       assert p.returncode == 0, "\n" + err + "\n"
E       AssertionError:
E       Traceback (most recent call last):
E         File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
E           "__main__", mod_spec)
E         File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
E           exec(code, run_globals)
E         File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in
E           _unicodefun._verify_python3_env()
E         File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env
E           'environment.' + extra)
E       RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment.
E
E       Additional information: on this system no suitable UTF-8
E       locales were discovered. This most likely requires resolving
E       by reconfiguring the locale system.
E
E
E       assert 1 == 0
E        +  where 1 = .returncode

tests/conftest.py:116: AssertionError
____________________________ test_gs_raster_failure ____________________________
[gw2] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
spoof_no_tess_gs_raster_fail = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw2/test_gs_raster_failure0/out.pdf'

    def test_gs_raster_failure(spoof_no_tess_gs_raster_fail, resources, outpdf):
        p, out, err = run_ocrmypdf(
            resources / 'ccitt.pdf', outpdf, env=spoof_no_tess_gs_raster_fail)
        print(err)
>       assert p.returncode == ExitCode.child_process_error
E       assert 1 ==
E        +  where 1 = .returncode
E        +  and = ExitCode.child_process_error

tests/test_main.py:871: AssertionError
----------------------------- Captured stdout call -----------------------------
Traceback (most recent call last):
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in
    _unicodefun._verify_python3_env()
  File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env
    'environment.' + extra)
RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment.

Additional information: on this system no suitable UTF-8
locales were discovered. This most likely requires resolving
by reconfiguring the locale system.

__________________ test_tesseract_config_notfound[tesseract] ___________________
[gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
renderer = 'tesseract'
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
outdir = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_tesseract_config_notfound1')

    @pytest.mark.parametrize('renderer', RENDERERS)
    def test_tesseract_config_notfound(renderer, resources, outdir):
        cfg_file = outdir / 'nofile.cfg'
        p, out, err = run_ocrmypdf(
            resources / 'ccitt.pdf', outdir / 'out.pdf', '--pdf-renderer', renderer, '--tesseract-config', cfg_file)
>       assert "Can't open" in err, "No error message about missing config file"
E       AssertionError: No error message about missing config file
E       assert "Can't open" in 'Traceback (most recent call last):\n File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/r...o suitable UTF-8\nlocales were discovered. This most likely requires resolving\nby reconfiguring the locale system.\n'

tests/test_main.py:773: AssertionError
_________________________ test_skip_big_with_no_images _________________________
[gw3] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
spoof_tesseract_noop = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_skip_big_with_no_images0/out.pdf'

    def test_skip_big_with_no_images(spoof_tesseract_noop, resources, outpdf):
        check_ocrmypdf(resources / 'blank.pdf', outpdf, '--skip-big', '5', '--force-ocr',
>                      env=spoof_tesseract_noop)

tests/test_main.py:855:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/blank.pdf')
output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_skip_big_with_no_images0/out.pdf'
env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
args = ('--skip-big', '5', '--force-ocr')
p = , out = ''

    @pytest.helpers.register
    def check_ocrmypdf(input_file, output_file, *args, env=None):
        "Run ocrmypdf and confirmed that a valid file was created"
        p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env)
        #print(err) # ensure py.test collects the output, use -s to view
>       assert p.returncode == 0, "\n" + err + "\n"
E       AssertionError:
E       Traceback (most recent call last):
E         File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
E           "__main__", mod_spec)
E         File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
E           exec(code, run_globals)
E         File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in
E           _unicodefun._verify_python3_env()
E         File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env
E           'environment.' + extra)
E       RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment.
E
E       Additional information: on this system no suitable UTF-8
E       locales were discovered. This most likely requires resolving
E       by reconfiguring the locale system.
E
E
E       assert 1 == 0
E        +  where 1 = .returncode

tests/conftest.py:116: AssertionError
______________________________ test_form_xobject _______________________________
[gw0] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
spoof_tesseract_noop = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/test_form_xobject0/out.pdf'

    def test_form_xobject(spoof_tesseract_noop, resources, outpdf):
        check_ocrmypdf(resources / 'formxobject.pdf', outpdf, '--force-ocr',
>                      env=spoof_tesseract_noop)

tests/test_main.py:827:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/formxobject.pdf')
output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/test_form_xobject0/out.pdf'
env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
args = ('--force-ocr',), p = out = ''

    @pytest.helpers.register
    def check_ocrmypdf(input_file, output_file, *args, env=None):
        "Run ocrmypdf and confirmed that a valid file was created"
        p, out, err
= run_ocrmypdf(input_file, output_file, *args, env=env) #print(err) # ensure py.test collects the output, use -s to view > assert p.returncode == 0, "\n" + err + "\n" E AssertionError: E Traceback (most recent call last): E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main E "__main__", mod_spec) E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code E exec(code, run_globals) E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in E _unicodefun._verify_python3_env() E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env E 'environment.' + extra) E RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. E E Additional information: on this system no suitable UTF-8 E locales were discovered. This most likely requires resolving E by reconfiguring the locale system. 
E E E assert 1 == 0 E + where 1 = .returncode tests/conftest.py:116: AssertionError _______________________________ test_no_contents _______________________________ [gw2] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_noop = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw2/test_no_contents0/out.pdf' def test_no_contents(spoof_tesseract_noop, resources, outpdf): check_ocrmypdf(resources / 'no_contents.pdf', outpdf, '--force-ocr', > env=spoof_tesseract_noop) tests/test_main.py:876: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/no_contents.pdf') output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw2/test_no_contents0/out.pdf' env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} args = ('--force-ocr',), p = out = '' @pytest.helpers.register def check_ocrmypdf(input_file, output_file, *args, env=None): "Run ocrmypdf and confirmed that a valid file was created" p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env) #print(err) # ensure py.test collects the output, use -s to view > assert p.returncode == 0, "\n" + err + "\n" E AssertionError: E Traceback (most recent call last): E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main E "__main__", mod_spec) E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code E exec(code, run_globals) E File 
"/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in E _unicodefun._verify_python3_env() E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env E 'environment.' + extra) E RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. E E Additional information: on this system no suitable UTF-8 E locales were discovered. This most likely requires resolving E by reconfiguring the locale system. E E E assert 1 == 0 E + where 1 = .returncode tests/conftest.py:116: AssertionError _____________________ test_tesseract_config_invalid[hocr] ______________________ [gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m renderer = 'hocr' resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') outdir = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_tesseract_config_invalid_0') @pytest.mark.parametrize('renderer', RENDERERS) def test_tesseract_config_invalid(renderer, resources, outdir): cfg_file = outdir / 'test.cfg' with cfg_file.open('w') as f: f.write('''\ THIS FILE IS INVALID ''') p, out, err = run_ocrmypdf( resources / 'ccitt.pdf', outdir / 'out.pdf', '--pdf-renderer', renderer, '--tesseract-config', cfg_file) > assert "parameter not found" in err, "No error message" E AssertionError: No error message E assert 'parameter not found' in 'Traceback (most recent call last):\n File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/r...o suitable UTF-8\nlocales were discovered. 
This most likely requires resolving\nby reconfiguring the locale system.\n'
tests/test_main.py:789: AssertionError
____________________________ test_gs_render_failure ____________________________
[gw3] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
spoof_no_tess_gs_render_fail = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_gs_render_failure0/out.pdf'
    def test_gs_render_failure(spoof_no_tess_gs_render_fail, resources, outpdf):
        p, out, err = run_ocrmypdf(
            resources / 'blank.pdf', outpdf, env=spoof_no_tess_gs_render_fail)
        print(err)
>       assert p.returncode == ExitCode.child_process_error
E       assert 1 ==
E        + where 1 = .returncode
E        + and = ExitCode.child_process_error
tests/test_main.py:863: AssertionError
----------------------------- Captured stdout call -----------------------------
Traceback (most recent call last):
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in
    _unicodefun._verify_python3_env()
  File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env
    'environment.' + extra)
RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment.
Additional information: on this system no suitable UTF-8
locales were discovered. This most likely requires resolving
by reconfiguring the locale system.
_______________________ test_pagesize_consistency[hocr] ________________________ [gw0] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m renderer = 'hocr' resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/test_pagesize_consistency_hocr0/out.pdf' @pytest.mark.parametrize('renderer', RENDERERS) def test_pagesize_consistency(renderer, resources, outpdf): from math import isclose first_page_dimensions = pytest.helpers.first_page_dimensions infile = resources / 'linn.pdf' before_dims = first_page_dimensions(infile) check_ocrmypdf( infile, outpdf, '--pdf-renderer', renderer, > '--clean', '--deskew', '--remove-background', '--clean-final') tests/test_main.py:843: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/linn.pdf') output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/test_pagesize_consistency_hocr0/out.pdf' env = None args = ('--pdf-renderer', 'hocr', '--clean', '--deskew', '--remove-background', '--clean-final') p = , out = '' @pytest.helpers.register def check_ocrmypdf(input_file, output_file, *args, env=None): "Run ocrmypdf and confirmed that a valid file was created" p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env) #print(err) # ensure py.test collects the output, use -s to view > assert p.returncode == 0, "\n" + err + "\n" E AssertionError: E Traceback (most recent call last): E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main E "__main__", mod_spec) E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code E exec(code, run_globals) E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 
53, in E _unicodefun._verify_python3_env() E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env E 'environment.' + extra) E RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. E E Additional information: on this system no suitable UTF-8 E locales were discovered. This most likely requires resolving E by reconfiguring the locale system. E E E assert 1 == 0 E + where 1 = .returncode tests/conftest.py:116: AssertionError ____________________ test_compression_preserved[baiona.png] ____________________ [gw2] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_noop = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} ocrmypdf_exec = ['/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m', '-m', 'ocrmypdf'] resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') image = 'baiona.png' outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw2/test_compression_preserved_bai0/out.pdf' @pytest.mark.parametrize('image', [ 'baiona.png', 'baiona_gray.png', 'congress.jpg' ]) def test_compression_preserved(spoof_tesseract_noop, ocrmypdf_exec, resources, image, outpdf): from PIL import Image input_file = str(resources / image) output_file = str(outpdf) im = Image.open(input_file) # Runs: ocrmypdf - output.pdf < testfile with open(input_file, 'rb') as input_stream: p_args = ocrmypdf_exec + [ '--image-dpi', '150', '--output-type', 'pdf', '-', output_file] p = Popen( p_args, close_fds=True, stdout=PIPE, stderr=PIPE, stdin=input_stream, env=spoof_tesseract_noop) out, err = p.communicate() > assert p.returncode == ExitCode.ok E assert 1 == E + where 1 = .returncode E + and = ExitCode.ok tests/test_main.py:902: AssertionError ___________________ 
test_tesseract_config_invalid[tesseract] ___________________ [gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m renderer = 'tesseract' resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') outdir = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_tesseract_config_invalid_1') @pytest.mark.parametrize('renderer', RENDERERS) def test_tesseract_config_invalid(renderer, resources, outdir): cfg_file = outdir / 'test.cfg' with cfg_file.open('w') as f: f.write('''\ THIS FILE IS INVALID ''') p, out, err = run_ocrmypdf( resources / 'ccitt.pdf', outdir / 'out.pdf', '--pdf-renderer', renderer, '--tesseract-config', cfg_file) > assert "parameter not found" in err, "No error message" E AssertionError: No error message E assert 'parameter not found' in 'Traceback (most recent call last):\n File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/r...o suitable UTF-8\nlocales were discovered. 
This most likely requires resolving\nby reconfiguring the locale system.\n' tests/test_main.py:789: AssertionError ___________________ test_compression_preserved[congress.jpg] ___________________ [gw3] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_noop = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} ocrmypdf_exec = ['/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m', '-m', 'ocrmypdf'] resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') image = 'congress.jpg' outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_compression_preserved_con0/out.pdf' @pytest.mark.parametrize('image', [ 'baiona.png', 'baiona_gray.png', 'congress.jpg' ]) def test_compression_preserved(spoof_tesseract_noop, ocrmypdf_exec, resources, image, outpdf): from PIL import Image input_file = str(resources / image) output_file = str(outpdf) im = Image.open(input_file) # Runs: ocrmypdf - output.pdf < testfile with open(input_file, 'rb') as input_stream: p_args = ocrmypdf_exec + [ '--image-dpi', '150', '--output-type', 'pdf', '-', output_file] p = Popen( p_args, close_fds=True, stdout=PIPE, stderr=PIPE, stdin=input_stream, env=spoof_tesseract_noop) out, err = p.communicate() > assert p.returncode == ExitCode.ok E assert 1 == E + where 1 = .returncode E + and = ExitCode.ok tests/test_main.py:902: AssertionError ______________ test_compression_changed[baiona_gray.png-lossless] ______________ [gw0] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_noop = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} ocrmypdf_exec = ['/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m', '-m', 'ocrmypdf'] resources = 
PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') image = 'baiona_gray.png', compression = 'lossless' outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/test_compression_changed_baion0/out.pdf' @pytest.mark.parametrize('image,compression', [ ('baiona.png', 'jpeg'), ('baiona_gray.png', 'lossless'), ('congress.jpg', 'lossless') ]) def test_compression_changed(spoof_tesseract_noop, ocrmypdf_exec, resources, image, compression, outpdf): from PIL import Image input_file = str(resources / image) output_file = str(outpdf) im = Image.open(input_file) # Runs: ocrmypdf - output.pdf < testfile with open(input_file, 'rb') as input_stream: p_args = ocrmypdf_exec + [ '--image-dpi', '150', '--output-type', 'pdfa', '--pdfa-image-compression', compression, '-', output_file] p = Popen( p_args, close_fds=True, stdout=PIPE, stderr=PIPE, stdin=input_stream, env=spoof_tesseract_noop) out, err = p.communicate() > assert p.returncode == ExitCode.ok E assert 1 == E + where 1 = .returncode E + and = ExitCode.ok tests/test_main.py:947: AssertionError _________________ test_compression_preserved[baiona_gray.png] __________________ [gw2] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_noop = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} ocrmypdf_exec = ['/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m', '-m', 'ocrmypdf'] resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') image = 'baiona_gray.png' outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw2/test_compression_preserved_bai1/out.pdf' @pytest.mark.parametrize('image', [ 'baiona.png', 'baiona_gray.png', 'congress.jpg' ]) def test_compression_preserved(spoof_tesseract_noop, ocrmypdf_exec, resources, image, outpdf): from PIL import Image input_file = 
str(resources / image) output_file = str(outpdf) im = Image.open(input_file) # Runs: ocrmypdf - output.pdf < testfile with open(input_file, 'rb') as input_stream: p_args = ocrmypdf_exec + [ '--image-dpi', '150', '--output-type', 'pdf', '-', output_file] p = Popen( p_args, close_fds=True, stdout=PIPE, stderr=PIPE, stdin=input_stream, env=spoof_tesseract_noop) out, err = p.communicate() > assert p.returncode == ExitCode.ok E assert 1 == E + where 1 = .returncode E + and = ExitCode.ok tests/test_main.py:902: AssertionError ____________________________ test_sidecar_pagecount ____________________________ [gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_cache = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_sidecar_pagecount0/out.pdf' def test_sidecar_pagecount(spoof_tesseract_cache, resources, outpdf): sidecar = outpdf + '.txt' check_ocrmypdf( resources / 'multipage.pdf', outpdf, '--skip-text', '--sidecar', sidecar, > env=spoof_tesseract_cache) tests/test_main.py:972: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/multipage.pdf') output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_sidecar_pagecount0/out.pdf' env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} args = ('--skip-text', '--sidecar', '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_sidecar_pagecount0/out.pdf.txt') p = , out = '' @pytest.helpers.register def check_ocrmypdf(input_file, output_file, *args, env=None): "Run ocrmypdf 
and confirmed that a valid file was created" p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env) #print(err) # ensure py.test collects the output, use -s to view > assert p.returncode == 0, "\n" + err + "\n" E AssertionError: E Traceback (most recent call last): E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main E "__main__", mod_spec) E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code E exec(code, run_globals) E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in E _unicodefun._verify_python3_env() E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env E 'environment.' + extra) E RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. E E Additional information: on this system no suitable UTF-8 E locales were discovered. This most likely requires resolving E by reconfiguring the locale system. 
E E E assert 1 == 0 E + where 1 = .returncode tests/conftest.py:116: AssertionError __________________ test_compression_changed[baiona.png-jpeg] ___________________ [gw3] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_noop = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} ocrmypdf_exec = ['/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m', '-m', 'ocrmypdf'] resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') image = 'baiona.png', compression = 'jpeg' outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_compression_changed_baion0/out.pdf' @pytest.mark.parametrize('image,compression', [ ('baiona.png', 'jpeg'), ('baiona_gray.png', 'lossless'), ('congress.jpg', 'lossless') ]) def test_compression_changed(spoof_tesseract_noop, ocrmypdf_exec, resources, image, compression, outpdf): from PIL import Image input_file = str(resources / image) output_file = str(outpdf) im = Image.open(input_file) # Runs: ocrmypdf - output.pdf < testfile with open(input_file, 'rb') as input_stream: p_args = ocrmypdf_exec + [ '--image-dpi', '150', '--output-type', 'pdfa', '--pdfa-image-compression', compression, '-', output_file] p = Popen( p_args, close_fds=True, stdout=PIPE, stderr=PIPE, stdin=input_stream, env=spoof_tesseract_noop) out, err = p.communicate() > assert p.returncode == ExitCode.ok E assert 1 == E + where 1 = .returncode E + and = ExitCode.ok tests/test_main.py:947: AssertionError _______________ test_compression_changed[congress.jpg-lossless] ________________ [gw0] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_noop = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} ocrmypdf_exec = 
['/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m', '-m', 'ocrmypdf'] resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') image = 'congress.jpg', compression = 'lossless' outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/test_compression_changed_congr0/out.pdf' @pytest.mark.parametrize('image,compression', [ ('baiona.png', 'jpeg'), ('baiona_gray.png', 'lossless'), ('congress.jpg', 'lossless') ]) def test_compression_changed(spoof_tesseract_noop, ocrmypdf_exec, resources, image, compression, outpdf): from PIL import Image input_file = str(resources / image) output_file = str(outpdf) im = Image.open(input_file) # Runs: ocrmypdf - output.pdf < testfile with open(input_file, 'rb') as input_stream: p_args = ocrmypdf_exec + [ '--image-dpi', '150', '--output-type', 'pdfa', '--pdfa-image-compression', compression, '-', output_file] p = Popen( p_args, close_fds=True, stdout=PIPE, stderr=PIPE, stdin=input_stream, env=spoof_tesseract_noop) out, err = p.communicate() > assert p.returncode == ExitCode.ok E assert 1 == E + where 1 = .returncode E + and = ExitCode.ok tests/test_main.py:947: AssertionError ____________________________ test_sidecar_nonempty _____________________________ [gw2] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_cache = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw2/test_sidecar_nonempty0/out.pdf' def test_sidecar_nonempty(spoof_tesseract_cache, resources, outpdf): sidecar = outpdf + '.txt' check_ocrmypdf( resources / 'ccitt.pdf', outpdf, '--sidecar', sidecar, > env=spoof_tesseract_cache ) tests/test_main.py:991: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/ccitt.pdf') output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw2/test_sidecar_nonempty0/out.pdf' env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} args = ('--sidecar', '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw2/test_sidecar_nonempty0/out.pdf.txt') p = , out = '' @pytest.helpers.register def check_ocrmypdf(input_file, output_file, *args, env=None): "Run ocrmypdf and confirmed that a valid file was created" p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env) #print(err) # ensure py.test collects the output, use -s to view > assert p.returncode == 0, "\n" + err + "\n" E AssertionError: E Traceback (most recent call last): E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main E "__main__", mod_spec) E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code E exec(code, run_globals) E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in E _unicodefun._verify_python3_env() E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env E 'environment.' + extra) E RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. E E Additional information: on this system no suitable UTF-8 E locales were discovered. This most likely requires resolving E by reconfiguring the locale system. 
E E E assert 1 == 0 E + where 1 = .returncode tests/conftest.py:116: AssertionError _________________________________ test_pdfa_1 __________________________________ [gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_cache = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_pdfa_10/out.pdf' def test_pdfa_1(spoof_tesseract_cache, resources, outpdf): check_ocrmypdf( resources / 'ccitt.pdf', outpdf, '--output-type', 'pdfa-1', > env=spoof_tesseract_cache ) tests/test_main.py:1003: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/ccitt.pdf') output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_pdfa_10/out.pdf' env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} args = ('--output-type', 'pdfa-1') p = , out = '' @pytest.helpers.register def check_ocrmypdf(input_file, output_file, *args, env=None): "Run ocrmypdf and confirmed that a valid file was created" p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env) #print(err) # ensure py.test collects the output, use -s to view > assert p.returncode == 0, "\n" + err + "\n" E AssertionError: E Traceback (most recent call last): E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main E "__main__", mod_spec) E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code E exec(code, run_globals) E File 
"/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in E _unicodefun._verify_python3_env() E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env E 'environment.' + extra) E RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. E E Additional information: on this system no suitable UTF-8 E locales were discovered. This most likely requires resolving E by reconfiguring the locale system. E E E assert 1 == 0 E + where 1 = .returncode tests/conftest.py:116: AssertionError _______________________ test_textonly_pdf_on_older_tess3 _______________________ [gw0] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') no_outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw0/test_textonly_pdf_on_older_tes0/no_output.pdf' @pytest.mark.skipif(tesseract.has_textonly_pdf(), reason="check that missing dep is reported on old tess3") def test_textonly_pdf_on_older_tess3(resources, no_outpdf): p, _, _ = pytest.helpers.run_ocrmypdf( resources / 'linn.pdf', no_outpdf, '--pdf-renderer', 'sandwich') > assert p.returncode == ExitCode.missing_dependency E assert 1 == E + where 1 = .returncode E + and = ExitCode.missing_dependency tests/test_tess3.py:21: AssertionError ______________________________ test_oem_on_tess3 _______________________________ [gw2] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources') no_outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw2/test_oem_on_tess30/no_output.pdf' def test_oem_on_tess3(resources, no_outpdf): p, _, err = pytest.helpers.run_ocrmypdf( resources / 'aspect.pdf', no_outpdf, '--tesseract-oem', 
'1')
>       assert p.returncode == ExitCode.ok
E       assert 1 ==
E        + where 1 = .returncode
E        + and = ExitCode.ok
tests/test_tess3.py:39: AssertionError
_______________________ test_userunit_ghostscript_fails ________________________
[gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
poster = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/poster.pdf')
no_outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_userunit_ghostscript_fail0/no_output.pdf'
    def test_userunit_ghostscript_fails(poster, no_outpdf):
        p, out, err = run_ocrmypdf(poster, no_outpdf, '--output-type=pdfa')
>       assert p.returncode == ExitCode.input_file
E       assert 1 ==
E        + where 1 = .returncode
E        + and = ExitCode.input_file
tests/test_userunit.py:30: AssertionError
__________________________ test_userunit_qpdf_passes ___________________________
[gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
spoof_tesseract_cache = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
poster = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/poster.pdf')
outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_userunit_qpdf_passes0/out.pdf'
    def test_userunit_qpdf_passes(spoof_tesseract_cache, poster, outpdf):
        before = PdfInfo(poster)
        check_ocrmypdf(poster, outpdf, '--output-type=pdf',
>           env=spoof_tesseract_cache)
tests/test_userunit.py:36:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/poster.pdf')
output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_userunit_qpdf_passes0/out.pdf'
env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL':
'/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} args = ('--output-type=pdf',), p = out = '' @pytest.helpers.register def check_ocrmypdf(input_file, output_file, *args, env=None): "Run ocrmypdf and confirmed that a valid file was created" p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env) #print(err) # ensure py.test collects the output, use -s to view > assert p.returncode == 0, "\n" + err + "\n" E AssertionError: E Traceback (most recent call last): E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main E "__main__", mod_spec) E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code E exec(code, run_globals) E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in E _unicodefun._verify_python3_env() E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env E 'environment.' + extra) E RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. E E Additional information: on this system no suitable UTF-8 E locales were discovered. This most likely requires resolving E by reconfiguring the locale system. 
E E E assert 1 == 0 E + where 1 = .returncode tests/conftest.py:116: AssertionError ___________________________ test_rotate_interaction ____________________________ [gw3] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m spoof_tesseract_cache = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} poster = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/poster.pdf') outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_rotate_interaction0/out.pdf' def test_rotate_interaction(spoof_tesseract_cache, poster, outpdf): check_ocrmypdf(poster, outpdf, '--output-type=pdf', '--rotate-pages', > env=spoof_tesseract_cache) tests/test_userunit.py:45: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/poster.pdf') output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_rotate_interaction0/out.pdf' env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...} args = ('--output-type=pdf', '--rotate-pages') p = , out = '' @pytest.helpers.register def check_ocrmypdf(input_file, output_file, *args, env=None): "Run ocrmypdf and confirmed that a valid file was created" p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env) #print(err) # ensure py.test collects the output, use -s to view > assert p.returncode == 0, "\n" + err + "\n" E AssertionError: E Traceback (most recent call last): E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main E "__main__", mod_spec) E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/runpy.py", line 85, in _run_code E exec(code, run_globals) E File 
"/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in E _unicodefun._verify_python3_env() E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/_unicodefun.py", line 108, in _verify_python3_env E 'environment.' + extra) E RuntimeError: ocrmypdf will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. E E Additional information: on this system no suitable UTF-8 E locales were discovered. This most likely requires resolving E by reconfiguring the locale system. E E E assert 1 == 0 E + where 1 = .returncode tests/conftest.py:116: AssertionError ============== 111 failed, 11 passed, 6 skipped in 12.61 seconds =============== ```

There are many more; I just copied what I could from the terminal output.

The input and output files are still in the /tmp folder where the sources were unpacked. Paths in the traceback such as File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in <module> also indicate that the tests run before installation.

sjau commented 6 years ago

I think I found the problem, but I'm not sure how to solve it :) I added this to my nix file:

    echo "-------------------------------------------"
    echo $LANG
    python -c 'import locale; print(locale.getpreferredencoding())'
    python -c 'import locale; import codecs; print(codecs.lookup(locale.getpreferredencoding()).name)'
    echo "-------------------------------------------"

And I got this output:

-------------------------------------------

ANSI_X3.4-1968
ascii
-------------------------------------------

Even adding

export LANG="utf-8"

doesn't help.
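For anyone reproducing this, here is a minimal Python sketch of the same check. `canonical_codec` and `is_ascii_only` are hypothetical helper names and are not part of ocrmypdf; they only normalise an encoding label the way the second `python -c` line above does. Presumably `LANG="utf-8"` has no effect because `LANG` expects a locale name such as `en_US.UTF-8` whose locale data is actually installed, but that is an assumption and not something the output above proves.

```python
import codecs
import locale


def canonical_codec(name):
    """Return Python's canonical codec name for an encoding label."""
    return codecs.lookup(name).name


def is_ascii_only(name):
    """True when the label resolves to plain ASCII, e.g. 'ANSI_X3.4-1968'."""
    return canonical_codec(name) == "ascii"


if __name__ == "__main__":
    preferred = locale.getpreferredencoding()
    print(preferred, "->", canonical_codec(preferred))
    if is_ascii_only(preferred):
        print("Python sees an ASCII environment; a UTF-8 locale is needed.")
```

Running it inside the build environment should show whether Python still falls back to ASCII after a locale change.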

jbarlow83 commented 6 years ago

Have a look at https://github.com/NixOS/nixpkgs/blob/master/doc/languages-frameworks/python.md. Under the heading "automatic tests" it suggests:

> Unicode issues can typically be fixed by including glibcLocales in buildInputs and exporting LC_ALL=en_US.utf-8.

sjau commented 6 years ago

Cool :) Pretty sure you now know Python on NixOS better than I do... Adding glibcLocales helped, but there are still plenty of errors left.

=============== 58 failed, 64 passed, 6 skipped in 93.94 seconds ===============

Before it was 111 failed.

There seem to be a few like this:

[gw3] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m

spoof_tesseract_cache = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
poster = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/poster.pdf')
outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_rotate_interaction0/out.pdf'

    def test_rotate_interaction(spoof_tesseract_cache, poster, outpdf):
        check_ocrmypdf(poster, outpdf, '--output-type=pdf',
                       '--rotate-pages',
>                      env=spoof_tesseract_cache)

tests/test_userunit.py:45: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/poster.pdf')
output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_rotate_interaction0/out.pdf'
env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
args = ('--output-type=pdf', '--rotate-pages')
p = <subprocess.Popen object at 0x7fffe3ed5390>, out = ''

    @pytest.helpers.register
    def check_ocrmypdf(input_file, output_file, *args, env=None):
        "Run ocrmypdf and confirmed that a valid file was created"

        p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env)
        #print(err)  # ensure py.test collects the output, use -s to view
>       assert p.returncode == 0, "<stderr>\n" + err + "\n</stderr>"
E       AssertionError: <stderr>
E           ERROR - Traceback (most recent call last):
E           File "/nix/store/m9drfazy2zqrlw9l6pgz913x1xwyzzsl-python3.6-ruffus-2.6.3/lib/python3.6/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
E             register_cleanup, touch_files_only)
E           File "/nix/store/m9drfazy2zqrlw9l6pgz913x1xwyzzsl-python3.6-ruffus-2.6.3/lib/python3.6/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
E             ret_val = user_defined_work_func(*params)
E           File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/pipeline.py", line 359, in orient_page
E             log=log)
E           File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/exec/tesseract.py", line 124, in get_orientation
E             universal_newlines=True, timeout=timeout)
E           File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/subprocess.py", line 336, in check_output
E             **kwargs).stdout
E           File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/subprocess.py", line 405, in run
E             stdout, stderr = process.communicate(input, timeout=timeout)
E           File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/subprocess.py", line 843, in communicate
E             stdout, stderr = self._communicate(input, endtime, timeout)
E           File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/subprocess.py", line 1554, in _communicate
E             self.stdout.errors)
E           File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/subprocess.py", line 740, in _translate_newlines
E             data = data.decode(encoding, errors)
E         UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 168: invalid start byte
E         
E         
E         </stderr>
E       assert 15 == 0
E        +  where 15 = <subprocess.Popen object at 0x7fffe3ed5390>.returncode

tests/conftest.py:116: AssertionError

and also a few of those:

[gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m

renderer = 'tesseract'
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_pagesize_consistency_tess0/out.pdf'

    @pytest.mark.parametrize('renderer', RENDERERS)
    def test_pagesize_consistency(renderer, resources, outpdf):
        from math import isclose

        first_page_dimensions = pytest.helpers.first_page_dimensions

        infile = resources / 'linn.pdf'

        before_dims = first_page_dimensions(infile)

        check_ocrmypdf(
            infile,
            outpdf, '--pdf-renderer', renderer,
>           '--clean', '--deskew', '--remove-background', '--clean-final')

tests/test_main.py:843: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/linn.pdf')
output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_pagesize_consistency_tess0/out.pdf'
env = None
args = ('--pdf-renderer', 'tesseract', '--clean', '--deskew', '--remove-background', '--clean-final')
p = <subprocess.Popen object at 0x7fffe69fccc0>, out = ''

    @pytest.helpers.register
    def check_ocrmypdf(input_file, output_file, *args, env=None):
        "Run ocrmypdf and confirmed that a valid file was created"

        p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env)
        #print(err)  # ensure py.test collects the output, use -s to view
>       assert p.returncode == 0, "<stderr>\n" + err + "\n</stderr>"
E       AssertionError: <stderr>
E            INFO -    0: background removal skipped on mono page
E         
E           ERROR - Traceback (most recent call last):
E           File "/nix/store/m9drfazy2zqrlw9l6pgz913x1xwyzzsl-python3.6-ruffus-2.6.3/lib/python3.6/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
E             register_cleanup, touch_files_only)
E           File "/nix/store/m9drfazy2zqrlw9l6pgz913x1xwyzzsl-python3.6-ruffus-2.6.3/lib/python3.6/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
E             ret_val = user_defined_work_func(*params)
E           File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/pipeline.py", line 475, in preprocess_deskew
E             leptonica.deskew(input_file, output_file, dpi)
E           File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/leptonica.py", line 522, in deskew
E             pix_source = Pix.read(infile)
E           File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/leptonica.py", line 220, in read
E             return cls(lept.pixRead(os.fsencode(filename)))
E         ffi.error: symbol 'pixRead' not found in library '<None>': /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m: undefined symbol: pixRead
E         
E         
E         </stderr>
E       assert 15 == 0
E        +  where 15 = <subprocess.Popen object at 0x7fffe69fccc0>.returncode

tests/conftest.py:116: AssertionError

sjau commented 6 years ago

Why is it so picky about encodings?

jbarlow83 commented 6 years ago

ocrmypdf and tesseract both work with about 100 languages so encodings have to be respected.

I don't think that's an encoding problem, though. The byte 0xff never appears in a valid UTF-8 stream. Something else went wrong that produced a 0xff in Tesseract's stdout, and that caused the decoding error. The assert that triggered that error message indicates Tesseract returned nonzero (it probably crashed and corrupted its stdout).
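A short Python sketch of why a stray 0xff points to corrupted output and not to a locale problem. The helper names are hypothetical and the sample bytes are made up for illustration; they are not real Tesseract output.

```python
def try_decode_utf8(data):
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


def contains_invalid_utf8_byte(data):
    """Bytes 0xC0, 0xC1 and 0xF5..0xFF never occur in well-formed UTF-8."""
    never_valid = {0xC0, 0xC1} | set(range(0xF5, 0x100))
    return any(b in never_valid for b in data)


if __name__ == "__main__":
    # Made-up bytes standing in for a damaged stdout stream.
    sample = b"Orientation in degrees: 0\n\xff"
    text, error = try_decode_utf8(sample)
    print(error)
    # For diagnosis, errors="replace" keeps the readable part of the stream.
    print(sample.decode("utf-8", errors="replace"))
```

Decoding with `errors="replace"` is only a debugging aid here; it shows what the subprocess printed without hiding that something upstream went wrong.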

A few thoughts:

sjau commented 6 years ago

Adding leptonica as a build input now produces a whole different bunch of new errors. It also aborts after a few tests:

generating cffi module 'ocrmypdf/lib/_leptonica.py'
============================= test session starts ==============================
platform linux -- Python 3.6.3, pytest-3.2.3, py-1.4.34, pluggy-0.4.0
rootdir: /tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source, inifile: setup.cfg
plugins: forked-5.4.3, xdist-1.20.1, cov-2.4.0, helpers-namespace-2017.11.11
gw0 I / gw1 I / gw2 I / gw3 I
gw0 [16] / gw1 [16] / gw2 [16] / gw3 [16]

scheduling tests via LoadScheduling
........ssFFssss
==================================== ERRORS ====================================
_____________________ ERROR collecting tests/test_main.py ______________________
tests/test_main.py:11: in <module>
    from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
_____________________ ERROR collecting tests/test_main.py ______________________
tests/test_main.py:11: in <module>
    from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
_____________________ ERROR collecting tests/test_main.py ______________________
tests/test_main.py:11: in <module>
    from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
________________ ERROR collecting tests/test_multiprocessing.py ________________
tests/test_multiprocessing.py:3: in <module>
    from ocrmypdf.pipeline import JobContext, JobContextManager
ocrmypdf/pipeline.py:21: in <module>
    from . import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
________________ ERROR collecting tests/test_multiprocessing.py ________________
tests/test_multiprocessing.py:3: in <module>
    from ocrmypdf.pipeline import JobContext, JobContextManager
ocrmypdf/pipeline.py:21: in <module>
    from . import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
________________ ERROR collecting tests/test_multiprocessing.py ________________
tests/test_multiprocessing.py:3: in <module>
    from ocrmypdf.pipeline import JobContext, JobContextManager
ocrmypdf/pipeline.py:21: in <module>
    from . import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
_____________________ ERROR collecting tests/test_main.py ______________________
tests/test_main.py:11: in <module>
    from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
________________ ERROR collecting tests/test_multiprocessing.py ________________
tests/test_multiprocessing.py:3: in <module>
    from ocrmypdf.pipeline import JobContext, JobContextManager
ocrmypdf/pipeline.py:21: in <module>
    from . import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
___________________ ERROR collecting tests/test_userunit.py ____________________
tests/test_userunit.py:11: in <module>
    from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
___________________ ERROR collecting tests/test_userunit.py ____________________
tests/test_userunit.py:11: in <module>
    from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
___________________ ERROR collecting tests/test_userunit.py ____________________
tests/test_userunit.py:11: in <module>
    from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
___________________ ERROR collecting tests/test_userunit.py ____________________
tests/test_userunit.py:11: in <module>
    from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
    lept = ffi.dlopen(find_library('lept'))
E   OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
=================================== FAILURES ===================================
______________________________ test_oem_on_tess3 _______________________________
[gw3] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m

resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
no_outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_oem_on_tess30/no_output.pdf'

    def test_oem_on_tess3(resources, no_outpdf):
        p, _, err = pytest.helpers.run_ocrmypdf(
            resources / 'aspect.pdf',
            no_outpdf, '--tesseract-oem', '1')

>       assert p.returncode == ExitCode.ok
E       assert 1 == <ExitCode.ok: 0>
E        +  where 1 = <subprocess.Popen object at 0x7fffe4033240>.returncode
E        +  and   <ExitCode.ok: 0> = ExitCode.ok

tests/test_tess3.py:39: AssertionError
_______________________ test_textonly_pdf_on_older_tess3 _______________________
[gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m

resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
no_outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_textonly_pdf_on_older_tes0/no_output.pdf'

    @pytest.mark.skipif(tesseract.has_textonly_pdf(),
                        reason="check that missing dep is reported on old tess3")
    def test_textonly_pdf_on_older_tess3(resources, no_outpdf):
        p, _, _ = pytest.helpers.run_ocrmypdf(
            resources / 'linn.pdf',
            no_outpdf, '--pdf-renderer', 'sandwich')

>       assert p.returncode == ExitCode.missing_dependency
E       assert 1 == <ExitCode.missing_dependency: 3>
E        +  where 1 = <subprocess.Popen object at 0x7fffe632fcf8>.returncode
E        +  and   <ExitCode.missing_dependency: 3> = ExitCode.missing_dependency

tests/test_tess3.py:21: AssertionError
=========== 2 failed, 8 passed, 6 skipped, 12 error in 1.53 seconds ============
builder for ‘/nix/store/p2z58fm2wpai0k5s89ppi6ns3lkdzh9d-OCRmyPDF-5.4.3.drv’ failed with exit code 1

Also, leptonica is defined as a dependency of tesseract. Adding it to OCRmyPDF shouldn't produce different results, I think.

https://github.com/NixOS/nixpkgs/blob/master/pkgs/applications/graphics/tesseract/default.nix

Anyway, I have now removed leptonica from OCRmyPDF again and set doCheck to false. It installed fine, and a test run also seems fine:

[root@nixos:/tankJL/opt]# ocrmypdf --rotate-pages -l deu test.pdf output.pdf
   INFO -    2: page is facing ⇧, confidence 5.65 - no change
   INFO -    1: page is facing ⇧, confidence 7.81 - no change
   INFO -    4: page is facing ⇧, confidence 10.07 - no change
   INFO -    3: page is facing ⇧, confidence 13.32 - no change
   INFO -    6: page is facing ⇧, confidence 10.41 - no change
   INFO -    5: page is facing ⇧, confidence 10.50 - no change
   INFO -    8: page is facing ⇧, confidence 14.63 - rotation appears correct
   INFO -    7: page is facing ⇧, confidence 14.18 - rotation appears correct
   INFO -   12: [tesseract] Too few characters. Skipping this page
  ERROR -   12: [tesseract] Error during processing.
   INFO -   12: page is facing ⇧, confidence 0.00 - no change
   INFO -   11: page is facing ⇧, confidence 9.49 - no change
   INFO -    9: page is facing ⇧, confidence 12.77 - no change
   INFO -   10: page is facing ⇧, confidence 14.84 - rotation appears correct
   INFO -   13: page is facing ⇧, confidence 12.63 - no change
   INFO -   14: page is facing ⇧, confidence 12.60 - no change
   INFO -   17: page is facing ⇧, confidence 1.71 - no change
   INFO -   15: page is facing ⇧, confidence 9.40 - no change
   INFO -   16: page is facing ⇧, confidence 12.01 - no change
   INFO -   18: page is facing ⇧, confidence 12.24 - no change
   INFO -   19: page is facing ⇧, confidence 17.98 - rotation appears correct
   INFO -   20: page is facing ⇧, confidence 12.58 - no change
   INFO -   21: page is facing ⇧, confidence 7.41 - no change
   INFO -   22: page is facing ⇧, confidence 11.30 - no change
   INFO -   23: page is facing ⇧, confidence 8.56 - no change
   INFO -   24: page is facing ⇧, confidence 11.08 - no change
   INFO -   25: page is facing ⇧, confidence 6.31 - no change
   INFO -   26: page is facing ⇧, confidence 6.04 - no change
   INFO -   27: page is facing ⇧, confidence 6.82 - no change
   INFO -   28: page is facing ⇧, confidence 7.70 - no change
   INFO -   29: page is facing ⇧, confidence 8.18 - no change
   INFO -   30: page is facing ⇧, confidence 6.42 - no change
   INFO -   31: page is facing ⇧, confidence 8.42 - no change
   INFO -   32: page is facing ⇧, confidence 9.55 - no change
WARNING -   12: [tesseract] unsure about page orientation
WARNING -   30: [tesseract] lots of diacritics - possibly poor OCR
WARNING -   25: [tesseract] lots of diacritics - possibly poor OCR
WARNING -   32: [tesseract] lots of diacritics - possibly poor OCR
WARNING -   31: [tesseract] lots of diacritics - possibly poor OCR
   INFO - Output file is a PDF/A-2B (as expected)

Tested the output in Okular and recognition works really, really well :)

Still bothered about the tests though :)

One question: Is there also an option to rotate the page but not do OCR? It seems from the output above that this is a 2-step process. Sometimes I'd just like to rotate things correctly but don't need OCR.

jbarlow83 commented 6 years ago

--tesseract-timeout 0 will do whatever image processing (including page rotation) you want but skip OCR.

I did a search for "find_library nixos" and found this open issue. https://github.com/NixOS/nixpkgs/issues/7307 Yikes. Basically find_library() is broken on NixOS and you need to replace it with a hardcoded path. It probably is better to make the liblept dependency explicit and do this. This patching should be done prior to building/running compile_leptonica.py.
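A rough Python sketch of the "fall back to a hardcoded path" idea. It uses `ctypes` from the standard library as a stand-in for CFFI's `ffi.dlopen`, and `candidate_paths` and `load_first` are hypothetical helpers, not OCRmyPDF code. It tries whatever `find_library()` reports first and then a fixed path, since on NixOS the reported name may not actually be loadable.

```python
import ctypes
from ctypes.util import find_library


def candidate_paths(name, hardcoded_path, finder=find_library):
    """List library names to try: the finder's answer first, then the fixed path."""
    found = finder(name)
    return [p for p in (found, hardcoded_path) if p]


def load_first(candidates, loader=ctypes.CDLL):
    """Return (path, handle) for the first candidate that actually loads."""
    errors = []
    for path in candidates:
        try:
            return path, loader(path)
        except OSError as exc:
            errors.append(f"{path}: {exc}")
    raise OSError("no candidate could be loaded: " + "; ".join(errors))


if __name__ == "__main__":
    # The fixed path is a placeholder; on NixOS it would be substituted in
    # from the leptonica store path at build time.
    paths = candidate_paths("lept", "/path/to/leptonica/lib/liblept.so.5")
    try:
        print("loaded", load_first(paths)[0])
    except OSError as exc:
        print(exc)
```

Patching the source to a single fixed path, as suggested above, is simpler; the fallback list is only useful if the same code also has to run on systems where `find_library()` works.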

Can you manually run the tests on your installed environment? Meaning $packagemanager install ocrmypdf; python3 -m pytest in a folder that contains unzipped ocrmypdf with the tests/ folder.

sjau commented 6 years ago

Well, I'll have to look into that. I noticed that if I skip doCheck, so that it builds and installs properly, everything seems to work except the --deskew option. Since I need OCR now, I'll already submit it for inclusion with a note that --deskew isn't working (yet).

That way, others can already benefit from it. There was someone in the NixOS IRC channel who was actually interested in an OCR tool :)

Thanks for the support so far.

jbarlow83 commented 6 years ago

--remove-background will also not work if leptonica is not available, and possibly some other features as well. It is well worth running the test suite post install to see what works.

sjau commented 6 years ago

So, I scanned in various documents today and yesterday and it's all been working fine. Still need to fix leptonica though :)

sjau commented 6 years ago

So, I've been trying to debug that leptonica issue.

I have now added leptonica explicitly as a dependency and also changed the find_library call in your leptonica.py to a fixed path:

  postPatch = ''
    substituteInPlace requirements.txt \
      --replace "ruffus == 2.6.3" "ruffus" \
      --replace "Pillow == 4.3.0" "Pillow" \
      --replace "reportlab == 3.4.0" "reportlab" \
      --replace "PyPDF2 == 1.26.0" "PyPDF2" \
      --replace "img2pdf == 0.2.4" "img2pdf" \
      --replace "cffi == 1.11.2" "cffi"
    substituteInPlace test_requirements.txt \
      --replace "pytest >= 3.0" "pytest"
    substituteInPlace ocrmypdf/lib/compile_leptonica.py \
      --replace "©" "(c)"
    substituteInPlace ocrmypdf/leptonica.py \
      --replace "ffi.dlopen(find_library('lept'))" "ffi.dlopen('${leptonica}/lib/liblept.so.5')"
    grep "ffi.dlopen" ocrmypdf/leptonica.py
    sleep 10
    export SETUPTOOLS_SCM_PRETEND_VERSION="${version}"
    export LANG=en_US.UTF-8
    export LC_ALL=en_US.UTF-8
  '';

  buildInputs = [ glibcLocales ];

  checkInputs = [  pytest pytest_xdist pytestcov setuptools_scm pytest-helpers-namespace pytestrunner ];

  #doCheck = false;

  propagatedBuildInputs = [
    ruffus
    pillow
    reportlab
    pypdf2
    img2pdf
    cffi
    unpaper
    ghostscript
    tesseract
    qpdf
    leptonica
  ];

It first outputs lept = ffi.dlopen('/nix/store/cz93v0lipkhk3g90gc738dd6srb76pxa-leptonica-1.74.1/lib/liblept.so.5'), so it seems to have the correct path. But it still ends at 58 failed, 64 passed, like here: https://github.com/jbarlow83/OCRmyPDF/issues/202#issuecomment-346962096

ls -al /nix/store/cz93v0lipkhk3g90gc738dd6srb76pxa-leptonica-1.74.1/lib/liblept.so.5
lrwxrwxrwx 3 root root 16  1. Jan 1970  /nix/store/cz93v0lipkhk3g90gc738dd6srb76pxa-leptonica-1.74.1/lib/liblept.so.5 -> liblept.so.5.0.1

Although it seems the errors are different now. I think it's better to post that wall of error output to a separate paste: https://paste.simplylinux.ch/view/raw/3118365d

jbarlow83 commented 6 years ago

I can only see two unique errors in there.

1.

One error you're getting frequently is #140 which incidentally came up for someone else again. I can't reproduce that, but it seems like the problem is Tesseract's language packs being incompatible with the installed Tesseract binary. At a glance it's possible that Tesseract is not correctly installed on your machine or NixOS in general....

Try again with v5.4.4 since I made a change that should prevent Python from suppressing the error message here.

Specifically the message

builtins.UnicodeDecodeError('utf-8' codec can't decode **byte 0xff in position 168**: invalid start byte)

should be replaced by some kind of error output. Hopefully we'll get something more informative.

2.

The other is Leptonica, still. Here you could try switching to [API binding](https://cffi.readthedocs.io/en/latest/overview.html#real-example-api-level-out-of-line), since CFFI ABI binding can perhaps be written off as not possible because of the open issue with find_library().

The CFFI documentation explains the difference better. In short, I have a feeling that API binding will work around the peculiarities of NixOS more easily, but I think ABI binding is better for most of my users, so you'd have to maintain this as a patch.

I pushed a branch "leptonica-api-binding" that implements the change. Be sure to manually delete ocrmypdf/lib/_leptonica.py and then try it out. The build inputs will need to provide a C compiler and linker plus Leptonica's headers. If it works, you can check the diff against v5.4.4 and make that part of your patching.
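To make the ABI/API distinction a bit more concrete: with ABI-level binding the library is opened and its symbols are looked up at run time, so a missing or wrong library only shows up when a function is first used (compare the earlier `symbol 'pixRead' not found in library '<None>'` error). The sketch below uses `ctypes` from the standard library as a stand-in for CFFI's ABI mode; `has_symbol` is a hypothetical helper, it assumes a POSIX system, and none of it is OCRmyPDF's actual code.

```python
import ctypes


def has_symbol(library_path, symbol):
    """Check at run time whether `symbol` can be resolved in the library.

    Passing None asks the dynamic loader for the symbols already visible in
    the running process (POSIX behaviour; assumes Linux or macOS).
    """
    lib = ctypes.CDLL(library_path)
    try:
        getattr(lib, symbol)
    except AttributeError:
        return False
    return True


if __name__ == "__main__":
    print(has_symbol(None, "strlen"))   # a libc function, normally resolvable
    print(has_symbol(None, "pixRead"))  # Leptonica is not loaded in this process
```

API-level binding, by contrast, compiles a small C extension against Leptonica's headers, so this class of problem tends to surface at build time instead.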

sjau commented 6 years ago

Will try those :) thx

sjau commented 6 years ago

Ok, v5.4.4 fixed some of the errors. I'll have a look at changing the CFFI ABI binding.

============== 57 failed, 72 passed, 7 skipped in 117.49 seconds ===============

sjau commented 6 years ago

I just know too little about Python and the rest to be of any help... I can't figure out the problem.

https://paste.simplylinux.ch/view/raw/6e9c653f

Thanks for the help. As I said, it works fine so far for OCR... deskew and background removal aren't working, but I can live with that.

jbarlow83 commented 6 years ago

Some basic sanity tests are still failing. I wouldn't trust it; consider using the Docker image.

I'll close this for now since the outstanding issues are probably in the NixOS environment. Reopen if you have further questions.

aakropotkin commented 5 years ago

Hey could you send me the nix file? (If you got it working)

sjau commented 5 years ago

https://github.com/sjau/nix-expressions/blob/master/ocrmypdf.nix

It doesn't work on current NixOS anymore since the underlying packages have changed.

Atemu commented 1 year ago

For future reference, ocrmypdf has been packaged and is available in regular Nixpkgs.

sjau commented 1 year ago

> For future reference, ocrmypdf has been packaged and is available in regular Nixpkgs.

👍 Thanks for pointing this out. So far I've just used a container :)