Closed sjau closed 6 years ago
Never heard of NixOS but it looks interesting.
You'll need to set the locale to UTF-8, e.g. the environment variable LANG=C.utf-8, probably for both setup and execution. Python 3 does not work well when the system locale is not Unicode-aware.
On Nov 24, 2017 12:07 PM, "sjau" notifications@github.com wrote:
Hi there
I'm currently trying to write a package file for OCRmyPDF for NixOS. I think I'm already pretty far along, but now I'm stuck on an error that I have no idea how to fix, as it doesn't give any indication of where the problem actually occurs.
Anyway, I do get this error when it's trying to build OCRmyPDF:
building path(s) ‘/nix/store/kdpr7qaz85lrls5mwqyvgrfi5v811i5q-ORCmyPDF-5.4.3’
unpacking sources
unpacking source archive /nix/store/ajl9ibrhpbbrrccnyb7s7rl4ix8w7k48-source
source root is source
setting SOURCE_DATE_EPOCH to timestamp 315619200 of file source/tests/test_userunit.py
patching sources
configuring
building
Skipping external program tests because of --force
Traceback (most recent call last):
File "nix_run_setup.py", line 8, in <module>
exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))
File "setup.py", line 245, in <module>
zip_safe=False)
File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/core.py", line 108, in setup
_setup_distribution = dist = klass(attrs)
File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 338, in __init__
_Distribution.__init__(self, attrs)
File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/dist.py", line 281, in __init__
self.finalize_options()
File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 471, in finalize_options
ep.load()(self, ep.name, value)
File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 188, in cffi_modules
add_cffi_module(dist, cffi_module)
File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 49, in add_cffi_module
execfile(build_file_name, mod_vars)
File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 22, in execfile
src = f.read()
File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
builder for ‘/nix/store/jsnfzz199dy49viv14l1is2i1d2r3lq9-ORCmyPDF-5.4.3.drv’ failed with exit code 1
cannot build derivation ‘/nix/store/niq3y1rw30sqx5gp5jwrd273hlv6xhb2-system-path.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/1l09hph7klh4a3p3rzlpn19qdzggbxdy-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’: 1 dependencies couldn't be built
error: build of ‘/nix/store/1l09hph7klh4a3p3rzlpn19qdzggbxdy-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’ failed
The current nix expression that I use to try to build it looks like:
{ lib, fetchFromGitHub, python3, callPackage, pytest, unpaper, ghostscript, tesseract, qpdf }:
with python3.pkgs;
let
ruffus = callPackage ("/tankJL/opt/ruffus.nix") {};
img2pdf = callPackage ("/tankJL/opt/img2pdf.nix") {};
in
buildPythonApplication rec {
version = "5.4.3";
name = "ORCmyPDF-${version}";
src = fetchFromGitHub {
owner = "jbarlow83";
repo = "OCRmyPDF";
rev = version;
sha256 = "0vnn6g69vkqldbx76llmyz8h9ia7mkxcp290mxdsydy4bjjik6zf";
};
postPatch = ''
substituteInPlace requirements.txt \
--replace "ruffus == 2.6.3" "ruffus" \
--replace "Pillow == 4.3.0" "Pillow" \
--replace "reportlab == 3.4.0" "reportlab" \
--replace "PyPDF2 == 1.26.0" "PyPDF2" \
--replace "img2pdf == 0.2.4" "img2pdf" \
--replace "cffi == 1.11.2" "cffi"
substituteInPlace test_requirements.txt \
--replace "pytest >= 3.0" "pytest"
export SETUPTOOLS_SCM_PRETEND_VERSION="${version}"
'';
buildInputs = [ pytest pytest_xdist pytestcov setuptools_scm ];
propagatedBuildInputs = [
ruffus
pillow
reportlab
pypdf2
img2pdf
cffi
unpaper
ghostscript
tesseract
qpdf
];
meta = {
homepage = https://github.com/jbarlow83/OCRmyPDF;
description = "Adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.";
license = lib.licenses.mit;
maintainers = with lib.maintainers; [ hyper_ch ];
};
}
I understand that there seems to be a problem with one of the files but I can't figure out where the problem actually occurs.
Hi, thanks for the quick answer. I have now tried setting LANG according to your suggestion; however, nothing has changed:
building path(s) ‘/nix/store/afxj0mmia4rnjhimksn8cx7hlizdnj3c-ORCmyPDF-5.4.3’
unpacking sources
unpacking source archive /nix/store/ajl9ibrhpbbrrccnyb7s7rl4ix8w7k48-source
source root is source
setting SOURCE_DATE_EPOCH to timestamp 315619200 of file source/tests/test_userunit.py
patching sources
-------------------------------------------
C.utf-8
-------------------------------------------
configuring
building
Skipping external program tests because of --force
Traceback (most recent call last):
File "nix_run_setup.py", line 8, in <module>
exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
File "setup.py", line 245, in <module>
zip_safe=False)
File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/core.py", line 108, in setup
_setup_distribution = dist = klass(attrs)
File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 338, in __init__
_Distribution.__init__(self, attrs)
File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/distutils/dist.py", line 281, in __init__
self.finalize_options()
File "/nix/store/c08inn71kyzh6ambhh4b3q3h8cbfbfw5-python3.6-bootstrapped-pip-9.0.1/lib/python3.6/site-packages/setuptools/dist.py", line 471, in finalize_options
ep.load()(self, ep.name, value)
File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 188, in cffi_modules
add_cffi_module(dist, cffi_module)
File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 49, in add_cffi_module
execfile(build_file_name, mod_vars)
File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 22, in execfile
src = f.read()
File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
builder for ‘/nix/store/6mvi91imkrlb2dbgqil0zx2h2m9avz0y-ORCmyPDF-5.4.3.drv’ failed with exit code 1
cannot build derivation ‘/nix/store/wncx82hszxhmpcxwgh3zm57d65dsf7iy-system-path.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/5f1np8np28rrx8bandg711pcbrs9470a-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’: 1 dependencies couldn't be built
error: build of ‘/nix/store/5f1np8np28rrx8bandg711pcbrs9470a-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’ failed
{ lib, fetchFromGitHub, python3, callPackage, pytest, unpaper, ghostscript, tesseract, qpdf }:
with python3.pkgs;
let
ruffus = callPackage ("/tankJL/opt/ruffus.nix") {};
img2pdf = callPackage ("/tankJL/opt/img2pdf.nix") {};
in
buildPythonApplication rec {
version = "5.4.3";
name = "ORCmyPDF-${version}";
src = fetchFromGitHub {
owner = "jbarlow83";
repo = "OCRmyPDF";
rev = version;
sha256 = "0vnn6g69vkqldbx76llmyz8h9ia7mkxcp290mxdsydy4bjjik6zf";
};
postPatch = ''
substituteInPlace requirements.txt \
--replace "ruffus == 2.6.3" "ruffus" \
--replace "Pillow == 4.3.0" "Pillow" \
--replace "reportlab == 3.4.0" "reportlab" \
--replace "PyPDF2 == 1.26.0" "PyPDF2" \
--replace "img2pdf == 0.2.4" "img2pdf" \
--replace "cffi == 1.11.2" "cffi"
substituteInPlace test_requirements.txt \
--replace "pytest >= 3.0" "pytest"
export SETUPTOOLS_SCM_PRETEND_VERSION="${version}"
echo "-------------------------------------------"
echo $LANG
export LANG="C.utf-8"
echo $LANG
echo "-------------------------------------------"
'';
buildInputs = [ pytest pytest_xdist pytestcov setuptools_scm ];
propagatedBuildInputs = [
ruffus
pillow
reportlab
pypdf2
img2pdf
cffi
unpaper
ghostscript
tesseract
qpdf
];
meta = {
homepage = https://github.com/jbarlow83/OCRmyPDF;
description = "Adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.";
license = lib.licenses.mit;
maintainers = with lib.maintainers; [ hyper_ch ];
};
}
I also tried en_US.UTF-8 with the same result.
It's possible your package manager changes the LANG environment in a child process and then forks a different child to run Python. See if python3 -c 'import os; print(os.environ['LANG'])' actually reflects the change in locale.
Although maybe there's an explicit command to tell NixOS it needs a UTF-8 locale? I suspect other Py3 packages have dealt with this. The underlying issue is being fixed Python-wide for 3.7.
If you want to try hacking around it, it looks like it is choking on the UTF-8 copyright symbol in ocrmypdf/lib/compile_leptonica.py, so you could replace that symbol so that this one script would execute and setup would likely complete. However, ocrmypdf still needs a UTF-8 locale to do its work, so it would probably fail the test suite if its execution environment is not truly UTF-8.
When you do have it working please submit a PR to update the docs with installation instructions.
Hi
Thanks for the feedback. I have now done this:
echo "-------------------------------------------"
echo $LANG
# python3 -c 'import os; print(os.environ['LANG'])'
python -c 'import sys; print(sys.getdefaultencoding())'
export LANG="en_US.UTF-8"
# python3 -c 'import os; print(os.environ['LANG'])'
python -c 'import sys; print(sys.getdefaultencoding())'
echo $LANG
echo "-------------------------------------------"
Your command produced errors. I also tried changing the quotes but couldn't get it to work; on the internet I found the getdefaultencoding variant instead. I don't know if there's a difference between them.
Anyway, here's the output:
-------------------------------------------
utf-8
utf-8
en_US.UTF-8
-------------------------------------------
Before explicitly setting the LANG environment variable, it doesn't contain any value; however, the getdefaultencoding variant still outputs utf-8. After setting LANG, getdefaultencoding still outputs utf-8 and LANG is en_US.UTF-8.
Also, how did you figure out the problem likely resides in the compile_leptonica.py file?
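On whether the two checks differ: as far as I know they do. On Python 3, sys.getdefaultencoding() is always utf-8 regardless of LANG, so it cannot reveal a locale problem. What open() falls back to when no encoding is given is locale.getpreferredencoding(False). The os.environ one-liner most likely failed only because of the nested single quotes, which the shell strips, leaving os.environ[LANG]. A minimal sketch that reports all three values (the function name encoding_report is made up for illustration):

```python
import locale
import os
import sys


def encoding_report():
    """Collect the values that matter when diagnosing a locale problem."""
    return {
        # Raw environment variable; None if it is unset in this process.
        "LANG": os.environ.get("LANG"),
        # Always 'utf-8' on Python 3, independent of the locale.
        "default_encoding": sys.getdefaultencoding(),
        # What open() uses when no encoding= argument is passed.
        "preferred_encoding": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for key, value in encoding_report().items():
        print(f"{key}: {value}")
```

Saving this to a file and running it from the build phase avoids the shell quoting issue entirely.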
Yes, the problem was the copyright sign. After adding this to the postPatch phase:
substituteInPlace ocrmypdf/lib/compile_leptonica.py \
--replace "©" "(c)"
the build now outputs a different error:
running install tests
Checking for tesseract >= 3.04...
Found tesseract 3.05.00
Checking for gs >= 9.15...
Found gs 9.20
Checking for unpaper >= 6.1...
Found unpaper 6.1
Checking for qpdf >= 5.1.1...
Found qpdf 7.0.0
usage: nix_run_setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: nix_run_setup.py --help [cmd1 cmd2 ...]
or: nix_run_setup.py --help-commands
or: nix_run_setup.py cmd --help
error: invalid command 'pytest'
builder for ‘/nix/store/hy1fdjchpd25mggdk4pf6q0c0lvbyd6x-ORCmyPDF-5.4.3.drv’ failed with exit code 1
cannot build derivation ‘/nix/store/8kj6lik5cpx0archdqcn4m1zm6an7z6a-system-path.drv’: 1 dependencies couldn't be built
cannot build derivation ‘/nix/store/5zv7kr2mnza8bpdia7yj44qlxk91m6lx-nixos-system-nixos-18.03pre120831.cce47a6bf5.drv’: 1 dependencies couldn't be built
Let me see if I can get that to work :)
So, since I "fixed" it, I think this bug report can be closed. For me there doesn't seem to be a need to replace the copyright sign in your source.
error: invalid command 'pytest'
This can be fixed by installing pytest-runner prior to trying to run the command python3 setup.py test.
Strictly speaking, pytest-runner should be added to setup_requires in setup.py. I will add that, but you can do it explicitly too.
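For readers unfamiliar with setup_requires, here is a rough sketch of what such a declaration could look like. It is not the project's actual setup.py: the package name and version are placeholders, and only the setup_requires and tests_require keys are the point. Newer setuptools releases have deprecated the setup.py test command, so this applies to the tooling used in this thread.

```python
# Hypothetical keyword arguments for setuptools.setup(); the name and
# version are placeholders, not values taken from the real project.
SETUP_KWARGS = dict(
    name="example-package",
    version="0.0.0",
    # Packages needed by setup.py itself. pytest-runner provides the
    # 'pytest' command that 'setup.py test' is aliased to.
    setup_requires=["pytest-runner"],
    # Packages needed only while running the test suite.
    tests_require=["pytest"],
)

# In a real setup.py this dict would be passed on as:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
```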
Watch out for the import order here. Are you running the tests before or after install? It's probably better to install, mv ocrmypdf src in the directory that contains setup.py, and then run the test suite. This will ensure the installed version gets picked up instead of the uninstalled files in the local folder. (It turns out that it's better practice to have a top-level src/ folder to avoid this problem, at least for packages like mine that can't be tested before installation, but I don't want to change it until a major release since it will inconvenience all downstream maintainers.)
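One way to see which copy of a package the tests would pick up is to ask the import system where it resolves the name. A small sketch using only the standard library (module_origin is an illustrative helper name, not part of ocrmypdf):

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # A path inside the source checkout means the uninstalled files win;
    # a path under site-packages means the installed version is used.
    print(module_origin("ocrmypdf"))
```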
It might be helpful to look at what the Debian maintainer did. Some of these problems are similar. https://git.spwhitton.name/ocrmypdf/tree/debian?h=debian
Also, how did you figure out the problem likely resides in the compile_leptonica.py file?
File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 188, in cffi_modules
add_cffi_module(dist, cffi_module)
File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 49, in add_cffi_module
execfile(build_file_name, mod_vars)
File "/nix/store/snn4yrd7kqhwb9l0j16i9lsl8jh1hibd-python3.6-cffi-1.11.2/lib/python3.6/site-packages/cffi/setuptools_ext.py", line 22, in execfile
src = f.read()
File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
The clues in the stack trace were: this has to do with building cffi modules, and I only have one of those, leptonica. I looked up the source for cffi/setuptools_ext.py, and this is where it executes the package's compile script.
It failed at position 2. What's at byte offset 2 in that file? A non-ASCII symbol.
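The byte value fits as well: the copyright sign encodes in UTF-8 as the two bytes 0xC2 0xA9, so an ASCII decoder stops at the 0xC2. The sketch below reproduces the same position and byte with a made-up header line; the actual first line of compile_leptonica.py is an assumption here.

```python
def first_non_ascii(data):
    """Return (offset, byte) of the first byte ASCII cannot decode, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


# Hypothetical file header: '#', a space, then the copyright sign.
header = "# \u00a9 2017 Example Author\n".encode("utf-8")

offset, byte = first_non_ascii(header)
print(offset, hex(byte))  # prints: 2 0xc2
```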
Ok, adding pytest-runner helps. It wasn't listed here: https://github.com/jbarlow83/OCRmyPDF/blob/master/test_requirements.txt :)
As for the order: I don't know for sure. It seems the tests are run before installation - I get tons of AssertionErrors
There are tons more... I just copied what I could from terminal output.
Input/Output files are still from the /tmp folder where things got downloaded. Also the called files like File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/__main__.py", line 53, in <module> indicate that the tests are done before installation.
I think I found the problem, but not sure how to solve it :) I added this to my nix file:
echo "-------------------------------------------"
echo $LANG
python -c 'import sys; import locale; import codecs; print(locale.getpreferredencoding())'
python -c 'import sys; import locale; import codecs; print(codecs.lookup(locale.getpreferredencoding()).name)'
echo "-------------------------------------------"
And I got this output
-------------------------------------------
ANSI_X3.4-1968
ascii
-------------------------------------------
Even adding
export LANG="utf-8"
doesn't help.
Have a look here: https://github.com/NixOS/nixpkgs/blob/master/doc/languages-frameworks/python.md
Under the heading "automatic tests" it suggests:
Unicode issues can typically be fixed by including glibcLocales in buildInputs and exporting LC_ALL=en_US.utf-8.
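A plausible reading of why exporting LANG alone changed nothing is that the locale data itself was missing from the build environment, so the C library fell back to the plain C locale (which reports ANSI_X3.4-1968). That is an assumption, not something the logs prove. A small sketch to test whether a named locale can actually be activated (locale_is_available is an illustrative helper name):

```python
import locale


def locale_is_available(name):
    """Return True if the C library can switch LC_CTYPE to the named locale."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        # The locale name is unknown or its data is not installed.
        return False
    else:
        return True
    finally:
        # Restore whatever was active before the probe.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    for candidate in ("C", "en_US.UTF-8", "utf-8"):
        print(candidate, locale_is_available(candidate))
```

If en_US.UTF-8 is reported as unavailable before glibcLocales is added and as available afterwards, that would support this reading.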
Cool :) Pretty sure you now know Python on NixOS better than I do. Adding glibcLocales helped, but there are still plenty of errors left.
=============== 58 failed, 64 passed, 6 skipped in 93.94 seconds ===============
Before it was 111 failed.
There seem to be a few like this:
[gw3] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
spoof_tesseract_cache = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
poster = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/poster.pdf')
outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_rotate_interaction0/out.pdf'
def test_rotate_interaction(spoof_tesseract_cache, poster, outpdf):
check_ocrmypdf(poster, outpdf, '--output-type=pdf',
'--rotate-pages',
> env=spoof_tesseract_cache)
tests/test_userunit.py:45:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/poster.pdf')
output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_rotate_interaction0/out.pdf'
env = {'AR': 'ar', 'AS': 'as', 'CC': 'gcc', 'CONFIG_SHELL': '/nix/store/4ada72n7785wwazv42fhsnxjvilaa3aj-bash-4.4-p12/bin/bash', ...}
args = ('--output-type=pdf', '--rotate-pages')
p = <subprocess.Popen object at 0x7fffe3ed5390>, out = ''
@pytest.helpers.register
def check_ocrmypdf(input_file, output_file, *args, env=None):
"Run ocrmypdf and confirmed that a valid file was created"
p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env)
#print(err) # ensure py.test collects the output, use -s to view
> assert p.returncode == 0, "<stderr>\n" + err + "\n</stderr>"
E AssertionError: <stderr>
E ERROR - Traceback (most recent call last):
E File "/nix/store/m9drfazy2zqrlw9l6pgz913x1xwyzzsl-python3.6-ruffus-2.6.3/lib/python3.6/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
E register_cleanup, touch_files_only)
E File "/nix/store/m9drfazy2zqrlw9l6pgz913x1xwyzzsl-python3.6-ruffus-2.6.3/lib/python3.6/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
E ret_val = user_defined_work_func(*params)
E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/pipeline.py", line 359, in orient_page
E log=log)
E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/exec/tesseract.py", line 124, in get_orientation
E universal_newlines=True, timeout=timeout)
E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/subprocess.py", line 336, in check_output
E **kwargs).stdout
E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/subprocess.py", line 405, in run
E stdout, stderr = process.communicate(input, timeout=timeout)
E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/subprocess.py", line 843, in communicate
E stdout, stderr = self._communicate(input, endtime, timeout)
E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/subprocess.py", line 1554, in _communicate
E self.stdout.errors)
E File "/nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/lib/python3.6/subprocess.py", line 740, in _translate_newlines
E data = data.decode(encoding, errors)
E UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 168: invalid start byte
E
E
E </stderr>
E assert 15 == 0
E + where 15 = <subprocess.Popen object at 0x7fffe3ed5390>.returncode
tests/conftest.py:116: AssertionError
and also a few of those:
[gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
renderer = 'tesseract'
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_pagesize_consistency_tess0/out.pdf'
@pytest.mark.parametrize('renderer', RENDERERS)
def test_pagesize_consistency(renderer, resources, outpdf):
from math import isclose
first_page_dimensions = pytest.helpers.first_page_dimensions
infile = resources / 'linn.pdf'
before_dims = first_page_dimensions(infile)
check_ocrmypdf(
infile,
outpdf, '--pdf-renderer', renderer,
> '--clean', '--deskew', '--remove-background', '--clean-final')
tests/test_main.py:843:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input_file = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources/linn.pdf')
output_file = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_pagesize_consistency_tess0/out.pdf'
env = None
args = ('--pdf-renderer', 'tesseract', '--clean', '--deskew', '--remove-background', '--clean-final')
p = <subprocess.Popen object at 0x7fffe69fccc0>, out = ''
@pytest.helpers.register
def check_ocrmypdf(input_file, output_file, *args, env=None):
"Run ocrmypdf and confirmed that a valid file was created"
p, out, err = run_ocrmypdf(input_file, output_file, *args, env=env)
#print(err) # ensure py.test collects the output, use -s to view
> assert p.returncode == 0, "<stderr>\n" + err + "\n</stderr>"
E AssertionError: <stderr>
E INFO - 0: background removal skipped on mono page
E
E ERROR - Traceback (most recent call last):
E File "/nix/store/m9drfazy2zqrlw9l6pgz913x1xwyzzsl-python3.6-ruffus-2.6.3/lib/python3.6/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
E register_cleanup, touch_files_only)
E File "/nix/store/m9drfazy2zqrlw9l6pgz913x1xwyzzsl-python3.6-ruffus-2.6.3/lib/python3.6/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
E ret_val = user_defined_work_func(*params)
E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/pipeline.py", line 475, in preprocess_deskew
E leptonica.deskew(input_file, output_file, dpi)
E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/leptonica.py", line 522, in deskew
E pix_source = Pix.read(infile)
E File "/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/ocrmypdf/leptonica.py", line 220, in read
E return cls(lept.pixRead(os.fsencode(filename)))
E ffi.error: symbol 'pixRead' not found in library '<None>': /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m: undefined symbol: pixRead
E
E
E </stderr>
E assert 15 == 0
E + where 15 = <subprocess.Popen object at 0x7fffe69fccc0>.returncode
tests/conftest.py:116: AssertionError
Why is it so picky about encodings?
ocrmypdf and tesseract both work with about 100 languages so encodings have to be respected.
I don't think that's an encoding problem, though. '0xff' does not appear in valid UTF-8 streams. Something else went wrong that produced a '0xff' in Tesseract's stdout, and that caused an encoding error. The assert that triggered that error message indicates Tesseract returned nonzero (it probably crashed and corrupted its stdout).
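To illustrate the point about 0xff: no character encodes to that byte in UTF-8, so a strict decoder rejects it wherever it appears. The sketch below uses made-up sample data with the bad byte placed at offset 168 to mirror the log; it is not Tesseract's real output.

```python
def first_invalid_utf8(data):
    """Return (offset, byte, reason) for the first undecodable byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start], exc.reason
    return None


# Made-up stand-in for a corrupted stdout: 168 ASCII bytes, then 0xff.
sample = b"a" * 168 + b"\xff"
print(first_invalid_utf8(sample))  # prints: (168, 255, 'invalid start byte')
```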
A few thoughts:
Can you create a virtual environment, pip install ocrmypdf into it (i.e. installing from the wheel), and run pytest on a folder containing the test suite? That would confirm, independent of the package manager, that the system has what it needs to run ocrmypdf.
Try doCheck = false; to disable the test suite. See if you can install and run the test suite manually. If that works, you could run the test suite in a "postInstall hook" that calls pytest directly. At least that separates out any pre-install weirdness.
The 'pixRead' error is new to me. It looks like it could not find the leptonica library at all. Maybe leptonica (liblept) needs to be explicitly named as a dependency (rather than being picked up through tesseract).
The directories tests/cache and tests/output need to be writable.
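The wording library '<None>' in the pixRead error may be a hint. ctypes.util.find_library returns None when it cannot locate a library, and cffi's ffi.dlopen(None) opens the process's own C namespace instead of failing, so a missing liblept could surface only later as an undefined symbol. Whether that is what happened here is an assumption. A sketch of a stricter lookup that fails early (require_library is an illustrative helper, not part of ocrmypdf):

```python
from ctypes.util import find_library


def require_library(name):
    """Resolve a shared library name, raising instead of returning None."""
    resolved = find_library(name)
    if resolved is None:
        raise OSError(f"shared library {name!r} could not be located")
    return resolved


if __name__ == "__main__":
    try:
        print(require_library("lept"))
    except OSError as exc:
        print(exc)
```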
Adding leptonica as a build input now produces a whole different bunch of errors. It also aborts after a few tests:
generating cffi module 'ocrmypdf/lib/_leptonica.py'
============================= test session starts ==============================
platform linux -- Python 3.6.3, pytest-3.2.3, py-1.4.34, pluggy-0.4.0
rootdir: /tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source, inifile: setup.cfg
plugins: forked-5.4.3, xdist-1.20.1, cov-2.4.0, helpers-namespace-2017.11.11
gw0 I / gw1 I / gw2 I / gw3 I
gw0 [16] / gw1 [16] / gw2 [16] / gw3 [16]
scheduling tests via LoadScheduling
........ssFFssss
==================================== ERRORS ====================================
_____________________ ERROR collecting tests/test_main.py ______________________
tests/test_main.py:11: in <module>
from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
_____________________ ERROR collecting tests/test_main.py ______________________
tests/test_main.py:11: in <module>
from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
_____________________ ERROR collecting tests/test_main.py ______________________
tests/test_main.py:11: in <module>
from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
________________ ERROR collecting tests/test_multiprocessing.py ________________
tests/test_multiprocessing.py:3: in <module>
from ocrmypdf.pipeline import JobContext, JobContextManager
ocrmypdf/pipeline.py:21: in <module>
from . import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
________________ ERROR collecting tests/test_multiprocessing.py ________________
tests/test_multiprocessing.py:3: in <module>
from ocrmypdf.pipeline import JobContext, JobContextManager
ocrmypdf/pipeline.py:21: in <module>
from . import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
________________ ERROR collecting tests/test_multiprocessing.py ________________
tests/test_multiprocessing.py:3: in <module>
from ocrmypdf.pipeline import JobContext, JobContextManager
ocrmypdf/pipeline.py:21: in <module>
from . import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
_____________________ ERROR collecting tests/test_main.py ______________________
tests/test_main.py:11: in <module>
from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
________________ ERROR collecting tests/test_multiprocessing.py ________________
tests/test_multiprocessing.py:3: in <module>
from ocrmypdf.pipeline import JobContext, JobContextManager
ocrmypdf/pipeline.py:21: in <module>
from . import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
___________________ ERROR collecting tests/test_userunit.py ____________________
tests/test_userunit.py:11: in <module>
from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
___________________ ERROR collecting tests/test_userunit.py ____________________
tests/test_userunit.py:11: in <module>
from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
___________________ ERROR collecting tests/test_userunit.py ____________________
tests/test_userunit.py:11: in <module>
from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
___________________ ERROR collecting tests/test_userunit.py ____________________
tests/test_userunit.py:11: in <module>
from ocrmypdf import leptonica
ocrmypdf/leptonica.py:19: in <module>
lept = ffi.dlopen(find_library('lept'))
E OSError: cannot load library 'liblept.so.5': liblept.so.5: cannot open shared object file: No such file or directory
=================================== FAILURES ===================================
______________________________ test_oem_on_tess3 _______________________________
[gw3] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
no_outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw3/test_oem_on_tess30/no_output.pdf'
def test_oem_on_tess3(resources, no_outpdf):
p, _, err = pytest.helpers.run_ocrmypdf(
resources / 'aspect.pdf',
no_outpdf, '--tesseract-oem', '1')
> assert p.returncode == ExitCode.ok
E assert 1 == <ExitCode.ok: 0>
E + where 1 = <subprocess.Popen object at 0x7fffe4033240>.returncode
E + and <ExitCode.ok: 0> = ExitCode.ok
tests/test_tess3.py:39: AssertionError
_______________________ test_textonly_pdf_on_older_tess3 _______________________
[gw1] linux -- Python 3.6.3 /nix/store/166s7l3yjqfc8dj5hfqjb09dbfvp1850-python3-3.6.3/bin/python3.6m
resources = PosixPath('/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/source/tests/resources')
no_outpdf = '/tmp/nix-build-OCRmyPDF-5.4.3.drv-0/pytest-of-nixbld1/pytest-0/popen-gw1/test_textonly_pdf_on_older_tes0/no_output.pdf'
@pytest.mark.skipif(tesseract.has_textonly_pdf(),
reason="check that missing dep is reported on old tess3")
def test_textonly_pdf_on_older_tess3(resources, no_outpdf):
p, _, _ = pytest.helpers.run_ocrmypdf(
resources / 'linn.pdf',
no_outpdf, '--pdf-renderer', 'sandwich')
> assert p.returncode == ExitCode.missing_dependency
E assert 1 == <ExitCode.missing_dependency: 3>
E + where 1 = <subprocess.Popen object at 0x7fffe632fcf8>.returncode
E + and <ExitCode.missing_dependency: 3> = ExitCode.missing_dependency
tests/test_tess3.py:21: AssertionError
=========== 2 failed, 8 passed, 6 skipped, 12 error in 1.53 seconds ============
builder for ‘/nix/store/p2z58fm2wpai0k5s89ppi6ns3lkdzh9d-OCRmyPDF-5.4.3.drv’ failed with exit code 1
Also, leptonica is defined as a dependency of tesseract. Adding it to OCRmyPDF shouldn't produce different results, I think.
https://github.com/NixOS/nixpkgs/blob/master/pkgs/applications/graphics/tesseract/default.nix
Anyway, I have now removed leptonica from OCRmyPDF again and set doCheck to false. It installed fine, and a test run seems fine too:
[root@nixos:/tankJL/opt]# ocrmypdf --rotate-pages -l deu test.pdf output.pdf
INFO - 2: page is facing ⇧, confidence 5.65 - no change
INFO - 1: page is facing ⇧, confidence 7.81 - no change
INFO - 4: page is facing ⇧, confidence 10.07 - no change
INFO - 3: page is facing ⇧, confidence 13.32 - no change
INFO - 6: page is facing ⇧, confidence 10.41 - no change
INFO - 5: page is facing ⇧, confidence 10.50 - no change
INFO - 8: page is facing ⇧, confidence 14.63 - rotation appears correct
INFO - 7: page is facing ⇧, confidence 14.18 - rotation appears correct
INFO - 12: [tesseract] Too few characters. Skipping this page
ERROR - 12: [tesseract] Error during processing.
INFO - 12: page is facing ⇧, confidence 0.00 - no change
INFO - 11: page is facing ⇧, confidence 9.49 - no change
INFO - 9: page is facing ⇧, confidence 12.77 - no change
INFO - 10: page is facing ⇧, confidence 14.84 - rotation appears correct
INFO - 13: page is facing ⇧, confidence 12.63 - no change
INFO - 14: page is facing ⇧, confidence 12.60 - no change
INFO - 17: page is facing ⇧, confidence 1.71 - no change
INFO - 15: page is facing ⇧, confidence 9.40 - no change
INFO - 16: page is facing ⇧, confidence 12.01 - no change
INFO - 18: page is facing ⇧, confidence 12.24 - no change
INFO - 19: page is facing ⇧, confidence 17.98 - rotation appears correct
INFO - 20: page is facing ⇧, confidence 12.58 - no change
INFO - 21: page is facing ⇧, confidence 7.41 - no change
INFO - 22: page is facing ⇧, confidence 11.30 - no change
INFO - 23: page is facing ⇧, confidence 8.56 - no change
INFO - 24: page is facing ⇧, confidence 11.08 - no change
INFO - 25: page is facing ⇧, confidence 6.31 - no change
INFO - 26: page is facing ⇧, confidence 6.04 - no change
INFO - 27: page is facing ⇧, confidence 6.82 - no change
INFO - 28: page is facing ⇧, confidence 7.70 - no change
INFO - 29: page is facing ⇧, confidence 8.18 - no change
INFO - 30: page is facing ⇧, confidence 6.42 - no change
INFO - 31: page is facing ⇧, confidence 8.42 - no change
INFO - 32: page is facing ⇧, confidence 9.55 - no change
WARNING - 12: [tesseract] unsure about page orientation
WARNING - 30: [tesseract] lots of diacritics - possibly poor OCR
WARNING - 25: [tesseract] lots of diacritics - possibly poor OCR
WARNING - 32: [tesseract] lots of diacritics - possibly poor OCR
WARNING - 31: [tesseract] lots of diacritics - possibly poor OCR
INFO - Output file is a PDF/A-2B (as expected)
Tested the output in Okular and recognition works really, really well :)
Still bothered about the tests though :)
One question: Is there also an option to rotate the page but not do OCR? It seems from the output above that this is a 2-step process. Sometimes I'd just like to rotate things correctly but don't need OCR.
--tesseract-timeout 0 will do whatever image processing (including page rotation) you want but skip OCR.
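If you want to script that, here is a minimal sketch. The helper name rotate_only_command and the file names are made up for illustration; only --rotate-pages and --tesseract-timeout come from this thread, and the helper just assembles the argument list without running anything.

```python
import shlex


def rotate_only_command(input_pdf, output_pdf):
    """Build an ocrmypdf command line that rotates pages but skips OCR.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    return [
        "ocrmypdf",
        "--rotate-pages",            # fix page orientation
        "--tesseract-timeout", "0",  # give Tesseract no time, so OCR is skipped
        input_pdf,
        output_pdf,
    ]


cmd = rotate_only_command("test.pdf", "output.pdf")
# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be handed to subprocess.run(cmd) once ocrmypdf is on the PATH.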
I did a search for "find_library nixos" and found this open issue.
https://github.com/NixOS/nixpkgs/issues/7307
Yikes. Basically find_library() is broken on NixOS and you need to replace it with a hardcoded path. It is probably better to make the liblept dependency explicit and do this. This patching should be done prior to building/running compile_leptonica.py.
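A rough sketch of what such a fallback could look like, assuming find_library() returns None when it cannot locate the library. The function name locate_liblept, the LEPTONICA_LIB environment variable, and the example path are all hypothetical; none of them are part of OCRmyPDF.

```python
import os
from ctypes.util import find_library


def locate_liblept(fallback=None):
    """Return a name or path for Leptonica that can be passed to dlopen.

    Tries find_library() first. If that returns None (as reported on NixOS),
    falls back to an explicit path. LEPTONICA_LIB is a hypothetical
    environment variable used only in this sketch.
    """
    found = find_library("lept")
    if found:
        return found
    fallback = fallback or os.environ.get("LEPTONICA_LIB")
    if fallback:
        return fallback
    raise RuntimeError("liblept not found; provide an explicit path")


# The path below is a placeholder, not a real store path.
print(locate_liblept("/path/to/leptonica/lib/liblept.so.5"))
```

In a Nix expression the explicit path would more likely be patched in at build time, so treat this only as an illustration of the lookup order.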
Can you manually run the tests on your installed environment? Meaning $packagemanager install ocrmypdf; python3 -m pytest in a folder that contains unzipped ocrmypdf with the tests/ folder.
Well, I'll have to look into that. I noticed that if I skip the doCheck and hence allow it to build and install properly, all seems to work except the --deskew option. Since I need OCR now, I'll submit it already for inclusion with a note that --deskew isn't working (yet).
That way, others could already profit from it. There was someone in the nixos irc channel that actually was interested in an OCR tool :)
Thanks for the support so far.
--remove-background will also not work if leptonica is not available, and maybe in some other cases too. It is well worth running the test suite post install to see what works.
So, scanned in various documents today and yesterday and it's been working all fine. Still need to fix leptonica though :)
So, been trying to debug that leptonica issue.
I have now added leptonica explicitly as a dependency and also changed the find_library call in your leptonica.py to a fixed path:
postPatch = ''
  substituteInPlace requirements.txt \
    --replace "ruffus == 2.6.3" "ruffus" \
    --replace "Pillow == 4.3.0" "Pillow" \
    --replace "reportlab == 3.4.0" "reportlab" \
    --replace "PyPDF2 == 1.26.0" "PyPDF2" \
    --replace "img2pdf == 0.2.4" "img2pdf" \
    --replace "cffi == 1.11.2" "cffi"
  substituteInPlace test_requirements.txt \
    --replace "pytest >= 3.0" "pytest"
  substituteInPlace ocrmypdf/lib/compile_leptonica.py \
    --replace "©" "(c)"
  substituteInPlace ocrmypdf/leptonica.py \
    --replace "ffi.dlopen(find_library('lept'))" "ffi.dlopen('${leptonica}/lib/liblept.so.5')"
  grep "ffi.dlopen" ocrmypdf/leptonica.py
  sleep 10
  export SETUPTOOLS_SCM_PRETEND_VERSION="${version}"
  export LANG=en_US.UTF-8
  export LC_ALL=en_US.UTF-8
'';
buildInputs = [ glibcLocales ];
checkInputs = [ pytest pytest_xdist pytestcov setuptools_scm pytest-helpers-namespace pytestrunner ];
#doCheck = false;
propagatedBuildInputs = [
  ruffus
  pillow
  reportlab
  pypdf2
  img2pdf
  cffi
  unpaper
  ghostscript
  tesseract
  qpdf
  leptonica
];
It outputs first
lept = ffi.dlopen('/nix/store/cz93v0lipkhk3g90gc738dd6srb76pxa-leptonica-1.74.1/lib/liblept.so.5')
so it seems to have the correct path. But it still ends at the 58 failed, 64 passed, like here: https://github.com/jbarlow83/OCRmyPDF/issues/202#issuecomment-346962096
ls -al /nix/store/cz93v0lipkhk3g90gc738dd6srb76pxa-leptonica-1.74.1/lib/liblept.so.5
lrwxrwxrwx 3 root root 16 1. Jan 1970 /nix/store/cz93v0lipkhk3g90gc738dd6srb76pxa-leptonica-1.74.1/lib/liblept.so.5 -> liblept.so.5.0.1
Although it seems the errors are different now. I think it's better to post that wall of error output to a separate paste: https://paste.simplylinux.ch/view/raw/3118365d
I can only see two unique errors in there.
1.
One error you're getting frequently is #140 which incidentally came up for someone else again. I can't reproduce that, but it seems like the problem is Tesseract's language packs being incompatible with the installed Tesseract binary. At a glance it's possible that Tesseract is not correctly installed on your machine or NixOS in general....
Try again with v5.4.4 since I made a change that should prevent Python from suppressing the error message here.
Specifically the message
builtins.UnicodeDecodeError('utf-8' codec can't decode **byte 0xff in position 168**: invalid start byte)
should be replaced by some kind of error output. Hopefully we'll get something more informative.
2.
The other is Leptonica, still. Here you could try switching to [API binding](https://cffi.readthedocs.io/en/latest/overview.html#real-example-api-level-out-of-line) since perhaps we can write off CFFI ABI binding as not possible due to an open issue.
The CFFI documentation explains the difference better. In short I have a feeling that API binding will work around the peculiarities of NixOS more easily, but I think ABI binding is better for most of my users so you'd have to maintain this as a patch.
I pushed a branch "leptonica-api-binding" that implements the change. Be sure to manually delete ocrmypdf/lib/_leptonica.py and then try it out. The build inputs will need to request a C compiler + linker and leptonica's headers. If it works you can check the diff against v5.4.4 and make that part of your patching.
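On the first error, a small sketch of why a stray byte such as 0xff hides the real message, and one way to keep the output readable. This is an illustration only and not necessarily what v5.4.4 does; the helper name decode_tool_output and the sample bytes are made up.

```python
def decode_tool_output(raw):
    """Decode bytes captured from an external program without raising.

    Bytes that are not valid UTF-8 become U+FFFD instead of triggering
    UnicodeDecodeError, so the rest of the message stays readable.
    """
    return raw.decode("utf-8", errors="replace")


# Made-up sample: readable text followed by one byte that is not valid UTF-8.
sample = b"some tool message \xff"

try:
    sample.decode("utf-8")  # strict decoding raises on the 0xff byte
except UnicodeDecodeError as exc:
    print("strict decoding fails:", exc.reason)

print(decode_tool_output(sample))
```

With the lenient decode, whatever the external program actually printed should survive in the log instead of being replaced by a traceback.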
Will try those :) thx
Ok, v5.4.4 fixed some of the errors. I'll have a look at changing the CFFI ABI binding.
============== 57 failed, 72 passed, 7 skipped in 117.49 seconds ===============
I just know too little about python and stuff to be of any help... can't figure out the problem.
https://paste.simplylinux.ch/view/raw/6e9c653f
Thanks for the help. As said, it works fine so far for OCR... deskew and background removal aren't working but I can live with that.
Some basic sanity tests are still failing. I wouldn't trust it; consider using the Docker image.
I'll close this for now since the outstanding issues are probably in the NixOS environment. Reopen if you have further questions.
Hey could you send me the nix file? (If you got it working)
https://github.com/sjau/nix-expressions/blob/master/ocrmypdf.nix
Doesn't work on current NixOS anymore since the underlying packages have changed.
For future reference, ocrmypdf has been packaged and is available in regular Nixpkgs.
👍 Thanks for pointing this out. So far I've just used a container :)