ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Permission error when running from browser #201

Closed dev-code-davis closed 6 years ago

dev-code-davis commented 6 years ago

Hi, basically I have created a script that launches Ocrmypdf.

$c = 'ocrmypdf -l lav --rotate-pages --pdf-renderer tesseract --output-type pdf --sidecar output.txt input.pdf output.pdf';
exec($c, $output);
print_r($output);

When I call the PHP script from the server itself (php ocr.php), I get the intended result.

However, when I open it in a browser and run it from there, I get the following permission error:

Traceback (most recent call last):
  File "/usr/local/bin/ocrmypdf", line 7, in <module>
    from ocrmypdf.__main__ import run_pipeline
  File "/usr/lib/python3.6/site-packages/ocrmypdf/__init__.py", line 3, in <module>
    import pkg_resources
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3019, in <module>
    @_call_aside
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3003, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3032, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 646, in _build_master
    ws = cls()
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 639, in __init__
    self.add_entry(entry)
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 695, in add_entry
    for dist in find_distributions(entry, True):
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2019, in find_on_path
    path_item, entry, metadata, precedence=DEVELOP_DIST
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2432, in from_location
    py_version=py_version, platform=platform, **kw
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2772, in _reload_version
    md_version = _version_from_file(self._get_metadata(self.PKG_INFO))
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2397, in _version_from_file
    line = next(iter(version_lines), '')
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2565, in _get_metadata
    for line in self.get_metadata_lines(name):
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1467, in get_metadata_lines
    return yield_lines(self.get_metadata(name))
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1463, in get_metadata
    value = self._get(self._fn(self.egg_info, name))
  File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1572, in _get
    with open(path, 'rb') as stream:
PermissionError: [Errno 13] Permission denied: '/usr/lib/python3.6/site-packages/ruffus-2.6.3-py3.6.egg-info/PKG-INFO'

Contents of /usr/local/bin/ocrmypdf:

#!/usr/bin/python3.6

import re
import sys

from ocrmypdf.__main__ import run_pipeline

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(run_pipeline())

OS: CentOS 7.

I'm aware that this may not be strictly an OCRmyPDF issue. But it is quite strange that I continue to get this error even after I (for testing purposes) ran chmod/chown on the whole Python directory to give it more open permissions. My initial impression is that some of those packages require higher user access?

dev-code-davis commented 6 years ago

OK, after 2 days of relentless searching, one of our team's DevOps engineers suggested running setenforce 0, which seems to have worked... It's some kind of CentOS/Red Hat security feature (SELinux).
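For anyone hitting the same symptom (Permission denied even though the file modes look open), here is a minimal diagnostic sketch in Python. It compares the classic Unix permission bits with whether the file can actually be opened, and reports the SELinux mode by reading the selinuxfs flag file. The function names are hypothetical, the default flag path assumes selinuxfs is mounted at /sys/fs/selinux, and os.stat may itself raise if access to the file's attributes is denied. To be meaningful it has to run as the same user and in the same context as the web server process.

```python
import os
import stat


def selinux_mode(enforce_path="/sys/fs/selinux/enforce"):
    """Return 'enforcing', 'permissive', or 'disabled' based on the flag file."""
    try:
        with open(enforce_path) as f:
            flag = f.read().strip()
    except OSError:
        # Flag file missing or unreadable: assume SELinux is not active.
        return "disabled"
    return "enforcing" if flag == "1" else "permissive"


def diagnose(path, enforce_path="/sys/fs/selinux/enforce"):
    """Compare what the mode bits promise with what open() actually allows."""
    mode_bits = os.stat(path).st_mode
    try:
        with open(path, "rb"):
            open_ok = True
    except PermissionError:
        open_ok = False
    return {
        "mode": stat.filemode(mode_bits),                  # e.g. '-rw-r--r--'
        "world_readable": bool(mode_bits & stat.S_IROTH),  # 'other' read bit set?
        "open_ok": open_ok,                                # did open() succeed?
        "selinux": selinux_mode(enforce_path),
    }


if __name__ == "__main__":
    # Path taken from the traceback in this issue.
    target = "/usr/lib/python3.6/site-packages/ruffus-2.6.3-py3.6.egg-info/PKG-INFO"
    if os.path.exists(target):
        print(diagnose(target))
```

If world_readable is true, open_ok is false, and selinux is "enforcing", a mandatory access control denial is a plausible cause, though an unsearchable parent directory would look similar. Note that setenforce 0 switches the whole system to permissive mode rather than fixing access for this one file.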

jbarlow83 commented 6 years ago

I don't recommend placing ocrmypdf on a public-facing web server. PDF is a complex and exploitable file format, and ocrmypdf deliberately uses all available CPU and a lot of temporary storage, and is not necessarily secure against malicious PDFs.


dev-code-davis commented 6 years ago

@jbarlow83 It would be used on an intranet where just a few selected editors will be able to upload those scanned PDFs. What alternative or approach would you suggest? As for resource usage, we could add an additional server just for the OCR task. Basically, we have a Drupal site which uses Solr to index content. We have tackled the task of getting PDF metadata, but scanned documents are still an issue (they need to be OCRed and indexed for search purposes). I have tested a lot of OCR libraries, and to be honest, only OCRmyPDF seemed like a solid, capable solution.

jbarlow83 commented 6 years ago

It should be fine for an intranet, just not a space where people could deliberately try to break it.

Temporary storage usage is linear with the number of pages in the PDF so you can usually handle hundreds of pages before that is an issue.
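As a rough illustration of keeping resource usage bounded on a shared server, the sketch below launches ocrmypdf from Python with an explicit argument list, a worker limit, and a timeout. The options other than --jobs are the ones from the original command in this issue; --jobs is ocrmypdf's option for limiting the number of parallel workers. The function names, the default of one job, and the 600-second timeout are assumptions made for this example, not recommendations from the project.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, sidecar_txt, lang="lav", jobs=1):
    """Build the argument list; no shell is involved, so names need no quoting."""
    return [
        "ocrmypdf",
        "-l", lang,
        "--rotate-pages",
        "--pdf-renderer", "tesseract",
        "--output-type", "pdf",
        "--sidecar", sidecar_txt,
        "--jobs", str(jobs),  # cap parallel workers instead of using every CPU
        input_pdf,
        output_pdf,
    ]


def run_ocr(input_pdf, output_pdf, sidecar_txt, timeout_s=600, **options):
    """Run ocrmypdf; raises CalledProcessError on failure, TimeoutExpired on timeout."""
    cmd = build_ocr_command(input_pdf, output_pdf, sidecar_txt, **options)
    return subprocess.run(
        cmd,
        capture_output=True,  # keep stdout and stderr for logging
        text=True,
        timeout=timeout_s,
        check=True,
    )
```

Calling run_ocr("input.pdf", "output.pdf", "output.txt") would run the same pipeline as the PHP script above, limited to a single worker.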


jbarlow83 commented 6 years ago

I'll close the issue now since the main concern seemed to be a platform configuration issue. If you have further related questions feel free to reopen it.