Open ddddavidmartin opened 6 years ago
Having a look through the pyocr sources this stands out to me:
src/pyocr/builders.py
307- file_ext = ["txt"]
308: tess_flags = ["-psm", str(tesseract_layout)]
309- cun_args = ["-f", "text"]
--
564- file_ext = ["html", "hocr"]
565: tess_flags = ["-psm", str(tesseract_layout)]
566- tess_conf = ["hocr"]
--
640- file_ext = ["html", "hocr"]
641: tess_flags = ["-psm", str(tesseract_layout)]
642- tess_conf = ["hocr"]
Does pyocr just use -psm instead of --psm as the parameter? I'm wondering whether that is not accepted anymore now.
It looks like this is the problem. I have changed the options passed in builders.py to provide --psm instead of -psm and it works fine now. I might create a pull request for this, though I'm not sure whether there are any other implications of this.
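For anyone who needs to keep older tesseract builds working as well, here is a minimal sketch of picking the flag from the tesseract version instead of hard-coding it. The helper name psm_flags and the cutoff version are assumptions made for illustration, not pyocr code; please check the tesseract changelog for the release where --psm was actually introduced before relying on the cutoff.

```python
# Hypothetical helper, not part of pyocr.
# ASSUMPTION: tesseract builds at or above DOUBLE_DASH_SINCE accept "--psm",
# and older builds only accept "-psm". Verify the cutoff for your build.
DOUBLE_DASH_SINCE = (3, 5, 0)


def psm_flags(tesseract_layout, tesseract_version, cutoff=DOUBLE_DASH_SINCE):
    """Return the page segmentation mode flags for a given version tuple."""
    # Tuples compare element by element, so (4, 0, 0) >= (3, 5, 0) is True.
    if tuple(tesseract_version) >= tuple(cutoff):
        flag = "--psm"
    else:
        flag = "-psm"
    return [flag, str(tesseract_layout)]


if __name__ == "__main__":
    print(psm_flags(3, (4, 0, 0)))
    print(psm_flags(3, (3, 4, 1)))
```

The builders could then set tess_flags = psm_flags(tesseract_layout, version), with the version tuple obtained however pyocr already detects the installed tesseract.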
The commit in question in tesseract is the following: https://github.com/tesseract-ocr/tesseract/commit/ee201e1f4fa277a4b2ecd751a45d3bf1eba6dfdb
I also came across this today. I note that -psm is used not just in builders.py but also in tesseract.py.
I haven't had a chance yet to work out the circular import statements that I introduced in https://github.com/ddddavidmartin/pyocr/tree/update_deprecated_psm_option_string. If anyone wants to step in, feel free to give it a go.
For now, a quick and dirty fix is to just apply https://github.com/openpaperwork/pyocr/pull/100/commits/c136838b46cf49f06ac1dc5f2f9bc16232c11213.
Good day,
I'm using pyocr through Paperless on an Ubuntu setup. I'm using the tesseract-ocr PPA [0], and on the latest version [1] pyocr throws an error.
[0]
[1]
Traceback:
Has anyone else come across this? Thanks!