openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Filter tesseract log lines in orientation detection #27

Closed by voyageur 9 years ago

voyageur commented 9 years ago

With an OpenCL-enabled tesseract, the output has some additional lines, including some without a ":" in them. This patch filters them out before looking for the orientation line (otherwise pyocr fails with "no scripts detected").

Sample output:

Tesseract Open Source OCR Engine v3.03 with Leptonica
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.

[DS] Device: "GeForce GTX 970" (OpenCL) evaluation...
[DS] Device: "GeForce GTX 970" (OpenCL) evaluated
[DS]          composeRGBPixel: 0.013045 (w=1.2)
[DS]            HistogramRect: 0.011213 (w=2.4)
[DS]       ThresholdRectToPix: 0.006863 (w=4.5)
[DS]        getLineMasksMorph: 0.003713 (w=5.0)
[DS]                    Score: 0.092011

[DS] Device: "(null)" (Native) evaluation...
[DS] Device: "(null)" (Native) evaluated
[DS]          composeRGBPixel: 0.013465 (w=1.2)
[DS]            HistogramRect: 0.056878 (w=2.4)
[DS]       ThresholdRectToPix: 0.016565 (w=4.5)
[DS]        getLineMasksMorph: 0.109072 (w=5.0)
[DS]                    Score: 0.772569
[DS] Scores written to file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 1:GeForce GTX 970 score is 0.092011
[DS] Device[2] 0:(null) score is 0.772569
[DS] Selected Device[1]: "GeForce GTX 970" (OpenCL)
Orientation: 0
Orientation in degrees: 0
Orientation confidence: 16.75
Script: 1
Script confidence: 11.35
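
For illustration, the filtering idea can be sketched roughly as follows. This is not the actual patch: `parse_osd_output` is a hypothetical helper name, and the sample string is a shortened copy of the output above. It assumes that every line of interest has the form "key: value" and that lines without a ":" can be dropped safely.

```python
def parse_osd_output(raw_output):
    """Hypothetical helper: build a dict from 'key: value' lines only."""
    fields = {}
    for line in raw_output.splitlines():
        # Skip the banner and OpenCL profiling lines that carry no separator.
        if ":" not in line:
            continue
        # Split on the first ":" only, so values may contain colons themselves.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# Shortened copy of the sample output from this comment.
sample = """Tesseract Open Source OCR Engine v3.03 with Leptonica
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.
[DS] Selected Device[1]: "GeForce GTX 970" (OpenCL)
Orientation: 0
Orientation in degrees: 0
Orientation confidence: 16.75
Script: 1
Script confidence: 11.35
"""

fields = parse_osd_output(sample)
angle = int(fields["Orientation in degrees"])
confidence = float(fields["Orientation confidence"])
```

The extra "[DS]" entries that do contain a ":" end up in the dict as well, but they are harmless because only the orientation keys are looked up afterwards.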

jflesch commented 9 years ago

Doesn't that break image_to_string() as well?

voyageur commented 9 years ago

That one worked fine: the exit status is correct, so the additional output does not matter (and it does not end up in the outputbase files). Tested on a few Paperwork scans :)

jflesch commented 9 years ago

Ok for me. Thanks for the fix :)