virantha / pypdfocr

Python script to do PDF OCR conversion using Tesseract
Apache License 2.0

Wrong y position #27

Closed Wikunia closed 9 years ago

Wikunia commented 9 years ago

Hi,

First of all, thanks for this awesome program. Unfortunately I have a problem: when I search for a string inside the _ocr PDF, I find the matches, but the highlighted part is always around 5 cm (I know that's not the best unit :D ) below the real match. [image: pypdfocr]

virantha commented 9 years ago

Hi Ole, thanks for the feedback. I've noticed this every now and then, but nothing consistent. Do you have a pdf you could share that manifests this error so I can take a stab at tracing it down?

On Thursday, February 12, 2015, Ole Kröger notifications@github.com wrote:

Hi,

first of all thanks for this awesome program. Unfortunately I have a problem: When I search a string inside the _ocr pdf I found the matches but the highlighted part is always around 5 cm (I know that's not the best unit :D ) under the real match. [image: pypdfocr] https://cloud.githubusercontent.com/assets/4931746/6173083/89040d28-b2e5-11e4-8369-32d76272b46e.png

Wikunia commented 9 years ago

Hi, I can't share the PDF with you, but I will look for another one :) Stay tuned! These are my logs, and btw the CPU usage doesn't look normal...

Starting conversion of 2015.pdf
WARNING: X-dpi is 16, Y-dpi is 22, defaulting to 300
convert: unable to extent pixel cache `Cannot allocate memory' @ fatal/cache.c/CacheSignalHandler/3333.

WARNING: Could not run command convert "2015_18.jpg" -respect-parenthesis \( -clone 0 -colorspace gray -negate -lat 15x15+5\% -contrast-stretch 0 \) -compose copy_opacity -composite -opaque none +matte -modulate 100,100 -blur 1x1 -adaptive-sharpen 0x2 -negate -define morphology:compose=darken -morphology Thinning Rectangle:1x30+0+0 -negate  "2015_preprocess.jpg"
Making pool                                 
Completed conversion successfully to 2015_ocr.pdf
pypdfocr 2015.pdf  1176.63s user 16.66s system 345% cpu 5:45.50 total

Wikunia commented 9 years ago

Well it looks fine on any other pdf I tried so don't worry. The important thing is that I can now search inside the pdf! Thanks!!!

virantha commented 9 years ago

By any chance, was the "bad" one in landscape and the rest in portrait orientation?

Wikunia commented 9 years ago

The "bad" one is a PDF generated with LaTeX, I think, and it looks like it has a 1:1 aspect ratio.
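
For anyone hitting the same symptom: the log above shows the DPI could not be read sensibly (X-dpi 16, Y-dpi 22) and fell back to 300. The thread does not confirm the cause, but one possible mechanism for a vertical offset is a mismatch between the DPI used to scale the OCR word boxes and the real resolution of the page image. The sketch below only illustrates that arithmetic. The function names and all numbers are hypothetical and are not taken from pypdfocr.

```python
def px_to_pt(px, dpi):
    """Convert a pixel length to PDF points (1 pt = 1/72 inch)."""
    return px * 72.0 / dpi


def word_y_in_pdf(bbox_bottom_px, page_height_pt, dpi):
    """Map the bottom edge of an OCR word box to a PDF y coordinate.

    OCR boxes are measured from the top of the image, while PDF
    coordinates start at the bottom of the page, so the axis is flipped.
    """
    return page_height_pt - px_to_pt(bbox_bottom_px, dpi)


# Hypothetical page (made-up numbers): 6600 px tall image rendered at
# 600 dpi, with a word whose box ends 1200 px from the top.
true_dpi = 600
assumed_dpi = 300  # the fallback value mentioned in the log warning
page_height_pt = px_to_pt(6600, true_dpi)

correct_y = word_y_in_pdf(1200, page_height_pt, true_dpi)
shifted_y = word_y_in_pdf(1200, page_height_pt, assumed_dpi)

# Negative means the text layer sits below the visible word.
offset_cm = (shifted_y - correct_y) / 72.0 * 2.54
print(f"offset: {offset_cm:.2f} cm")
```

In this made-up case the text layer ends up lower on the page than the visible word. The size and direction of the shift depend entirely on the assumed and actual DPI, so treat this as a way to reason about the symptom rather than a diagnosis of this issue.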