ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-cut gives error #154

Open sarangtc opened 4 years ago

sarangtc commented 4 years ago

hocr-cut.py gives the following error:

Traceback (most recent call last):
  File "../hocr-cut.py", line 48, in <module>
    filename = os.path.join(os.path.dirname(args.file), filename)
  File "/usr/lib/python2.7/posixpath.py", line 68, in join
    if b.startswith('/'):
AttributeError: 'NoneType' object has no attribute 'startswith'
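The traceback suggests that the second argument passed to os.path.join was None. A minimal sketch of that failure mode, assuming a hypothetical helper join_with_missing_image (this is not the code in hocr-cut itself):

```python
import os


def join_with_missing_image(hocr_path, image_name=None):
    # Hypothetical helper: image_name stays None when no image name
    # could be read from the hOCR file.
    return os.path.join(os.path.dirname(hocr_path), image_name)


try:
    join_with_missing_image("test_0012.hocr")
except (AttributeError, TypeError) as exc:
    # Python 2.7 raised AttributeError (as in the traceback above);
    # Python 3 raises TypeError for the same call.
    print("join failed: %s" % type(exc).__name__)
```

Either way, the exception only says that a None reached os.path.join, not where the None came from.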

zuphilip commented 4 years ago

The message refers to line 48:

https://github.com/tmbdev/hocr-tools/blob/b3e380779e5c88ad99dca2a6b8b292c0f375fd68/hocr-cut#L48

What is the exact call of hocr-cut you are doing? Can you share the hocr file here?

sarangtc commented 4 years ago

Hi,

Sorry, I missed your message.

These are the full details:

I installed hocr-tools on Ubuntu 16.04 using: sudo pip install hocr-tools

Although hocr-pdf works, the hocr-cut command gave: hocr-cut: command not found

So I copied the code from GitHub to /usr/local/bin/hocr-cut and made it executable.

In my home folder (where hocr-pdf works), I ran the command:

hocr-cut test_0012.hocr

The file test_0012.hocr is attached for reference. The output was:

Traceback (most recent call last):
  File "/usr/local/bin/hocr-cut", line 48, in <module>
    filename = os.path.join(os.path.dirname(args.file), filename)
  File "/usr/lib/python2.7/posixpath.py", line 68, in join
    if b.startswith('/'):
AttributeError: 'NoneType' object has no attribute 'startswith'

I tried various two-column hOCR files, but all gave the same error message.

(Sent by email in reply to Philipp Zumstein's comment of Thu, Aug 29, 2019.)

zuphilip commented 4 years ago

The pip package is not up to date, which is why hocr-cut was not found at first. Try instead

pip install git+https://github.com/tmbdev/hocr-tools.git

However, I am not sure this will solve your problems...

Your example file is not attached to this issue on GitHub (I guess attachments do not come through when you attach them to the email only). Can you upload it directly to this issue on GitHub? Or upload it somewhere such as https://pastebin.com/ and give the link here.

sarangtc commented 4 years ago

here is the file test_0012.txt

zuphilip commented 4 years ago

Okay, I see that you have not specified the image in your hOCR file on line 13. Try changing this line to something like

<div class='ocr_page' lang='unknown' title='image IMAGENAME.PNG; bbox 0 0 6169 4648'>

where you should replace IMAGENAME.PNG with the name of your image file. Does that work?

(We can try to make a better error message for this.)
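One way such a check could look is sketched below. It is only an illustration, not the code in hocr-cut: the helper names parse_title and image_path are hypothetical, and the sketch assumes the title attribute is a semicolon-separated list of "key value" properties, as in the ocr_page line above.

```python
import os


def parse_title(title):
    """Split an hOCR title attribute into a {key: value} dict."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.strip().split(None, 1)
        if not parts:
            continue  # skip empty chunks, e.g. after a trailing ';'
        key = parts[0]
        # The value may be wrapped in quotes, e.g. image "page.png"
        value = parts[1].strip().strip("\"'") if len(parts) > 1 else ""
        props[key] = value
    return props


def image_path(hocr_path, title):
    """Return the image path next to the hOCR file, or fail with a clear message."""
    name = parse_title(title).get("image")
    if not name:
        raise ValueError(
            "ocr_page title has no 'image' property: %r" % title
        )
    return os.path.join(os.path.dirname(hocr_path), name)
```

With the corrected title, image_path("scans/test_0012.hocr", "image IMAGENAME.PNG; bbox 0 0 6169 4648") returns the path of IMAGENAME.PNG inside scans. With a title that only carries a bbox, it raises a ValueError that names the missing property instead of failing inside os.path.join.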

sarangtc commented 4 years ago

OK, that worked: it gave me myimage.left.jpg and myimage.right.jpg. I was primarily expecting two hOCR files, one for each half (later to be merged with the images to make the PDF with hocr-pdf).

I assumed this from the description: "Cut a page (horizontally) into two pages in the middle such that the most of the bounding boxes are separated nicely, e.g. cutting double pages or double columns"

I guess you meant the image itself and not the hOCR file!
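To illustrate what "separated nicely" in the quoted description could mean, and what splitting the hOCR data as well would involve, here is a minimal sketch. It is not how hocr-cut is implemented: the functions choose_cut and split_boxes and the search_fraction parameter are assumptions made for this example. Boxes are (x0, y0, x1, y1) tuples, as in hOCR bbox properties.

```python
def choose_cut(page_width, boxes, search_fraction=0.2):
    """Pick the x position near the page middle that crosses the fewest boxes."""
    middle = page_width // 2
    radius = int(page_width * search_fraction / 2)
    best_x, best_key = middle, None
    for x in range(middle - radius, middle + radius + 1):
        # A box is crossed when the cut falls strictly inside it.
        crossed = sum(1 for (x0, _, x1, _) in boxes if x0 < x < x1)
        # Prefer fewer crossed boxes, then the position closest to the middle.
        key = (crossed, abs(x - middle))
        if best_key is None or key < best_key:
            best_x, best_key = x, key
    return best_x


def split_boxes(boxes, cut_x):
    """Assign boxes to the left or right half; right-half boxes are shifted."""
    left = [b for b in boxes if b[2] <= cut_x]
    right = [
        (x0 - cut_x, y0, x1 - cut_x, y1)
        for (x0, y0, x1, y1) in boxes
        if x0 >= cut_x
    ]
    return left, right
```

Boxes that straddle the cut end up in neither half in this sketch; a real tool would need a policy for them. Producing two hOCR files would additionally require writing the shifted coordinates back into the title attributes of each half.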