internetarchive / archive-hocr-tools

Efficient hOCR tooling

Non-integer confidences cause error parsing #8

Open whikloj opened 9 months ago

whikloj commented 9 months ago

If a word confidence is not a whole number, parsing it raises a ValueError at line 186 of parse.py:

Traceback (most recent call last):
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/bin/recode_pdf", line 302, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/./internetarchivepdf/recode.py", line 640, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/./internetarchivepdf/recode.py", line 210, in create_tess_textonly_pdf
    word_data = hocr_page_to_word_data(hocr_page, font_scaler)
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/env/lib/python3.9/site-packages/hocr/parse.py", line 186, in hocr_page_to_word_data
    conf = int(m.group(1).split()[0])
ValueError: invalid literal for int() with base 10: '0.988'

The offending code is:

conf = int(m.group(1).split()[0])

You can reproduce this in any Python interpreter:

> python3 
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print(int("99"))
99
>>> print(int("99.9"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '99.9'

One solution is to convert to float() first:

conf = int(float(m.group(1).split()[0]))
>>> print(int(float("99.9")))
99
>>> print(int(float("99")))
99

Alternatively, keep the confidence as a float instead of converting it to an integer.
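To make the trade-off concrete, here is a small sketch using the '0.988' value from the traceback above. The variable names are only for illustration and do not appear in parse.py.

```python
raw = "0.988"  # the string that int() rejected in the traceback

truncated = int(float(raw))  # int() drops the fractional part
rounded = round(float(raw))  # round() goes to the nearest integer
as_float = float(raw)        # float() keeps the value unchanged

print(truncated, rounded, as_float)  # prints: 0 1 0.988
```

Note that int(float(...)) avoids the exception but turns 0.988 into 0, so keeping the float may be the safer choice if values like this are expected.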

MerlijnWajer commented 6 months ago

Apologies, I thought I had already commented on this. It looks like the spec does indeed allow floating-point values. We should probably drop the int() call and replace it with float(), unless the code promised that the confidence was an integer. The documentation just says "word confidence, 0 - 100", so I think we can go with a float.
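A rough sketch of what float-based parsing could look like, assuming the confidence comes from an x_wconf property in the hOCR title attribute. The regex and the function name parse_word_confidence are hypothetical and are not taken from hocr/parse.py.

```python
import re

# Hypothetical pattern for illustration; the real pattern in parse.py may differ.
WCONF_RE = re.compile(r"x_wconf\s+([0-9]+(?:\.[0-9]+)?)")


def parse_word_confidence(title):
    """Return the x_wconf value from an hOCR title string as a float, or None."""
    m = WCONF_RE.search(title)
    if m is None:
        return None
    # float() accepts both "93" and "98.8", so whole numbers keep working.
    return float(m.group(1))


print(parse_word_confidence("bbox 10 20 30 40; x_wconf 98.8"))
print(parse_word_confidence("bbox 10 20 30 40; x_wconf 93"))
print(parse_word_confidence("bbox 10 20 30 40"))
```

Callers that relied on an int would now receive a float (for example 93.0 instead of 93), so any code that formats or compares the confidence may need a look.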