pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

New hOCR renderer renders duplicate HTML IDs #832

Open slbayer opened 1 year ago

slbayer commented 1 year ago

Bug report

I'm loving the new hOCR renderer for extracted text output. One problem I'm observing is that the HTML id attributes are not unique across the file. The ids are unique among the ocr_page elements, and within each page among the ocr_block elements, but that's not how HTML ids work: an id must be unique within the whole document. I'd recommend something like <div class='ocr_page' id='page_2' ...> and <div class='ocr_block' id='block_2_1' ...>, where the first integer is the page number and the second is the number of the block within the page.
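
To make the collision concrete, here is a minimal sketch that scans an HTML string for repeated id values using only the standard library. The `IdCollector` class, the `duplicate_ids` helper, and the sample markup are hypothetical illustrations, not part of pdfminer.six; the sample only imitates the shape of the current output.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import List


class IdCollector(HTMLParser):
    """Counts how often each id attribute value appears (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "id" and value is not None:
                self.ids[value] += 1


def duplicate_ids(html: str) -> List[str]:
    """Return the id values that occur more than once, sorted."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return sorted(v for v, n in collector.ids.items() if n > 1)


# Assumed sample imitating the current id scheme: bare integers for
# pages and for blocks, with block numbering restarting on each page.
sample = (
    "<div class='ocr_page' id='1' title='bbox 0 0 10 10'>"
    "<div class='ocr_block' id='0' title='bbox 0 0 5 5'></div>"
    "<div class='ocr_block' id='1' title='bbox 5 5 9 9'></div>"
    "</div>"
    "<div class='ocr_page' id='2' title='bbox 0 0 10 10'>"
    "<div class='ocr_block' id='0' title='bbox 0 0 5 5'></div>"
    "</div>"
)

print(duplicate_ids(sample))
```

In this sample the block ids collide across pages, and a block id also collides with a page id, since both are bare integers.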

slbayer commented 1 year ago

E.g., in release 20221105, in converter.py, line 947, change

"<div class='ocr_page' id='%s' title='%s'>\n"

to

"<div class='ocr_page' id='page_%s' title='%s'>\n"

and at lines 962-963, change

"<div class='ocr_block' id='%d' title='%s'>\n"
% (item.index, self.bbox_repr(item.bbox))

to

"<div class='ocr_block' id='block_%s_%d' title='%s'>\n"
% (ltpage.pageid, item.index, self.bbox_repr(item.bbox))
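
As a rough standalone illustration of the proposed id scheme, the sketch below builds the two opening tags with the same format strings as above. The functions `page_div` and `block_div` and their parameters are hypothetical stand-ins; in converter.py the values would come from the page id, `item.index`, and `self.bbox_repr(item.bbox)`.

```python
def page_div(pageid: int, title: str) -> str:
    """Opening tag for a page, with the id prefixed by 'page_'."""
    return "<div class='ocr_page' id='page_%s' title='%s'>\n" % (pageid, title)


def block_div(pageid: int, block_index: int, title: str) -> str:
    """Opening tag for a block, with the id qualified by the page id."""
    return "<div class='ocr_block' id='block_%s_%d' title='%s'>\n" % (
        pageid,
        block_index,
        title,
    )


# Blocks with the same index on different pages now get distinct ids.
first = block_div(1, 0, "bbox 0 0 5 5")
second = block_div(2, 0, "bbox 0 0 5 5")
print(first, second, sep="")
```

Prefixing with page_ and block_ also keeps page ids and block ids from colliding with each other, which bare integers cannot guarantee.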