How to add HOCR / ALTOXML ?

m-art-in commented 5 years ago

I already have HOCR and/or ALTOXML files, so I don't have to generate OCR on the fly within Omeka S. How should I add my XML files to the items (or media?) so that they can be used by IiifSearch? Thanks for your help.

symac commented 5 years ago

@m-art-in unfortunately, in its current version, this plugin relies on XML files generated by the ExtractOCR module, which have a very specific structure for identifying the words and their positions on the page.

m-art-in commented 5 years ago

What kind of XML does the ExtractOCR module output? Maybe I can convert my files, because as far as I can see, the ExtractOCR module can only handle PDF, not JPG.

symac commented 5 years ago

Here is the beginning of one of these files:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="896" width="653">
    <fontspec id="0" size="35" family="Times" color="#000000"/>
    <fontspec id="1" size="34" family="Times" color="#000000"/>
    <fontspec id="2" size="5" family="Times" color="#000000"/>
    <fontspec id="3" size="4" family="Times" color="#000000"/>
    <fontspec id="4" size="16" family="Times" color="#000000"/>
    <fontspec id="5" size="15" family="Times" color="#000000"/>
    <fontspec id="6" size="11" family="Times" color="#000000"/>
    <fontspec id="7" size="7" family="Times" color="#000000"/>
    <fontspec id="8" size="9" family="Times" color="#000000"/>
    <fontspec id="9" size="8" family="Times" color="#000000"/>
    <fontspec id="10" size="6" family="Times" color="#000000"/>
    <fontspec id="11" size="2" family="Times" color="#000000"/>
    <fontspec id="12" size="0" family="Times" color="#000000"/>
<text top="104" left="220" width="267" height="35" font="0">BORDEAUX</text>
<text top="153" left="220" width="57" height="34" font="1">ET</text>
<text top="154" left="291" width="99" height="35" font="0">SON</text>
<text top="153" left="409" width="128" height="35" font="0">PORT</text>
<text top="479" left="217" width="27" height="7" font="2">PHOTO</text>
<text top="479" left="248" width="28" height="6" font="3">Patrick</text>
<text top="479" left="279" width="27" height="7" font="2">FAE3PE</text>
<text top="541" left="225" width="113" height="17" font="4">CONTACT</text>
<text top="541" left="345" width="39" height="17" font="4">-N°</text>
<text top="540" left="393" width="37" height="17" font="4">129</text>
<text top="569" left="248" width="100" height="17" font="5">FEVRIER</text>
<text top="569" left="357" width="50" height="17" font="4">1995</text>
<text top="621" left="225" width="77" height="13" font="6">DOSSIER</text>
<text top="649" left="225" width="67" height="8" font="7">BORDEAUX</text>
<text top="649" left="296" width="49" height="9" font="7">RENOUE</text>
<text top="649" left="349" width="31" height="9" font="7">AVEC</text>
<text top="664" left="224" width="25" height="8" font="7">SON</text>
<text top="663" left="254" width="32" height="9" font="7">PORT</text>
<text top="665" left="404" width="6" height="8" font="7">p</text>
<text top="664" left="417" width="6" height="10" font="8">3</text>
<text top="692" left="225" width="39" height="8" font="7">D'HIER</text>
<text top="690" left="268" width="8" height="11" font="8">À</text>
...
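
To make the structure above concrete, here is a minimal Python sketch that reads this kind of file and looks up a word together with its position. The function name `find_word` and the exact, case-insensitive matching are assumptions for illustration only; this is not the module's actual search code.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample above (three words kept).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="896" width="653">
    <fontspec id="0" size="35" family="Times" color="#000000"/>
<text top="104" left="220" width="267" height="35" font="0">BORDEAUX</text>
<text top="154" left="291" width="99" height="35" font="0">SON</text>
<text top="664" left="224" width="25" height="8" font="7">SON</text>
</page>
</pdf2xml>
"""


def find_word(xml_bytes, query):
    """Return page number and bounding box for every <text> equal to query.

    Hypothetical helper: exact, case-insensitive matching is an assumption.
    """
    root = ET.fromstring(xml_bytes)
    wanted = query.casefold()
    hits = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            # itertext() also collects text nested in child tags such as <b>.
            word = "".join(text.itertext()).strip()
            if word.casefold() == wanted:
                hits.append({
                    "page": int(page.get("number")),
                    "word": word,
                    "left": int(text.get("left")),
                    "top": int(text.get("top")),
                    "width": int(text.get("width")),
                    "height": int(text.get("height")),
                })
    return hits


hits = find_word(SAMPLE.encode("utf-8"), "son")
```

Each hit carries the page number plus `left`, `top`, `width` and `height`, which is the information a viewer would need to draw a highlight.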

m-art-in commented 5 years ago

Thank you very much! This looks similar to hOCR bounding boxes, so I guess I can convert my files. Do you save one XML per item, or at the media level?
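
A rough sketch of such a conversion in Python, assuming the hOCR file uses the usual `ocr_page` and `ocrx_word` classes with `bbox x0 y0 x1 y1` in the `title` attribute. The names `HocrWords` and `hocr_to_pdf2xml` are hypothetical. Two further assumptions are untested: that IiifSearch only needs the position attributes (so `font="0"` is a placeholder), and that it handles pixel coordinates relative to the page width and height.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect pages and words (with bounding boxes) from an hOCR document."""

    def __init__(self):
        super().__init__()
        self.pages = []          # [{"bbox": (x0, y0, x1, y1), "words": [...]}]
        self._word_bbox = None   # bbox of the word currently being read
        self._word_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if not match:
            return
        bbox = tuple(int(v) for v in match.groups())
        if "ocr_page" in classes:
            self.pages.append({"bbox": bbox, "words": []})
        elif "ocrx_word" in classes and self.pages:
            self._word_bbox = bbox
            self._word_text = []

    def handle_data(self, data):
        if self._word_bbox is not None:
            self._word_text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested <span> elements.
        if tag == "span" and self._word_bbox is not None:
            text = "".join(self._word_text).strip()
            if text:
                self.pages[-1]["words"].append((self._word_bbox, text))
            self._word_bbox = None


def hocr_to_pdf2xml(hocr):
    """Build a pdf2xml-like string (top/left/width/height) from hOCR text."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    root = ET.Element("pdf2xml")
    for number, page in enumerate(parser.pages, start=1):
        x0, y0, x1, y1 = page["bbox"]
        page_el = ET.SubElement(root, "page", {
            "number": str(number), "position": "absolute",
            "top": "0", "left": "0",
            "height": str(y1 - y0), "width": str(x1 - x0),
        })
        for (wx0, wy0, wx1, wy1), text in page["words"]:
            word_el = ET.SubElement(page_el, "text", {
                "top": str(wy0), "left": str(wx0),
                "width": str(wx1 - wx0), "height": str(wy1 - wy0),
                "font": "0",  # placeholder, assumed not to matter for search
            })
            word_el.text = text
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


HOCR_SAMPLE = """<html><body>
<div class='ocr_page' title='image "page1.jpg"; bbox 0 0 653 896; ppageno 0'>
<span class='ocrx_word' title='bbox 220 104 487 139; x_wconf 95'>BORDEAUX</span>
<span class='ocrx_word' title='bbox 220 153 277 187; x_wconf 93'>ET</span>
</div>
</body></html>"""

converted = hocr_to_pdf2xml(HOCR_SAMPLE)
```

The output has no DOCTYPE line; whether the module requires one is not something I have checked.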

symac commented 5 years ago

The XML is saved per Item, and has to be named with the same filename (source) as the PDF it relates to (book1.pdf → book1.xml on Item 1; book2.pdf → book2.xml on another item and so on).
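
As a small illustration of that naming rule, a Python sketch (the helper name `expected_xml_name` is made up for this example):

```python
from pathlib import PurePosixPath


def expected_xml_name(pdf_name):
    """Return the XML filename that pairs with a PDF, e.g. book1.pdf -> book1.xml."""
    return PurePosixPath(pdf_name).with_suffix(".xml").name


xml_name = expected_xml_name("book1.pdf")
```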

I don't really know what will happen if there is no PDF. The PDF itself is not used by the plugin, but there might be places where its filename is used; I don't remember.

m-art-in commented 5 years ago

I'll test this in the next few days. Thank you very much for your fast and helpful replies.

m-art-in commented 5 years ago

I'm sorry for the delay. For testing purposes, I have now manually added a PDF and an XML file to an item.

Unfortunately, the request to my server gives no result. When I call the URL I get an empty result, although the search term ("the") is definitely in the XML file (IP blacked out in the screenshot because it is a private test server).

Also, I can't find a way to set the IIIF Search URL in the settings of the IIIF server, so UniversalViewer doesn't show the search bar either.

Any suggestions?

symac commented 5 years ago

@m-art-in regarding the option to set the IIIF Search URL: it is now available in the current version of the module in its repository. It seems it is not included in the latest release (3.5.15), but you will get it if you clone from master on this repository.

Regarding the fact that it does not return anything, I can see two things:

the module is expected to give UniversalViewer the ability to highlight the search term on the pages, which have to be provided as one JPG per page. If I remember correctly, the script first loads the list of pages and then finds the ones that match the search query before returning them. Based on your screenshot, you have the PDF and the XML but no JPEG, which might cause the module to crash;
you also say, from what I understand, that you added the XML manually. IiifSearch expects it to be generated by the ExtractOCR module, so maybe there is an issue with the way the file is built that prevents IiifSearch from finding anything inside it?

Basically, what you need for this script to work is:

1 JPG per page;
1 PDF for the whole document, used to extract the OCR;
1 XML file containing the output from pdftohtml;

When all of that is in place it should work, I have been testing it on my server for different documents and have not encountered any issue so far, so it should work for you in the end!
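
The checklist above can be sketched as a quick pre-upload check in Python. The function name `check_item_files` is hypothetical, and comparing the number of `<page>` elements with the number of JPGs is my own assumption about what "1 JPG per page" implies, not something the module is known to enforce.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def check_item_files(filenames, xml_bytes):
    """Return a list of problems found in the files attached to one item."""
    suffixes = [PurePosixPath(name).suffix.lower() for name in filenames]
    jpg_count = sum(suffix in (".jpg", ".jpeg") for suffix in suffixes)
    xml_count = suffixes.count(".xml")

    problems = []
    if jpg_count == 0:
        problems.append("no JPG page images")
    if xml_count != 1:
        problems.append(f"expected 1 XML file, found {xml_count}")

    # Assumption: each <page> in the XML corresponds to one JPG.
    page_count = len(ET.fromstring(xml_bytes).findall("page"))
    if jpg_count and page_count != jpg_count:
        problems.append(
            f"XML has {page_count} pages but item has {jpg_count} JPGs"
        )
    return problems


two_pages = b"<pdf2xml><page number='1'/><page number='2'/></pdf2xml>"
ok = check_item_files(["book1.pdf", "p1.jpg", "p2.jpg", "book1.xml"], two_pages)
missing_images = check_item_files(["book1.pdf", "book1.xml"], two_pages)
```

An empty list means the item has the JPGs and the XML; the PDF is not checked here.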

m-art-in commented 5 years ago

Thank you very much for the quick help! I didn't know that you also need the pictures besides the PDF. Now everything works as desired. I installed the IIIF-Server module directly via git and could then set the IIIF-Search option.

One more question: if I create the XML file with the OCR text myself, it's probably sufficient to have the pictures, and the PDF is superfluous, isn't it?

symac commented 5 years ago

Great! You are right, the PDF should not be necessary if you are generating the XML on your side and uploading it to the item next to the JPGs.

I keep the PDFs so that I can offer them for download, so I have not tested this myself, but it should be fine without them. As it is working for you, I am now closing this issue.

smachefert / Omeka-S-module-IiifSearch

How to add HOCR / ALTOXML ? #4