smachefert / Omeka-S-module-IiifSearch

IIIF Search is a module for Omeka S that adds the IIIF Search API for OCR content.
GNU General Public License v3.0

How to add HOCR / ALTOXML ? #4

Closed m-art-in closed 5 years ago

m-art-in commented 5 years ago

I already have hOCR and/or ALTO XML files, so I don't need to generate OCR on the fly within Omeka S. How should I add my XML files to the items (or media?) so that they can be used by IiifSearch? Thanks for your help.

symac commented 5 years ago

@m-art-in unfortunately, in its current version, this plugin relies on XML files generated by the ExtractOCR module, which have a very specific structure for identifying the words and their positions on the page.

m-art-in commented 5 years ago

What kind of XML does the ExtractOCR module output? Maybe I can convert my files, because as far as I can see, the ExtractOCR module can only handle PDF, not JPG.

symac commented 5 years ago

Here is the beginning of one of these files:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="896" width="653">
    <fontspec id="0" size="35" family="Times" color="#000000"/>
    <fontspec id="1" size="34" family="Times" color="#000000"/>
    <fontspec id="2" size="5" family="Times" color="#000000"/>
    <fontspec id="3" size="4" family="Times" color="#000000"/>
    <fontspec id="4" size="16" family="Times" color="#000000"/>
    <fontspec id="5" size="15" family="Times" color="#000000"/>
    <fontspec id="6" size="11" family="Times" color="#000000"/>
    <fontspec id="7" size="7" family="Times" color="#000000"/>
    <fontspec id="8" size="9" family="Times" color="#000000"/>
    <fontspec id="9" size="8" family="Times" color="#000000"/>
    <fontspec id="10" size="6" family="Times" color="#000000"/>
    <fontspec id="11" size="2" family="Times" color="#000000"/>
    <fontspec id="12" size="0" family="Times" color="#000000"/>
<text top="104" left="220" width="267" height="35" font="0">BORDEAUX</text>
<text top="153" left="220" width="57" height="34" font="1">ET</text>
<text top="154" left="291" width="99" height="35" font="0">SON</text>
<text top="153" left="409" width="128" height="35" font="0">PORT</text>
<text top="479" left="217" width="27" height="7" font="2">PHOTO</text>
<text top="479" left="248" width="28" height="6" font="3">Patrick</text>
<text top="479" left="279" width="27" height="7" font="2">FAE3PE</text>
<text top="541" left="225" width="113" height="17" font="4">CONTACT</text>
<text top="541" left="345" width="39" height="17" font="4">-N°</text>
<text top="540" left="393" width="37" height="17" font="4">129</text>
<text top="569" left="248" width="100" height="17" font="5">FEVRIER</text>
<text top="569" left="357" width="50" height="17" font="4">1995</text>
<text top="621" left="225" width="77" height="13" font="6">DOSSIER</text>
<text top="649" left="225" width="67" height="8" font="7">BORDEAUX</text>
<text top="649" left="296" width="49" height="9" font="7">RENOUE</text>
<text top="649" left="349" width="31" height="9" font="7">AVEC</text>
<text top="664" left="224" width="25" height="8" font="7">SON</text>
<text top="663" left="254" width="32" height="9" font="7">PORT</text>
<text top="665" left="404" width="6" height="8" font="7">p</text>
<text top="664" left="417" width="6" height="10" font="8">3</text>
<text top="692" left="225" width="39" height="8" font="7">D'HIER</text>
<text top="690" left="268" width="8" height="11" font="8">À</text>
...
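
To make the structure above concrete, here is a minimal sketch that reads such a file and lists each word with its box. It only assumes what the excerpt shows: `page` elements containing `text` elements with `top`, `left`, `width` and `height` attributes. The function name `read_words` and the shortened sample are illustrative, not part of the module.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the excerpt above (two words, one fontspec).
SAMPLE = b"""<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="896" width="653">
    <fontspec id="0" size="35" family="Times" color="#000000"/>
<text top="104" left="220" width="267" height="35" font="0">BORDEAUX</text>
<text top="153" left="220" width="57" height="34" font="1">ET</text>
</page>
</pdf2xml>"""


def read_words(xml_bytes):
    """Return (page, word, left, top, width, height) tuples (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    words = []
    for page in root.iter("page"):
        number = int(page.get("number"))
        for text in page.iter("text"):
            # itertext() also collects text nested in inline children, if any.
            word = "".join(text.itertext()).strip()
            box = tuple(int(text.get(k)) for k in ("left", "top", "width", "height"))
            words.append((number, word) + box)
    return words


for row in read_words(SAMPLE):
    print(row)
```
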
m-art-in commented 5 years ago

Thank you very much! Seems like the hOCR bounding boxes. Guess I can convert this. Do you save one XML per Item, or on the level of media?
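
A rough sketch of such a conversion, assuming the hOCR is well-formed XHTML with `ocr_page` and `ocrx_word` elements that carry a `bbox x0 y0 x1 y1` entry in their `title` attribute. Whether IiifSearch also needs the `fontspec` entries or the DOCTYPE line is not confirmed in this thread, so the output below is an untested guess at the minimum; `hocr_to_pdf2xml` is a made-up name.

```python
import re
import xml.etree.ElementTree as ET

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Tiny hand-written hOCR sample (assumption: real files follow the same pattern).
HOCR = """<html><body>
<div class="ocr_page" title="bbox 0 0 653 896">
<span class="ocrx_word" title="bbox 220 104 487 139; x_wconf 95">BORDEAUX</span>
<span class="ocrx_word" title="bbox 220 153 277 187; x_wconf 93">ET</span>
</div>
</body></html>"""


def hocr_to_pdf2xml(hocr_text):
    """Convert hOCR words to pdf2xml-style <text> elements (hypothetical helper)."""
    root = ET.fromstring(hocr_text)
    out = ET.Element("pdf2xml")
    page, page_no = None, 0
    for el in root.iter():  # document order: each page comes before its words
        match = BBOX.search(el.get("title", ""))
        if match is None:
            continue
        x0, y0, x1, y1 = (int(v) for v in match.groups())
        cls = el.get("class", "")
        if cls == "ocr_page":
            page_no += 1
            page = ET.SubElement(out, "page", {
                "number": str(page_no), "position": "absolute",
                "top": "0", "left": "0",
                "height": str(y1 - y0), "width": str(x1 - x0),
            })
        elif cls == "ocrx_word" and page is not None:
            # hOCR stores two corners; pdf2xml stores top/left plus width/height.
            text = ET.SubElement(page, "text", {
                "top": str(y0), "left": str(x0),
                "width": str(x1 - x0), "height": str(y1 - y0),
                "font": "0",  # placeholder font id, assumed to be ignored
            })
            text.text = "".join(el.itertext()).strip()
    return ET.tostring(out, encoding="unicode")


print(hocr_to_pdf2xml(HOCR))
```

Real hOCR files often declare an XHTML namespace; because the loop matches on the `class` attribute rather than on tag names, that should not matter here.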

symac commented 5 years ago

The XML is saved per Item, and has to be named with the same filename (source) as the PDF it relates to (book1.pdf → book1.xml on Item 1; book2.pdf → book2.xml on another item and so on).
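
As a small illustration of that naming rule (the helper name `ocr_xml_name` is made up for this example):

```python
from pathlib import PurePath


def ocr_xml_name(pdf_filename):
    """Return the XML filename expected next to a given PDF (hypothetical helper)."""
    # Same base name as the PDF, with the extension swapped to .xml.
    return PurePath(pdf_filename).with_suffix(".xml").name


print(ocr_xml_name("book1.pdf"))
```
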

I don't really know what will happen if there is no PDF. The PDF itself is not used by the plugin, but there might be some places where its filename is used; I don't remember.

m-art-in commented 5 years ago

I'll test this in the next few days. Thank you very much for your fast and helpful replies.

m-art-in commented 5 years ago

I'm sorry for the delay. For testing purposes, I have now manually added a PDF and an XML file to an item. [image]

Unfortunately, the request to my server returns nothing. When I call the URL I get an empty result, even though the search term ("the") is definitely in the XML file (IP redacted because it is a private test server):

[image]

I also can't find a way to set the IIIF Search URL in the settings of the IIIF Server module, so the Universal Viewer doesn't show the search bar either. [image]

Any suggestions?

symac commented 5 years ago

@m-art-in the option to set the IIIF Search URL is now available in the current version of the module in its repository. It does not seem to be included in the latest release (3.5.15), but you will get it if you clone from master on this repository.

Regarding the fact that it does not return anything, I can see two things:

Basically, what you need for this script to work is:

When all of that is in place, it should work. I have been testing it on my server with different documents and have not encountered any issues so far, so it should work for you in the end!
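
One way to check the result by hand is to look at the `resources` array of the search response. The sketch below assumes the response follows the IIIF Search API 1.0 annotation list layout, where each hit's `on` value is a string of the form `<canvas>#xywh=x,y,w,h`. The service URL is a placeholder, not the module's actual route, and both sample responses are hand-written.

```python
from urllib.parse import urlencode


def search_url(service_base, term):
    """Build a search request URL; `q` is the standard IIIF Search parameter."""
    return f"{service_base}?{urlencode({'q': term})}"


def hits(response):
    """Return (text, canvas, xywh) for every annotation in a search response."""
    results = []
    for anno in response.get("resources", []):
        # Assumption: "on" is a plain string, as in the Search API 1.0 examples.
        canvas, _, xywh = anno.get("on", "").partition("#xywh=")
        results.append((anno.get("resource", {}).get("chars"), canvas, xywh))
    return results


# Hand-written examples: an empty result and a result with a single hit.
EMPTY = {"@type": "sc:AnnotationList", "resources": []}
ONE_HIT = {
    "@type": "sc:AnnotationList",
    "resources": [{
        "@type": "oa:Annotation",
        "resource": {"@type": "cnt:ContentAsText", "chars": "PORT"},
        "on": "https://example.org/canvas/p1#xywh=409,153,128,35",
    }],
}

print(search_url("https://example.org/iiif-search/1", "the"))
print(hits(EMPTY))
print(hits(ONE_HIT))
```

An empty `resources` list, as in `EMPTY`, corresponds to the "no result" case described above.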

m-art-in commented 5 years ago

Thank you very much for the quick help! I didn't know that besides the PDF you also need the pictures. Now everything works as desired. I installed the IIIF-Server module directly via git and could then set the IIIF-Search option.

One more question: if I create the XML file with the OCR text myself, it's probably sufficient to have the pictures, and the PDF is superfluous, isn't it?

symac commented 5 years ago

Great! You are right, the PDF should not be necessary if you are generating the XML on your side and uploading it to the item next to the JPG.

I keep the PDFs to offer them for download, so I have not tested this myself, but it should be fine without them. As it is working for you, I am now closing this issue.