nlmatics / nlm-ingestor

This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0
1.02k stars 141 forks

Tika server is running OCR twice ? #76

Open le-codeur-rapide opened 1 month ago

le-codeur-rapide commented 1 month ago

Hello everyone! First of all, thank you for this project. I am using it in my RAG application and it is pretty cool!

Looking at the headers we send to the tika server:

def parse_to_html(self, filepath, do_ocr=False):
    # Turn off OCR by default
    timeout = 3000
    headers = {
        "X-Tika-OCRskipOcr": "true"
    }
    if do_ocr:
        headers = {
            "X-Tika-OCRskipOcr": "false",
            "X-Tika-OCRoutputType": "hocr",
            "X-Tika-Timeout-Millis": str(100 * timeout),
            "X-Tika-PDFOcrStrategy": "ocr_only",
            "X-Tika-OCRtimeoutSeconds": str(timeout),
        }

    if ensure_bool(os.environ.get("TIKA_OCR", False)):
        headers = None
    return parser.from_file(filepath, xmlContent=True, requestOptions={'headers': headers, 'timeout': timeout})

I see that we run pdfocr in each case (do_ocr true or false). I would think that it should be deactivated when do_ocr = False, and at least be optional when do_ocr = True. I did a little experimentation: with do_ocr = True, I get 60 better time performance when I deactivate pdfocr, without apparent loss in text extraction. Moreover, I can see that the text on images is extracted twice when both tikaocr and pdfocr are activated. Wouldn't it be better to deactivate pdfOcr by default?
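To make the suggestion concrete, here is a rough sketch of how the headers could be built so that the PDF OCR strategy is opt-in. The helper `build_tika_headers` and its `pdf_ocr_strategy` parameter are hypothetical names I made up for illustration; they are not part of nlm-ingestor. The header names are the ones from the snippet above. I am assuming that simply not sending `X-Tika-PDFOcrStrategy` is enough to avoid the duplicate extraction, which would need to be checked against a real Tika server.

```python
from typing import Dict, Optional


def build_tika_headers(
    do_ocr: bool = False,
    pdf_ocr_strategy: Optional[str] = None,  # hypothetical opt-in parameter
    timeout: int = 3000,
) -> Dict[str, str]:
    """Build Tika request headers (illustrative helper, not nlm-ingestor code)."""
    if not do_ocr:
        # Same default as the snippet above: tell Tika to skip OCR.
        return {"X-Tika-OCRskipOcr": "true"}

    headers = {
        "X-Tika-OCRskipOcr": "false",
        "X-Tika-OCRoutputType": "hocr",
        "X-Tika-Timeout-Millis": str(100 * timeout),
        "X-Tika-OCRtimeoutSeconds": str(timeout),
    }
    # Only send the PDF OCR strategy when the caller explicitly asks for it.
    # Assumption: omitting this header leaves Tika on its own default strategy.
    if pdf_ocr_strategy is not None:
        headers["X-Tika-PDFOcrStrategy"] = pdf_ocr_strategy
    return headers
```

With this shape, `build_tika_headers(do_ocr=True)` would leave the PDF OCR strategy unset, while `build_tika_headers(do_ocr=True, pdf_ocr_strategy="ocr_only")` would send the same headers as the current code.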

Or maybe I am missing something?

Don't hesitate to ask if I wasn't clear. I'll be happy to contribute if this is not the expected behaviour!

Paul