Hello everyone! First of all, thank you for this project. I am using it in my RAG application and it is pretty cool!
Looking at the headers we send to the Tika server, I see that we run pdfocr in every case (do_ocr true or false). I would think it should be deactivated when do_ocr = False, and at least be optional when do_ocr = True.
I did a little experimentation: with do_ocr = True, I get 60 better time performance when I deactivate pdfocr, with no apparent loss in text extraction. Moreover, I can see that the text in images is extracted twice when both tikaocr and pdfocr are activated.
Isn't it better to deactivate pdfOcr by default?
Or maybe I am missing something?
Don't hesitate to ask if I wasn't clear. I'll be happy to contribute if this is not the expected behaviour!
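To make the proposal concrete, here is a rough sketch of how the headers could be built. The helper name `build_tika_headers` and the `pdf_ocr` option are hypothetical, and the header names are my assumption based on Tika server's `X-Tika-PDF...` / `X-Tika-OCR...` convention, so they may not match what the project sends today.

```python
from typing import Dict


def build_tika_headers(do_ocr: bool, pdf_ocr: bool = False) -> Dict[str, str]:
    """Build request headers for a Tika server call.

    Hypothetical helper: header names are assumed from Tika server's
    X-Tika-PDF* / X-Tika-OCR* convention, not taken from this project.
    """
    headers: Dict[str, str] = {}

    if not do_ocr:
        # OCR not requested: skip the OCR parser and disable PDF page OCR.
        headers["X-Tika-OCRskipOcr"] = "true"
        headers["X-Tika-PDFOcrStrategy"] = "no_ocr"
        return headers

    # OCR requested: extract inline images so they can be OCRed once.
    headers["X-Tika-PDFextractInlineImages"] = "true"

    # PDF page OCR is opt-in, off by default, to avoid extracting
    # image text twice and paying the extra processing time.
    headers["X-Tika-PDFOcrStrategy"] = (
        "ocr_and_text_extraction" if pdf_ocr else "no_ocr"
    )
    return headers
```

With this shape, `build_tika_headers(do_ocr=True)` would keep image OCR but leave pdfOcr off unless `pdf_ocr=True` is passed explicitly.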
Paul