Closed by GryBsh 5 months ago
Hi @GryBsh, the solution supports Azure Form Recognizer, now known as Azure AI Document Intelligence.
I also took this issue to our MS account team and got the same answer. Let me tell you what I told them: read your own code: https://github.com/microsoft/kernel-memory/blob/0b8e4cc5592000096f39d80fce1302d24e9e9b39/service/Core/DataFormats/Pdf/PdfDecoder.cs
No, the solution does NOT support OCR of any kind on PDFs. It assumes the PDFs have already been OCR'd well. So I don't think that "completed" tag is very accurate.
Bump
@Matt-Scheetz - You can examine this project to see how to integrate Tesseract into kernel-memory: https://github.com/microsoft/chat-copilot
Otherwise, Azure Form Recognizer is supported if you add the configuration: https://github.com/microsoft/kernel-memory/blob/main/service/Service/appsettings.json#L338
@GryBsh sorry about the misunderstanding. What I meant to say is that KM has integrated Azure Form Recognizer as an optional OCR solution; however, the integration is used only for images. For PDFs, KM always uses UglyToad.PdfPig, which is free and was added earlier, if I remember correctly.
In order to use Azure Form Recognizer for PDFs, we'll need to make "PDF extraction" configurable, allowing a choice between Azure Doc Intelligence, UglyToad.PdfPig, or any other injectable class. It would be a nice feature to have, though currently we don't have a timeline for it. If someone is willing to work on it and send a PR, it would definitely be welcome.
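To make the "configurable PDF extraction" idea concrete, here is a rough, language-neutral sketch written in Python. It is not Kernel Memory's API: the names `PdfTextExtractor`, `EmbeddedTextExtractor`, `OcrExtractor`, `build_extractor`, and the `pdf_extractor` config key are all hypothetical, and the two extractor classes are stand-ins that do not call PdfPig or Azure. It only illustrates selecting an injectable extractor by configuration.

```python
from typing import Callable, Dict, Protocol


class PdfTextExtractor(Protocol):
    """Hypothetical interface that every PDF extractor would implement."""

    def extract_text(self, data: bytes) -> str: ...


class EmbeddedTextExtractor:
    """Stand-in for a text-layer decoder (the role PdfPig plays today).

    A real implementation would parse the PDF; this one just decodes bytes.
    """

    def extract_text(self, data: bytes) -> str:
        return data.decode("utf-8", errors="ignore")


class OcrExtractor:
    """Stand-in for an OCR-backed decoder (the role an OCR service would play).

    The OCR call itself is injected, so no real service is assumed here.
    """

    def __init__(self, ocr: Callable[[bytes], str]) -> None:
        self._ocr = ocr

    def extract_text(self, data: bytes) -> str:
        return self._ocr(data)


def build_extractor(
    config: dict,
    registry: Dict[str, Callable[[], PdfTextExtractor]],
) -> PdfTextExtractor:
    """Pick an extractor by name from config, defaulting to the text layer."""
    name = config.get("pdf_extractor", "embedded_text")
    if name not in registry:
        raise ValueError(f"Unknown PDF extractor: {name!r}")
    return registry[name]()


if __name__ == "__main__":
    # Fake OCR function used only for demonstration.
    registry = {
        "embedded_text": EmbeddedTextExtractor,
        "ocr": lambda: OcrExtractor(lambda data: f"<ocr of {len(data)} bytes>"),
    }
    extractor = build_extractor({"pdf_extractor": "ocr"}, registry)
    print(extractor.extract_text(b"scanned page bytes"))
```

With a design along these lines, the current behavior stays the default, and OCR for PDFs becomes an opt-in choice made in configuration rather than in code.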
I'd like to be able to opt in to OCR for PDF documents. I understand that Tesseract doesn't support this, but Form Recognizer does.