microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
https://microsoft.github.io/kernel-memory
MIT License

Feature Request: Allow enabling OCR extraction from PDF #91

Closed GryBsh closed 5 months ago

GryBsh commented 1 year ago

I'd like to be able to opt in to OCR for PDF documents. I understand that Tesseract doesn't support this, but Form Recognizer does.

dluc commented 10 months ago

Hi @GryBsh, the solution supports Azure Form Recognizer, now known as Azure AI Document Intelligence.

GryBsh commented 10 months ago

I also took this issue to our MS account team and got the same answer. Let me tell you what I told them: read your own code: https://github.com/microsoft/kernel-memory/blob/0b8e4cc5592000096f39d80fce1302d24e9e9b39/service/Core/DataFormats/Pdf/PdfDecoder.cs

No, the solution does NOT support OCR of any kind on PDFs. It assumes that PDFs have already been OCR'd well. So I don't think the "completed" tag is very accurate.

Matt-Scheetz commented 8 months ago

Bump

crickman commented 8 months ago

@Matt-Scheetz - You can examine this project to see how to integrate Tesseract into kernel-memory: https://github.com/microsoft/chat-copilot

Otherwise, Azure Form Recognizer is supported if you add the configuration: https://github.com/microsoft/kernel-memory/blob/main/service/Service/appsettings.json#L338

dluc commented 8 months ago

@GryBsh, sorry about the misunderstanding. What I meant to say is that KM integrates Azure Form Recognizer as an optional OCR solution, but that integration is used only for images. For PDFs, KM always uses UglyToad.PdfPig, which is free and, if I remember correctly, was added earlier.

To use Azure Form Recognizer for PDFs, we'd need to make "PDF extraction" configurable, so that users can choose between Azure AI Document Intelligence, UglyToad.PdfPig, or any other injectable class. It would be a nice feature to have, though we currently don't have a timeline for it. If someone is willing to work on it and send a PR, it would definitely be welcome.
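
For anyone picking this up, here is a rough sketch of the "configurable PDF extraction" idea. It is written in Python for brevity rather than KM's C#, and every name in it (`PdfExtractor`, `TextLayerExtractor`, `OcrExtractor`, the `pdf_extractor` config key) is hypothetical. None of them exist in KM today, and the two extractor classes are stand-ins that do not call PdfPig or Azure.

```python
from typing import Callable, Dict, Protocol


class PdfExtractor(Protocol):
    """Hypothetical interface: anything that turns PDF bytes into text."""

    def extract_text(self, data: bytes) -> str: ...


class TextLayerExtractor:
    """Stand-in for a text-layer reader such as UglyToad.PdfPig (no OCR)."""

    def extract_text(self, data: bytes) -> str:
        # A real implementation would read the embedded text layer here.
        return "[text layer] %d bytes" % len(data)


class OcrExtractor:
    """Stand-in for an OCR service such as Azure AI Document Intelligence."""

    def extract_text(self, data: bytes) -> str:
        # A real implementation would send the document to the OCR service.
        return "[ocr] %d bytes" % len(data)


# Registry of injectable extractors, keyed by the (assumed) config value.
EXTRACTORS: Dict[str, Callable[[], PdfExtractor]] = {
    "text_layer": TextLayerExtractor,
    "ocr": OcrExtractor,
}


def create_pdf_extractor(config: dict) -> PdfExtractor:
    """Pick an extractor from config, defaulting to the non-OCR reader."""
    name = config.get("pdf_extractor", "text_layer")
    try:
        return EXTRACTORS[name]()
    except KeyError:
        raise ValueError(f"Unknown PDF extractor: {name!r}") from None
```

With something along these lines, the default stays as it is now (text layer only), and OCR becomes opt-in through a single setting, e.g. `create_pdf_extractor({"pdf_extractor": "ocr"})`. A third-party extractor would only need to register itself in `EXTRACTORS`.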