microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
https://microsoft.github.io/kernel-memory
MIT License

Are `.doc` files supported? #677

Closed gmantri closed 4 months ago

gmantri commented 4 months ago

Context / Scenario

We have some Microsoft Word documents in the old .doc format. When we try to use those documents, Kernel Memory fails to answer questions about their content. When we convert the same documents to .docx format, everything works great.

microsoft.docx

What happened?

Our expectation was that both .doc and .docx files would work, but that is not happening: .doc files do not work, while .docx files do.

Importance

a fix would make my life easier

Platform, Language, Versions

Microsoft.KernelMemory.Core - 0.61.240524.1
Microsoft.SemanticKernel - 1.15.0

Relevant log output

No response

dluc commented 4 months ago

Hi @gmantri, the old .doc format is not supported, sorry. Aside from converting files manually, you could:

gmantri commented 4 months ago

@dluc - Thanks. Is there a list of file types supported by Kernel Memory? All I could find was this: https://github.com/microsoft/kernel-memory?tab=readme-ov-file#kernel-memory-km-and-sk-semantic-memory-sm and it only describes the file types at a high level (e.g. "Word" rather than saying .docx is supported and .doc is not). Having this list would be really helpful.

dluc commented 4 months ago

The default list can be extrapolated from here https://github.com/microsoft/kernel-memory/blob/3d34260ae513af48030da9a56aa50b8e0162c6f8/service/Core/DataFormats/DependencyInjection.cs#L81

        services.AddSingleton<IContentDecoder, TextDecoder>();
        services.AddSingleton<IContentDecoder, MarkDownDecoder>();
        services.AddSingleton<IContentDecoder, HtmlDecoder>();
        services.AddSingleton<IContentDecoder, PdfDecoder>();
        services.AddSingleton<IContentDecoder, ImageDecoder>();
        services.AddSingleton<IContentDecoder, MsExcelDecoder>();
        services.AddSingleton<IContentDecoder, MsPowerPointDecoder>();
        services.AddSingleton<IContentDecoder, MsWordDecoder>();

Using DI, one can inject more decoders, which are automatically picked up by TextExtractionHandler (https://github.com/microsoft/kernel-memory/blob/3d34260ae513af48030da9a56aa50b8e0162c6f8/service/Core/Handlers/TextExtractionHandler.cs#L43)

For each file, the handler goes through the list of decoders, asking each one whether it supports the current file's MIME type, and takes the last one that does:

var decoder = this._decoders.LastOrDefault(d => d.SupportsMimeType(uploadedFile.MimeType));

if (decoder is not null) ...
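
Because the lookup uses LastOrDefault, the last decoder in the list that accepts the MIME type is the one selected. The snippet below is a minimal, self-contained sketch of that selection rule only. `IDecoderSketch`, `NamedDecoder`, `DecoderPicker`, `CustomDocxDecoder`, and `LegacyDocDecoder` are hypothetical stand-ins invented for illustration, not Kernel Memory types; the real `IContentDecoder` interface has more members, and which MIME types the built-in decoders accept is assumed here rather than taken from the source.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// List order stands in for registration order (assumption for this sketch).
var decoders = new List<IDecoderSketch>
{
    new NamedDecoder("MsWordDecoder", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
    new NamedDecoder("CustomDocxDecoder", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
    new NamedDecoder("LegacyDocDecoder", "application/msword"),
};

// Only the hypothetical legacy decoder accepts the .doc MIME type.
Console.WriteLine(DecoderPicker.Pick(decoders, "application/msword")?.Name ?? "(none)");

// Two decoders accept .docx; the one added later is selected.
Console.WriteLine(DecoderPicker.Pick(decoders, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")?.Name ?? "(none)");

// No decoder accepts this type, so the lookup yields null.
Console.WriteLine(DecoderPicker.Pick(decoders, "application/x-unknown")?.Name ?? "(none)");

// Hypothetical, simplified stand-in for a decoder: only the MIME check is modeled.
public interface IDecoderSketch
{
    string Name { get; }
    bool SupportsMimeType(string mimeType);
}

// A decoder that accepts exactly one MIME type.
public sealed class NamedDecoder : IDecoderSketch
{
    private readonly string _mimeType;

    public NamedDecoder(string name, string mimeType)
    {
        Name = name;
        _mimeType = mimeType;
    }

    public string Name { get; }

    public bool SupportsMimeType(string mimeType) =>
        string.Equals(mimeType, _mimeType, StringComparison.OrdinalIgnoreCase);
}

public static class DecoderPicker
{
    // Same selection rule as the handler line quoted above: last match wins.
    public static IDecoderSketch? Pick(IEnumerable<IDecoderSketch> decoders, string mimeType) =>
        decoders.LastOrDefault(d => d.SupportsMimeType(mimeType));
}
```

If the injected decoders reach the handler in registration order, a custom decoder registered after the defaults would take precedence for any MIME type it claims, and a decoder for a type nobody else handles (such as application/msword for .doc) would simply fill the gap.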

gmantri commented 4 months ago

Thank you!