Some PDFs Not Properly Indexing

apennypacker-launchcg commented 2 months ago

Context / Scenario

My team and I built a small feasibility proof of concept using Kernel Memory, which we use often, to index 11 documents (10 .pdf, 1 .docx) and extract some common data fields (e.g., date, location).

What happened?

After instantiating Kernel Memory, we were able to successfully import 9 of the 11 documents using the ImportDocumentAsync() method.

Initially, it appeared that all 11 documents were indexed properly; however, queries against 2 of the files returned "INFO NOT FOUND".

Investigating further, we called the GetDocumentStatusAsync("docId") method for those files, and it indicated that processing completed successfully.

Below is the output for reference:

    Completed: True
    Failed: False
    Empty: False
    Index: default
    ...
    RemainingSteps: [ ]
    CompletedSteps: [extract, partition, gen_embeddings, save_records]

All prompts return INFO NOT FOUND, including:

    "What is the name of this document?"
    "What is this document about?"
    "What is the word count of this document?"

There does not appear to be any difference between these 2 files and the other 9 that indexed correctly (size, file type, naming convention, etc. are all consistent).

Importance

I cannot use Kernel Memory

Platform, Language, Versions

Relevant technologies include:

    .NET 8 (C#)
    Microsoft.KernelMemory.Core 0.71.240820.1

Relevant log output

No response

dluc commented 2 months ago

Could you try importing only those 2 files and checking the content of the vector DB? You can also check the content of document storage to see the extracted text and all the chunks. That should give you some information to see whether those docs are handled differently from the other 9.
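A rough sketch of that check, for anyone following along. It imports only the suspect files into a fresh serverless instance, prints the pipeline status, and then runs a search restricted to each document to see whether any chunks come back. The file paths, the document IDs, and the use of OpenAI defaults with an OPENAI_API_KEY environment variable are all assumptions for illustration; substitute whatever your proof of concept already uses.

```csharp
// Sketch only: assumes the Microsoft.KernelMemory.Core package and an OpenAI key
// in the OPENAI_API_KEY environment variable (both are assumptions).
using System;
using Microsoft.KernelMemory;

var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    .Build<MemoryServerless>();

// Hypothetical paths and IDs standing in for the two problem files.
var suspects = new (string Path, string Id)[]
{
    ("problem-file-1.pdf", "suspect-1"),
    ("problem-file-2.pdf", "suspect-2"),
};

foreach (var (path, id) in suspects)
{
    await memory.ImportDocumentAsync(path, documentId: id);

    // Same status object that was printed earlier in this thread.
    var status = await memory.GetDocumentStatusAsync(id);
    Console.WriteLine($"{id}: completed={status?.Completed}, failed={status?.Failed}, empty={status?.Empty}");

    // Search restricted to this document, with no relevance threshold,
    // so any stored chunk is eligible to be returned.
    var result = await memory.SearchAsync(
        "document", filter: MemoryFilters.ByDocument(id), minRelevance: 0, limit: 5);

    var chunks = 0;
    foreach (var citation in result.Results)
    {
        foreach (var partition in citation.Partitions)
        {
            chunks++;
            var text = partition.Text;
            Console.WriteLine($"  chunk: {text[..Math.Min(80, text.Length)]}");
        }
    }

    Console.WriteLine($"{id}: {chunks} chunk(s) returned");
}
```

If a document reports Completed: True but no chunks are returned, that points at text extraction producing nothing rather than at the question-answering step.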

apennypacker-launchcg commented 1 month ago

@dluc Thanks for your reply. We are indexing in memory for now as this is just a proof of concept on a few small documents.

However, when we call the ListIndexesAsync() method on our Kernel Memory instance as follows:

    IEnumerable<IndexDetails> indexes = await kernelMemory.ListIndexesAsync();

On the 2 files which are encountering an issue: indexes returns with a count of 0 and "Enumeration yielded no results"

On an example file which seems to be indexing correctly: indexes returns with a count of 1 and with Name[string] = "default"

dluc commented 1 month ago

One thing that could be happening is that the 2 files are not imported at all, e.g. the code is unable to open them or unable to extract the content. There are a few reasons I can think of: the text is too small, the files could be protected with DRM or another protection system, or the format is not recognized. I would check that all filenames have the ".pdf" extension (if they are PDFs) and that they can be opened by simple PDF readers, e.g. a browser, without protection. See also whether you can copy and paste the text content of those 2 files, which usually means the text is easily detected.
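Part of that checklist can be approximated in code before involving Kernel Memory at all. The sketch below is a crude heuristic, not a PDF parser: it looks for the "%PDF-" signature near the start of the file and for an "/Encrypt" entry, which encrypted PDFs declare in their trailer. The class and method names are made up for illustration, and a hit on "/Encrypt" can be a false positive if those bytes happen to occur in binary data. It also says nothing about scanned, image-only PDFs, which is what the copy and paste test is for.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for a quick sanity check of files that fail to index.
public static class PdfSanityCheck
{
    // True when the "%PDF-" signature appears within the first 1024 bytes.
    public static bool HasPdfHeader(byte[] bytes)
    {
        var head = Encoding.Latin1.GetString(bytes, 0, Math.Min(bytes.Length, 1024));
        return head.Contains("%PDF-", StringComparison.Ordinal);
    }

    // True when the raw bytes contain "/Encrypt" anywhere (heuristic only).
    public static bool MentionsEncryption(byte[] bytes) =>
        Encoding.Latin1.GetString(bytes).Contains("/Encrypt", StringComparison.Ordinal);

    // Reads a file from disk and summarizes both checks in one line.
    public static string Describe(string path)
    {
        var bytes = File.ReadAllBytes(path);
        return $"{Path.GetFileName(path)}: size={bytes.Length} bytes, " +
               $"pdfHeader={HasPdfHeader(bytes)}, encryptMarker={MentionsEncryption(bytes)}";
    }
}
```

Calling PdfSanityCheck.Describe(path) on one of the 9 working files and on each of the 2 failing files gives a quick side-by-side comparison.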

marcominerva commented 1 month ago

@apennypacker-launchcg if you can share the two PDFs here, I can try importing them to verify what happens.

dluc commented 1 month ago

Closing for inactivity

microsoft / kernel-memory