microsoft / chat-copilot

MIT License

Bulk bring your own data #683

Open vicentegarciadiez opened 8 months ago

vicentegarciadiez commented 8 months ago

Hi team, is there any way to bring my own data into a chat in bulk?

I mean, I want to load lots of PDF files into a chat to ask questions about them, but there are limitations like 10 files at a time and size limits.

Thanks in advance.

crickman commented 8 months ago

Hi @vicentegarciadiez, great question. There is a tool for importing outside of the application, but we've recently discovered it doesn't function outside of the dev-environment:

https://github.com/microsoft/chat-copilot/tree/main/tools/importdocument

The kernel-memory repo has all the machinery in place for you to take matters into your own hands. All you need to do is post documents to the same queue and blob store that your chat-copilot is configured for (see the KernelMemory section of appsettings.json for the webapi).

The most concise expression of what this might resemble can be viewed at https://github.com/microsoft/kernel-memory/blob/main/service/Service/Program.cs. (Although you could run yours as a console application.)

  1. Create an IKernelMemory instance (note: this won't need any handlers).
  2. Call memory.ImportDocumentAsync() for each document you want to import. (This will upload the document to the blob store and create a queue message; the chat-copilot memorypipeline will do the actual processing.)
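A minimal sketch of those two steps, assuming the Microsoft.KernelMemory NuGet package; the connector choices and file name here are placeholders, and the builder must be configured to point at the same queue and blob store as your chat-copilot webapi:

```csharp
using Microsoft.KernelMemory;

// Sketch only: configure the builder with the SAME storage/queue connectors
// your chat-copilot deployment uses (see the KernelMemory section of the
// webapi's appsettings.json). The OpenAI key source is an assumption here.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    .Build();

// Step 2: upload each document. With a distributed pipeline, this only posts
// the file to blob storage and enqueues a message; the chat-copilot
// memorypipeline service performs the actual extraction and embedding.
await memory.ImportDocumentAsync("mydocument.docx", documentId: "doc-001");
```

Since no handlers run in this process, the console app exits as soon as the uploads are queued.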
vicentegarciadiez commented 8 months ago

Thanks @crickman for your answer! But I have a question: in your example, will mydocument.docx be available to all chats or only to a selected chat?

Best regards.

crickman commented 7 months ago

Right...good point...I erroneously omitted those details.

This would be a more complete expression (with some of the values expanded as literals):

        string documentId = [The id of the cosmosdb `ChatMemorySource` entity];
        string fileName = ...
        Stream fileContent = ...

        var uploadRequest =
            new DocumentUploadRequest
            {
                DocumentId = documentId,
                Files = new List<DocumentUploadRequest.UploadedFile> { new(fileName, fileContent) },
                Index = "chatmemory",
                Steps = new List<string>() { "extract", "partition", "gen_embeddings", "save_embeddings" },
            };

        uploadRequest.Tags.Add("chatid", "00000000-0000-0000-0000-000000000000"); // Global document.  Replace with chat-id to associate with a single chat
        uploadRequest.Tags.Add("memory", "DocumentMemory");

        await memoryClient.ImportDocumentAsync(uploadRequest, cancellationToken);

The related code in CC is: https://github.com/microsoft/chat-copilot/blob/main/webapi/Extensions/ISemanticMemoryClientExtensions.cs

The code for accessing CosmosDB data is: https://github.com/microsoft/chat-copilot/tree/main/webapi/Storage

vicentegarciadiez commented 7 months ago

Thanks @crickman, and do you know how the images inside a document are indexed? I mean, is OCR processing those images?

Thanks in advance.

crickman commented 7 months ago

I do not believe images are processed using OCR for docx, pptx, xlsx, or pdf.

I have sometimes used external tools to convert documents with complex structure to text and then uploaded the text result. Azure Form Recognizer also has options for more complex document parsing.
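That convert-then-upload workflow can be sketched as follows, again assuming the Microsoft.KernelMemory package and a memory client configured like the upload example above; "invoice.txt" stands in for whatever text your external converter produced:

```csharp
using Microsoft.KernelMemory;

// Sketch only: the builder must be configured with real connectors
// (embedding generator, storage, queue) before Build() will succeed.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    .Build();

// "invoice.txt" is a hypothetical file produced by an external tool
// (e.g. Form Recognizer output flattened to plain text).
string extractedText = await File.ReadAllTextAsync("invoice.txt");

// ImportTextAsync skips file parsing entirely and ingests the text directly,
// so a complex layout converted offline still gets partitioned and embedded.
await memory.ImportTextAsync(extractedText, documentId: "invoice-001");
```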