Open vicentegarciadiez opened 8 months ago
Hi @vicentegarciadiez, great question. There is a tool for importing outside of the application, but we've recently discovered it doesn't function outside of the dev-environment:
https://github.com/microsoft/chat-copilot/tree/main/tools/importdocument
The kernel-memory repo has all the machinery in place for you to take matters into your own hands. All you need to do is post documents to the same queue and blob store that your chat-copilot is configured for (see the KernelMemory section of appsettings.json for webapi).
The most concise expression of what this might look like can be viewed at https://github.com/microsoft/kernel-memory/blob/main/service/Service/Program.cs. (Although you could run yours as a console application.)
Create an IKernelMemory instance (note: this won't need any handlers).
Call memory.ImportDocumentAsync() for each document you want to import. (This will upload the document to the blob store and create a queue message; the chat-copilot memorypipeline will do the actual processing.)
Thanks @crickman for your answer! I have one question: in your example, will mydocument.docx be available to all chats or only to a selected chat?
Best regards.
Right...good point...I erroneously omitted those details.
This would be a more complete expression (with some of the values expanded as literals):
string documentId = [The id of the cosmosdb `ChatMemorySource` entity];
string fileName = ...
Stream fileContent = ...
var uploadRequest =
new DocumentUploadRequest
{
DocumentId = documentId,
Files = new List<DocumentUploadRequest.UploadedFile> { new(fileName, fileContent) },
Index = "chatmemory",
Steps = new List<string>() { "extract", "partition", "gen_embeddings", "save_embeddings" },
};
uploadRequest.Tags.Add("chatid", "00000000-0000-0000-0000-000000000000"); // Global document. Replace with chat-id to associate with a single chat
uploadRequest.Tags.Add("memory", "DocumentMemory");
await memoryClient.ImportDocumentAsync(uploadRequest, cancellationToken);
The related code in chat-copilot is: https://github.com/microsoft/chat-copilot/blob/main/webapi/Extensions/ISemanticMemoryClientExtensions.cs
The code for accessing CosmosDB data is: https://github.com/microsoft/chat-copilot/tree/main/webapi/Storage
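For importing many files at once, the request above could be wrapped in a loop. The following is only a rough sketch under several assumptions, not the chat-copilot implementation: `BulkImporter`, `ImportFolderAsync`, and the `resolveDocumentId` callback are hypothetical names introduced for illustration, the types are assumed to live in the `Microsoft.KernelMemory` namespace, and the caller is assumed to supply an already configured `IKernelMemory` instance. The callback stands in for whatever lookup returns the id of the CosmosDB `ChatMemorySource` entity for a given file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.KernelMemory; // assumed namespace for IKernelMemory and DocumentUploadRequest

// Hypothetical helper, not part of chat-copilot or kernel-memory.
public static class BulkImporter
{
    // The empty GUID marks a global document, as in the snippet above.
    private const string GlobalChatId = "00000000-0000-0000-0000-000000000000";

    public static async Task ImportFolderAsync(
        IKernelMemory memoryClient,
        string folderPath,
        Func<string, string> resolveDocumentId, // hypothetical: maps a file path to its ChatMemorySource id
        string chatId = GlobalChatId,
        CancellationToken cancellationToken = default)
    {
        // Only PDF files are picked up here; adjust the pattern as needed.
        foreach (string filePath in Directory.EnumerateFiles(folderPath, "*.pdf"))
        {
            // Keep the stream open until the import call has completed.
            using FileStream fileContent = File.OpenRead(filePath);

            var uploadRequest = new DocumentUploadRequest
            {
                DocumentId = resolveDocumentId(filePath),
                Files = new List<DocumentUploadRequest.UploadedFile>
                {
                    new(Path.GetFileName(filePath), fileContent)
                },
                Index = "chatmemory",
                Steps = new List<string> { "extract", "partition", "gen_embeddings", "save_embeddings" },
            };
            uploadRequest.Tags.Add("chatid", chatId);
            uploadRequest.Tags.Add("memory", "DocumentMemory");

            // Named argument, because the optional parameters of this method may differ between versions.
            await memoryClient.ImportDocumentAsync(uploadRequest, cancellationToken: cancellationToken);
        }
    }
}
```

Files are submitted one at a time to keep the sketch simple. Each call only uploads the file and enqueues a message, so the actual processing would still be done by the memorypipeline.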
Thanks @crickman. Do you know how the images inside a document are indexed? I mean, is OCR applied to those images?
Thanks in advance.
I do not believe images are processed using OCR for docx, pptx, xlsx, or pdf.
I have sometimes used external tools to convert documents with complex structure to text and then uploaded the text result. Azure Form Recognizer also has some options for more complex document parsing.
Hi team, is there any way to bring my own data to a chat in bulk?
I mean, I want to load lots of PDF files into a chat to ask questions about them, but there are lots of limitations, like 10 files at a time or size limits.
Thanks in advance.