microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
https://microsoft.github.io/kernel-memory
MIT License
1.52k stars 293 forks

Add AWS S3 Document Storage #596

Closed koteus closed 3 months ago

koteus commented 4 months ago

Motivation and Context (Why the change? What's the scenario?)

At my job we are using our company's internal cloud storage solution, which is compatible with the AWS S3 API, and I had to implement a custom storage backend. With this PR I want to make it available to other users.

Another motivation is to make it available in the configuration of the KernelMemory service, because right now that is limited to Azure Blobs only, and this is a blocker for us: we want to use KernelMemory to asynchronously process large files.

High level description (Approach, Design)

I took a look at the code for Azure Blobs and adapted it to AWS S3. Using the official NuGet package AWSSDK.S3 allows it to work with AWS as well as with any compatible system (like MinIO).
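
For context (this is not the PR's code, just a sketch of the AWSSDK.S3 behaviour it relies on), the same client can talk to AWS or to an S3-compatible endpoint such as MinIO by overriding the service URL; the endpoint, credentials, bucket and key below are placeholders:

using Amazon.S3;
using Amazon.S3.Model;

// Point the SDK at a custom endpoint (e.g. MinIO) instead of AWS.
var s3Config = new AmazonS3Config
{
    ServiceURL = "http://localhost:9000", // placeholder endpoint
    ForcePathStyle = true                 // required by most S3-compatible services
};

using var client = new AmazonS3Client("ACCESS_KEY", "SECRET_KEY", s3Config);

// Upload a small object to verify connectivity.
await client.PutObjectAsync(new PutObjectRequest
{
    BucketName = "kernelmemory",
    Key = "default/doc001/file.txt",
    ContentBody = "hello"
});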

Note that I marked this PR as WIP because this is my first time contributing to a .NET project, and I am surely missing some things like tests, configs, etc. I would be glad if you could point out what is missing from my PR and I will add it.

Also feel free to suggest better names for classes, and any other comments are welcome!

koteus commented 4 months ago

@microsoft-github-policy-service agree company="Veepee"

koteus commented 3 months ago

This looks great, thank you. I fixed a few minor things in the code style. There are a few things to address:

  • When invoking async methods, the call should include .ConfigureAwait(false) to prevent await from capturing the current synchronization context and causing deadlocks
  • All async methods should support cancellation tokens, to allow shutting down the execution without delays/errors
  • For async methods that don't run any async code, await Task.Run is not necessary: you can use return Task.CompletedTask or Task.FromResult(...)
  • When referencing local non-static vars/methods, please remember to add the this. prefix, just for consistency with the project code style (I fixed a few, there's a few more to fix)
  • Async method names should have an Async suffix, it's a C# convention
  • In S3Storage.cs around line 200 there's one using statement that should be turned into await using -- however, that syntax is particularly tricky, so if you get stuck, I can fix it before merging

For all the above you should find examples in other classes, e.g. the code style used in AzureBlobsStorage and PostgresDbClient might help. A generic sketch of these patterns is included below.
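
To make the list above concrete, here is a generic, hypothetical fragment (not code from this repo; the class, fields and method names are made up) illustrating those conventions:

using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

internal sealed class S3StorageSketch
{
    private readonly IAmazonS3 _client;
    private readonly string _bucketName;

    public S3StorageSketch(IAmazonS3 client, string bucketName)
    {
        this._client = client;
        this._bucketName = bucketName;
    }

    // Async suffix + cancellation token on every async method
    public async Task WriteFileAsync(string index, string fileName, Stream content, CancellationToken cancellationToken = default)
    {
        // "await using" disposes IAsyncDisposable resources asynchronously
        await using var buffer = new MemoryStream();
        await content.CopyToAsync(buffer, cancellationToken).ConfigureAwait(false);
        buffer.Position = 0;

        var request = new PutObjectRequest
        {
            BucketName = this._bucketName, // "this." prefix for instance members
            Key = $"{index}/{fileName}",
            InputStream = buffer
        };

        // ConfigureAwait(false): don't capture the current synchronization context
        await this._client.PutObjectAsync(request, cancellationToken).ConfigureAwait(false);
    }

    // No async work here: return a completed task instead of wrapping code in Task.Run
    public Task CreateIndexDirectoryAsync(string index, CancellationToken cancellationToken = default)
    {
        return Task.CompletedTask;
    }
}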

All these issues are fixed now. Also, I was able to build all of KernelMemory locally and reference it from my project like this:

<!-- <PackageReference Include="Microsoft.KernelMemory.Core" Version="0.61.240519.2" /> -->
<ProjectReference Include="..\..\kernel-memory\service\Core\Core.csproj" />

and I checked on my localhost that it works and uploads files to S3.
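
A rough sketch of how that serverless wiring could look is below. The WithAWSS3DocumentStorage extension name is only an assumption for illustration (the method added by this PR may be named differently); KernelMemoryBuilder, WithOpenAIDefaults, Build<MemoryServerless> and ImportDocumentAsync are existing Kernel Memory APIs:

// Sketch only: WithAWSS3DocumentStorage is a hypothetical name for the extension added by this PR.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY"))
    .WithAWSS3DocumentStorage(new AWSS3Config
    {
        Auth = AWSS3Config.AuthTypes.AccessKey,
        AccessKey = "....",
        SecretAccessKey = "....",
        BucketName = "kernelmemory",
        Endpoint = "https://s3.us-west-2.amazonaws.com"
    })
    .Build<MemoryServerless>();

// Importing a document should now store its files in the S3 bucket.
await memory.ImportDocumentAsync("sample.pdf", documentId: "doc001");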

What I will try next is to add config to the KernelMemory service, so I can run it in Docker providing the S3 and RabbitMQ configuration.

koteus commented 3 months ago

There are a couple of problems:

1. Bucket name duplicated in the path

I created a bucket named "kernelmemory" with this config

new AWSS3Config
{
    Auth = AWSS3Config.AuthTypes.AccessKey,
    AccessKey = "....",
    SecretAccessKey = "....",
    BucketName = "kernelmemory",
    Endpoint = "https://s3.us-west-2.amazonaws.com"
}

and uploaded some files to the default index. For some reason the bucket name is also used in the path:

[screenshot: the bucket name "kernelmemory" is repeated inside the object path]

2. Deletion not working, throwing exception

I ran example 002 with S3 and the last step, deleting files, fails with an exception:

Deleting memories derived from 04b4132592ea4d6c99b44f2faeb6b8d3202406081139323565100
Unhandled exception. Amazon.S3.AmazonS3Exception: The specified key does not exist.
 ---> Amazon.Runtime.Internal.HttpErrorResponseException: Exception of type 'Amazon.Runtime.Internal.HttpErrorResponseException' was thrown.

Both of these errors are fixed now. They were related to the fact that our company's S3 cloud is configured as bucket-name.custom-domain.com/bucket-name, so the bucket name has to be repeated twice. I fixed it and tested on my localhost with MinIO: it can now create and delete documents, and it also works with my company's S3 cloud after I changed the environment variables.
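
For anyone who wants to reproduce the local test, a MinIO setup can be described with the same AWSS3Config fields shown above; the endpoint and credentials below are MinIO's out-of-the-box defaults, not values from this PR:

// Local MinIO sketch; "minioadmin" and port 9000 are MinIO's defaults.
var config = new AWSS3Config
{
    Auth = AWSS3Config.AuthTypes.AccessKey,
    AccessKey = "minioadmin",
    SecretAccessKey = "minioadmin",
    BucketName = "kernelmemory",
    Endpoint = "http://localhost:9000"
};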

koteus commented 3 months ago

Hello @dluc! I also added configuration for the appsettings generation wizard. I built the Docker image, and I am able to run it and use Kernel Memory as a service from my code by simply swapping the MemoryServerless instance with a MemoryWebClient one. It seems to be working fine.
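
The swap is roughly the sketch below; the service URL is a placeholder for wherever the Docker container is listening, and MemoryWebClient, ImportDocumentAsync and AskAsync are existing Kernel Memory APIs:

// Before: in-process, serverless
// var memory = new KernelMemoryBuilder()....Build<MemoryServerless>();

// After: call the Kernel Memory service over HTTP; the URL below is a placeholder.
var memory = new MemoryWebClient("http://127.0.0.1:9001");

await memory.ImportDocumentAsync("sample.pdf", documentId: "doc001");
var answer = await memory.AskAsync("What does the sample document say?");
Console.WriteLine(answer.Result);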

I think all issues are resolved now, do you think there is anything else I could do?