microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
https://microsoft.github.io/kernel-memory
MIT License

Kernel Memory on kubernetes #505

Closed rmelilloii closed 1 month ago

rmelilloii commented 1 month ago

Context / Scenario

Hello, good morning/afternoon/evening and Happy Monday!

Initially, I must say that I am very impressed with this solution and keen to deploy it as an internal service in our k8s clusters.

Question

To my question: I went through all the available documentation. As of now, is it possible to deploy it to Kubernetes?

Thanks!

dluc commented 1 month ago

Hi @rmelilloii, great to hear the solution could help!

KM should work fine in Kubernetes. I would start with the Docker image mentioned in the main README. Configuration can be provided via a file or env vars; let me know if you encounter any problems.

Aside from the basic Docker image, there are also optimizations: it's possible to turn various aspects of KM on/off, so for example you could run ingestion workers on 10 VMs while running the web service on only 2-3 nodes, if that's something that interests you.
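For reference, settings from appsettings.json map to env vars using the standard ASP.NET Core convention, where a double underscore replaces the ":" section separator. A minimal pod-spec fragment as a sketch (the values are just placeholders):

env:
# The appsettings.json key "KernelMemory:Service:RunWebService" becomes:
- name: KernelMemory__Service__RunWebService
  value: "true"
# The appsettings.json key "KernelMemory:Service:RunHandlers" becomes:
- name: KernelMemory__Service__RunHandlers
  value: "true"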

rmelilloii commented 1 month ago

Hello @dluc, good morning and happy Tuesday! Thanks for your message. I will read "service/Service/appsettings.json" to figure out which variables I can set in my YAML.

My initial stateful deployment should be something like:

I am indeed interested in splitting the different workloads for better resource utilisation. To avoid too much noise here, I will run tests and post back with results/doubts to help others with similar needs, maybe adding working examples to the repo.

Thanks!

rmelilloii commented 1 month ago

Hello again o/ @dluc, sorry to bother you.

The deployment goes green (pod running), but it is in a crash loop. The pod log is not enough to identify the cause. Any suggestions? It complains about "DataIngestion.EmbeddingGeneratorTypes", which has no documented value.

Elasticsearch and RabbitMQ endpoints are up and accepting connections (auth validated).

Any help is very appreciated.

Thanks!

Log:

******
Data ingestion embedding generation (DataIngestion.EmbeddingGeneratorTypes) is not configured.
Please configure the service and retry.

How to configure the service:

1. Set the ASPNETCORE_ENVIRONMENT env var to "Development" or "Production".

   Current value: Development

Documentation: (screenshot attached)


manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: msft-km
  labels:
    service: msft-km
spec:
  replicas: 1
  selector:
    matchLabels:
      service: msft-km
  template:
    metadata:
      labels:
        service: msft-km
    spec:
      containers:
      - name: msft-km
        image: kernelmemory/service:latest
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 9001
          protocol: TCP
        - name: https
          containerPort: 9002
          protocol: TCP
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Development"
        # Whether to run the web service that allows uploading files and searching memory
        # Use these booleans to deploy the web service and the handlers on same/different VMs
        - name: KernelMemory__Service__RunWebService
          value: "true"

        # Whether to run the asynchronous pipeline handlers
        # Use these booleans to deploy the web service and the handlers on same/different VMs
        - name: KernelMemory__Service__RunHandlers
          value: "true"

        # Whether to expose OpenAPI swagger UI at http://127.0.0.1:9001/swagger/index.html
        - name: KernelMemory__Service__OpenApiEnabled
          value: "false"

        # Whether clients must provide some credentials to interact with the HTTP API
        - name: KernelMemory__ServiceAuthorization__Enabled
          value: "false"

        # Currently "APIKey" is the only type supported
        - name: KernelMemory__ServiceAuthorization__AuthenticationType
          value: "APIKey"

        # HTTP header name to check
        - name: KernelMemory__ServiceAuthorization__HttpHeaderName
          value: "Authorization"

        # Define two separate API Keys, to allow key rotation. Both are active.
        # Keys must be different, case-sensitive, and at least 32 chars long,
        # containing only alphanumeric chars and allowed symbols.
        # Symbols allowed: . _ - (dot, underscore, minus).
        - name: KernelMemory__ServiceAuthorization__AccessKey1
          value: "***"
        - name: KernelMemory__ServiceAuthorization__AccessKey2
          value: "***"

        # "AzureBlobs" or "SimpleFileStorage"
        - name: KernelMemory__ContentStorageType
          value: "SimpleFileStorage"

        # "AzureOpenAIText", "OpenAI" or "LlamaSharp"
        - name: KernelMemory__TextGeneratorType
          value: "OpenAI"

        # "AzureOpenAIText", "OpenAI" or "LlamaSharp"
        - name: KernelMemory__DefaultIndexName
          value: "noName"

        # - InProcess: in process .NET orchestrator, synchronous/no queues
        # - Distributed: asynchronous queue based orchestrator
        - name: KernelMemory__DataIngestion__OrchestrationType
          value: "Distributed"

        # "AzureQueue", "RabbitMQ", "SimpleQueues"
        - name: KernelMemory__DataIngestion__DistributedOrchestration__QueueType
          value: "RabbitMQ"

        # Whether the pipeline generates and saves the vectors/embeddings in the memory DBs.
        # When using a memory DB that automatically generates embeddings internally,
        # or performs semantic search internally anyway, this should be False,
        # to avoid generating embeddings that are not used.
        # Examples:
        # * you are using Azure AI Search "semantic search" without "vector search": in this
        #   case you don't need embeddings because Azure AI Search uses a more advanced approach
        #   internally.
        # * you are using a custom Memory DB connector that generates embeddings on the fly
        #   when writing records and when searching: in this case you don't need the pipeline
        #   to calculate embeddings, because your connector does all the work.
        # * you are using a basic "text search" and a DB without "vector search": in this case
        #   embeddings would be unused, so it's better to disable them to save cost and latency.
        - name: KernelMemory__DataIngestion__EmbeddingGenerationEnabled
          value: "true"

        # Vectors can be written to multiple storages, e.g. for data migration, A/B testing, etc.
        # "AzureAISearch", "Qdrant", "SimpleVectorDb"
        - name: KernelMemory__DataIngestion__MemoryDbTypes__0
          value: "Elasticsearch"

        # "None" or "AzureAIDocIntel"
        - name: KernelMemory__DataIngestion__ImageOcrType
          value: "None"

        # "AzureOpenAIEmbedding" or "OpenAI"
        # This is the generator registered for `ITextEmbeddingGeneration` dependency injection.
        - name: KernelMemory__Retrieval__EmbeddingGeneratorType
          value: "OpenAI"

        # "AzureAISearch", "Qdrant", "SimpleVectorDb"
        - name: KernelMemory__Retrieval__MemoryDbType
          value: "Elasticsearch"

        # Maximum number of tokens accepted by the LLM used to generate answers.
        # The number includes the tokens used for the answer, e.g. when using
        # GPT4-32k, set this number to 32768.
        # If the value is not set or less than one, SearchClient will use the
        # max amount of tokens supported by the model in use.
        - name: KernelMemory__Retrieval__SearchClient__MaxAskPromptSize
          value: "-1"

        # Maximum number of relevant sources to consider when generating an answer.
        # The value is also used as the max number of results returned by SearchAsync
        # when passing a limit less or equal to zero.
        - name: KernelMemory__Retrieval__SearchClient__MaxMatchesCount
          value: "100"

        # How many tokens to reserve for the answer generated by the LLM.
        # E.g. if the LLM supports max 4000 tokens, and AnswerTokens is 300, then
        # the prompt sent to LLM will contain max 3700 tokens, composed by
        # prompt + question + grounding information retrieved from memory.
        - name: KernelMemory__Retrieval__SearchClient__AnswerTokens
          value: "300"

        # Text to return when the LLM cannot produce an answer.
        - name: KernelMemory__Retrieval__SearchClient__EmptyAnswer
          value: "INFO NOT FOUND"

        # Options: "Disk" or "Volatile". Volatile data is lost after each execution.
        - name: KernelMemory__Services__SimpleFileStorage__StorageType
          value: "Volatile"

        # Directory where files are stored.
        - name: KernelMemory__Services__SimpleFileStorage__Directory
          value: "_files"

        # Options: "Disk" or "Volatile". Volatile data is lost after each execution.
        - name: KernelMemory__Services__SimpleQueues__StorageType
          value: "Volatile"

        # Directory where files are stored.
        - name: KernelMemory__Services__SimpleQueues__Directory
          value: "_queues"

        # Options: "Disk" or "Volatile". Volatile data is lost after each execution.
        - name: KernelMemory__Services__SimpleVectorDb__StorageType
          value: "Volatile"

        # Directory where files are stored.
        - name: KernelMemory__Services__SimpleVectorDb__Directory
          value: "_vectors"

        # RabbitMQ

        - name: KernelMemory__Services__RabbitMq__Host
          value: "10.43.250.217"
        - name: KernelMemory__Services__RabbitMq__Port
          value: "5672"
        - name: KernelMemory__Services__RabbitMq__Username
          value: "***"
        - name: KernelMemory__Services__RabbitMq__Password
          value: "***"

        # Elasticsearch

        - name: KernelMemory__Services__Elasticsearch__CertificateFingerPrint
          value: ""
        - name: KernelMemory__Services__Elasticsearch__Endpoint
          value: "http://10.43.187.227:9200"
        - name: KernelMemory__Services__Elasticsearch__UserName
          value: "***"          
        - name: KernelMemory__Services__Elasticsearch__Password
          value: "***"
        - name: KernelMemory__Services__Elasticsearch__IndexPrefix
          value: "km-"
        - name: KernelMemory__Services__Elasticsearch__ShardCount
          value: "1"
        - name: KernelMemory__Services__Elasticsearch__Replicas
          value: "0"

        # OpenAI

        # Name of the model used to generate text (text completion or chat completion)
        - name: KernelMemory__Services__OpenAI__TextModel
          value: "gpt-3.5-turbo-16k"
        # The max number of tokens supported by the text model.
        - name: KernelMemory__Services__OpenAI__TextModelMaxTokenTotal
          value: "16384"
        # Name of the model used to generate text embeddings
        - name: KernelMemory__Services__OpenAI__EmbeddingModel
          value: "text-embedding-ada-002"
        # The max number of tokens supported by the embedding model
        # See https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
        - name: KernelMemory__Services__OpenAI__EmbeddingModelMaxTokenTotal
          value: "8191"

        # OpenAI TextGenerationType
        - name: KernelMemory__Services__OpenAI__TextGenerationType
          value: "Auto"          

        # OpenAI API Key
        - name: KernelMemory__Services__OpenAI__APIKey
          value: "***"

        # OpenAI Organization ID (usually empty, unless you have multiple accounts on different orgs)
        - name: KernelMemory__Services__OpenAI__OrgId
          value: ""

        # How many times to retry in case of throttling
        - name: KernelMemory__Services__OpenAI__MaxRetries
          value: "10"

        # Logging

        - name: Logging__LogLevel__Default
          value: "Trace"
        - name: Logging__LogLevel__Microsoft_KernelMemory_Pipeline_Queue_DevTools_SimpleQueue
          value: "Information"
        - name: Logging__LogLevel__Microsoft_AspNetCore
          value: "Trace"

        # Allowed hosts

        - name: AllowedHosts
          value: "*"

        # Urls for Kestrel server endpoints
        - name: Kestrel__Endpoints__Http__Url
          value: "http://*:9001"
        - name: Kestrel__Endpoints__Https__Url
          value: "https://*:9002"
dluc commented 1 month ago

Some of the env vars can also be seen here: https://github.com/microsoft/kernel-memory/blob/main/infra/modules/container-app.bicep

For EmbeddingGeneratorTypes, since it's an array, the env var name is KernelMemory__DataIngestion__EmbeddingGeneratorTypes__0 for the first element, KernelMemory__DataIngestion__EmbeddingGeneratorTypes__1 for the second, and so on

- name: KernelMemory__DataIngestion__EmbeddingGeneratorTypes__0
  value: "Elasticsearch"
rmelilloii commented 1 month ago

Thanks, I really appreciate it. I had the impression that the variable was superseded, since the comments didn't list any possible values for it, only a reference to a related setting.

It is up now, play time! :)

Do you have any documentation regarding:

there are also optimizations: it's possible to turn various aspects of KM on/off, so for example you could run ingestion workers on 10 VMs while running the web service on only 2-3 nodes, if that's something that interests you.

Thanks a lot!

dluc commented 1 month ago

There are two main config settings: KernelMemory.Service.RunWebService (whether a node serves the HTTP API) and KernelMemory.Service.RunHandlers (whether it runs the async ingestion handlers).

Handlers share state via files, which can be stored on disk, Azure Blobs, or MongoDB. When using disk it's harder to share state across VMs, unless you mount the same volume on all of them (see the sketch below).
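A minimal sketch of the shared-disk option, reusing the SimpleFileStorage settings from the manifest above and assuming an existing ReadWriteMany PersistentVolumeClaim named km-shared-files (the claim name and mount path are illustrative):

      containers:
      - name: msft-km
        image: kernelmemory/service:latest
        env:
        # Persist pipeline files on disk instead of volatile memory
        - name: KernelMemory__Services__SimpleFileStorage__StorageType
          value: "Disk"
        # Point the file storage directory at the shared mount
        - name: KernelMemory__Services__SimpleFileStorage__Directory
          value: "/km-files"
        volumeMounts:
        - name: shared-files
          mountPath: /km-files
      volumes:
      - name: shared-files
        persistentVolumeClaim:
          claimName: km-shared-files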

Assuming state is shared via central storage like blobs, a DB, or a mounted disk, the service can run across multiple VMs, splitting the workload. For instance, this could be one setup to scale the web service separately from the async ingestion workload:
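A minimal sketch of that split, reusing the kernelmemory/service image and the two RunWebService/RunHandlers booleans from the manifest above; the deployment names and replica counts are illustrative, and the rest of the shared settings (queue, storage, models) are omitted for brevity:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: msft-km-web
  labels:
    service: msft-km-web
spec:
  replicas: 2
  selector:
    matchLabels:
      service: msft-km-web
  template:
    metadata:
      labels:
        service: msft-km-web
    spec:
      containers:
      - name: msft-km
        image: kernelmemory/service:latest
        ports:
        - name: http
          containerPort: 9001
          protocol: TCP
        env:
        # Serve only the HTTP API on these pods
        - name: KernelMemory__Service__RunWebService
          value: "true"
        - name: KernelMemory__Service__RunHandlers
          value: "false"
        # ...plus the shared settings (queue, storage, models) from the manifest above
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: msft-km-workers
  labels:
    service: msft-km-workers
spec:
  replicas: 10
  selector:
    matchLabels:
      service: msft-km-workers
  template:
    metadata:
      labels:
        service: msft-km-workers
    spec:
      containers:
      - name: msft-km
        image: kernelmemory/service:latest
        env:
        # Run only the async ingestion handlers on these pods
        - name: KernelMemory__Service__RunWebService
          value: "false"
        - name: KernelMemory__Service__RunHandlers
          value: "true"
        # ...plus the shared settings (queue, storage, models) from the manifest above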

In the async ingestion pipelines, each task is managed by a dedicated handler. Handlers provide another way to load balance across multiple nodes. For example, it's possible to control which handlers each node executes, using the config setting array KernelMemory.Service.Handlers.
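A hedged sketch of how that could look as env vars, following the same array-indexing convention described above; the handler step names used here ("extract", "partition", etc.) are assumptions and should be verified against the service documentation and appsettings.json:

        # Run only the listed handlers on this node (step names are illustrative)
        - name: KernelMemory__Service__Handlers__0
          value: "extract"
        - name: KernelMemory__Service__Handlers__1
          value: "partition"
        - name: KernelMemory__Service__Handlers__2
          value: "gen_embeddings"
        - name: KernelMemory__Service__Handlers__3
          value: "save_records"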