Closed: rmelilloii closed this 1 month ago
hi @rmelilloii great to hear the solution could help!
KM should work fine in Kubernetes. I would first try with the Docker image mentioned in the main README. Configuration can be provided via a file or env vars; let me know if you encounter any problems.
Aside from the basic Docker image, there are also optimizations: it's possible to turn various aspects of KM on/off, e.g. you could run ingestion workers on 10 VMs while running the web service on only 2-3 nodes, if that's something that interests you.
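For reference, here's a minimal sketch of the file-based option on Kubernetes, assuming a hypothetical ConfigMap name (km-appsettings) and that the service reads appsettings.Production.json from /app, the usual working directory in ASP.NET images (verify the actual path inside the KM image):

apiVersion: v1
kind: ConfigMap
metadata:
  name: km-appsettings  # hypothetical name
data:
  appsettings.Production.json: |
    {
      "KernelMemory": {
        "Service": { "RunWebService": true, "RunHandlers": true }
      }
    }

and in the Deployment's pod spec, mount it over the service's config file:

spec:
  containers:
    - name: km-service
      image: kernelmemory/service:latest
      volumeMounts:
        - name: km-config
          # /app is an assumption; adjust to where the service loads its config
          mountPath: /app/appsettings.Production.json
          subPath: appsettings.Production.json
  volumes:
    - name: km-config
      configMap:
        name: km-appsettings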
Hello @dluc, good morning and happy Tuesday! Thanks for your message. I will read "service/Service/appsettings.json" to figure out which variables I can set in my YAML.
My initial stateful deployment should be something like the manifest I'll share below.
I am indeed interested in splitting the workloads for better resource utilisation. To avoid too much noise here, I will run tests and post back with results/questions to help others with similar needs, maybe adding working examples to the repo.
Thanks!
Hello again o/ @dluc, sorry to bother you.
The deploy goes green (pod running) but it is stuck in a crash loop. The pod log is not enough to identify the cause. Any suggestions? It complains about "DataIngestion.EmbeddingGeneratorTypes", which has no value listed in the documentation.
Elasticsearch and RabbitMQ endpoints are up and accepting connections (auth validated).
Any help is much appreciated.
Thanks!
Log:
******
Data ingestion embedding generation (DataIngestion.EmbeddingGeneratorTypes) is not configured.
Please configure the service and retry.
How to configure the service:
1. Set the ASPNETCORE_ENVIRONMENT env var to "Development" or "Production".
Current value: Development
Documentation:
Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: msft-km
  labels:
    service: msft-km
spec:
  replicas: 1
  selector:
    matchLabels:
      service: msft-km
  template:
    metadata:
      labels:
        service: msft-km
    spec:
      containers:
        - name: msft-km
          image: kernelmemory/service:latest
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 9001
              protocol: TCP
            - name: https
              containerPort: 9002
              protocol: TCP
          env:
            - name: ASPNETCORE_ENVIRONMENT
              value: "Development"
            # Whether to run the web service that allows uploading files and searching memory.
            # Use these booleans to deploy the web service and the handlers on same/different VMs.
            - name: KernelMemory__Service__RunWebService
              value: "true"
            # Whether to run the asynchronous pipeline handlers.
            # Use these booleans to deploy the web service and the handlers on same/different VMs.
            - name: KernelMemory__Service__RunHandlers
              value: "true"
            # Whether to expose the OpenAPI swagger UI at http://127.0.0.1:9001/swagger/index.html
            - name: KernelMemory__Service__OpenApiEnabled
              value: "false"
            # Whether clients must provide credentials to interact with the HTTP API
            - name: KernelMemory__ServiceAuthorization__Enabled
              value: "false"
            # Currently "APIKey" is the only type supported
            - name: KernelMemory__ServiceAuthorization__AuthenticationType
              value: "APIKey"
            # HTTP header name to check
            - name: KernelMemory__ServiceAuthorization__HttpHeaderName
              value: "Authorization"
            # Define two separate API Keys, to allow key rotation. Both are active.
            # Keys must be different, case-sensitive, at least 32 chars long, and
            # contain only alphanumeric chars and allowed symbols.
            # Symbols allowed: . _ - (dot, underscore, minus).
            - name: KernelMemory__ServiceAuthorization__AccessKey1
              value: "***"
            - name: KernelMemory__ServiceAuthorization__AccessKey2
              value: "***"
            # "AzureBlobs" or "SimpleFileStorage"
            - name: KernelMemory__ContentStorageType
              value: "SimpleFileStorage"
            # "AzureOpenAIText", "OpenAI" or "LlamaSharp"
            - name: KernelMemory__TextGeneratorType
              value: "OpenAI"
            # Default index name when none is specified
            - name: KernelMemory__DefaultIndexName
              value: "noName"
            # - InProcess: in-process .NET orchestrator, synchronous / no queues
            # - Distributed: asynchronous queue-based orchestrator
            - name: KernelMemory__DataIngestion__OrchestrationType
              value: "Distributed"
            # "AzureQueue", "RabbitMQ", "SimpleQueues"
            - name: KernelMemory__DataIngestion__DistributedOrchestration__QueueType
              value: "RabbitMQ"
            # Whether the pipeline generates and saves the vectors/embeddings in the memory DBs.
            # When using a memory DB that automatically generates embeddings internally,
            # or performs semantic search internally anyway, this should be False,
            # to avoid generating embeddings that are not used.
            # Examples:
            # * you are using Azure AI Search "semantic search" without "vector search": in this
            #   case you don't need embeddings because Azure AI Search uses a more advanced approach
            #   internally.
            # * you are using a custom Memory DB connector that generates embeddings on the fly
            #   when writing records and when searching: in this case you don't need the pipeline
            #   to calculate embeddings, because your connector does all the work.
            # * you are using a basic "text search" and a DB without "vector search": in this case
            #   embeddings would be unused, so it's better to disable them to save cost and latency.
            - name: KernelMemory__DataIngestion__EmbeddingGenerationEnabled
              value: "true"
            # Vectors can be written to multiple storages, e.g. for data migration, A/B testing, etc.
            # "AzureAISearch", "Qdrant", "SimpleVectorDb"
            - name: KernelMemory__DataIngestion__MemoryDbTypes__0
              value: "Elasticsearch"
            # "None" or "AzureAIDocIntel"
            - name: KernelMemory__DataIngestion__ImageOcrType
              value: "None"
            # "AzureOpenAIEmbedding" or "OpenAI"
            # This is the generator registered for the `ITextEmbeddingGeneration` dependency injection.
            - name: KernelMemory__Retrieval__EmbeddingGeneratorType
              value: "OpenAI"
            # "AzureAISearch", "Qdrant", "SimpleVectorDb"
            - name: KernelMemory__Retrieval__MemoryDbType
              value: "Elasticsearch"
            # Maximum number of tokens accepted by the LLM used to generate answers.
            # The number includes the tokens used for the answer, e.g. when using
            # GPT4-32k, set this number to 32768.
            # If the value is not set or is less than one, SearchClient will use the
            # max number of tokens supported by the model in use.
            - name: KernelMemory__Retrieval__SearchClient__MaxAskPromptSize
              value: "-1"
            # Maximum number of relevant sources to consider when generating an answer.
            # The value is also used as the max number of results returned by SearchAsync
            # when passing a limit less than or equal to zero.
            - name: KernelMemory__Retrieval__SearchClient__MaxMatchesCount
              value: "100"
            # How many tokens to reserve for the answer generated by the LLM.
            # E.g. if the LLM supports max 4000 tokens, and AnswerTokens is 300, then
            # the prompt sent to the LLM will contain max 3700 tokens, composed of
            # prompt + question + grounding information retrieved from memory.
            - name: KernelMemory__Retrieval__SearchClient__AnswerTokens
              value: "300"
            # Text to return when the LLM cannot produce an answer.
            - name: KernelMemory__Retrieval__SearchClient__EmptyAnswer
              value: "INFO NOT FOUND"
            # Options: "Disk" or "Volatile". Volatile data is lost after each execution.
            - name: KernelMemory__Services__SimpleFileStorage__StorageType
              value: "Volatile"
            # Directory where files are stored.
            - name: KernelMemory__Services__SimpleFileStorage__Directory
              value: "_files"
            # Options: "Disk" or "Volatile". Volatile data is lost after each execution.
            - name: KernelMemory__Services__SimpleQueues__StorageType
              value: "Volatile"
            # Directory where queue data is stored.
            - name: KernelMemory__Services__SimpleQueues__Directory
              value: "_queues"
            # Options: "Disk" or "Volatile". Volatile data is lost after each execution.
            - name: KernelMemory__Services__SimpleVectorDb__StorageType
              value: "Volatile"
            # Directory where vector data is stored.
            - name: KernelMemory__Services__SimpleVectorDb__Directory
              value: "_vectors"
            # RabbitMQ
            - name: KernelMemory__Services__RabbitMq__Host
              value: "10.43.250.217"
            - name: KernelMemory__Services__RabbitMq__Port
              value: "5672"
            - name: KernelMemory__Services__RabbitMq__Username
              value: "***"
            - name: KernelMemory__Services__RabbitMq__Password
              value: "***"
            # Elasticsearch
            - name: KernelMemory__Services__Elasticsearch__CertificateFingerPrint
              value: ""
            - name: KernelMemory__Services__Elasticsearch__Endpoint
              value: "http://10.43.187.227:9200"
            - name: KernelMemory__Services__Elasticsearch__UserName
              value: "***"
            - name: KernelMemory__Services__Elasticsearch__Password
              value: "***"
            - name: KernelMemory__Services__Elasticsearch__IndexPrefix
              value: "km-"
            - name: KernelMemory__Services__Elasticsearch__ShardCount
              value: "1"
            - name: KernelMemory__Services__Elasticsearch__Replicas
              value: "0"
            # OpenAI
            # Name of the model used to generate text (text completion or chat completion)
            - name: KernelMemory__Services__OpenAI__TextModel
              value: "gpt-3.5-turbo-16k"
            # The max number of tokens supported by the text model.
            - name: KernelMemory__Services__OpenAI__TextModelMaxTokenTotal
              value: "16384"
            # Name of the model used to generate text embeddings
            - name: KernelMemory__Services__OpenAI__EmbeddingModel
              value: "text-embedding-ada-002"
            # The max number of tokens supported by the embedding model
            # See https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
            - name: KernelMemory__Services__OpenAI__EmbeddingModelMaxTokenTotal
              value: "8191"
            # OpenAI TextGenerationType
            - name: KernelMemory__Services__OpenAI__TextGenerationType
              value: "Auto"
            # OpenAI API Key
            - name: KernelMemory__Services__OpenAI__APIKey
              value: "***"
            # OpenAI Organization ID (usually empty, unless you have multiple accounts on different orgs)
            - name: KernelMemory__Services__OpenAI__OrgId
              value: ""
            # How many times to retry in case of throttling
            - name: KernelMemory__Services__OpenAI__MaxRetries
              value: "10"
            # Logging
            - name: Logging__LogLevel__Default
              value: "Trace"
            - name: Logging__LogLevel__Microsoft_KernelMemory_Pipeline_Queue_DevTools_SimpleQueue
              value: "Information"
            - name: Logging__LogLevel__Microsoft_AspNetCore
              value: "Trace"
            # Allowed hosts
            - name: AllowedHosts
              value: "*"
            # URLs for Kestrel server endpoints
            - name: Kestrel__Endpoints__Http__Url
              value: "http://*:9001"
            - name: Kestrel__Endpoints__Https__Url
              value: "https://*:9002"
Some of the env vars can also be seen here: https://github.com/microsoft/kernel-memory/blob/main/infra/modules/container-app.bicep

For EmbeddingGeneratorTypes, since it's an array, the env var name is KernelMemory__DataIngestion__EmbeddingGeneratorTypes__0 for the first element, KernelMemory__DataIngestion__EmbeddingGeneratorTypes__1 for the second, and so on:
- name: KernelMemory__DataIngestion__EmbeddingGeneratorTypes__0
  value: "OpenAI"
Thanks, I really appreciate it. I had the impression that the variable was deprecated, since the comments didn't list any possible values for it, only a reference to a related env var.
It is up now, play time! :)
Do you have any documentation regarding:
there are also optimizations: it's possible to turn various aspects of KM on/off, e.g. you could run ingestion workers on 10 VMs while running the web service on only 2-3 nodes, if that's something that interests you.
Thanks a lot!
There are two main config settings:

- KernelMemory.Service.RunWebService: with this you can turn the web service on/off. For example, you might have a set of workers dedicated to processing jobs from the queues, and there you can turn the web service off.
- KernelMemory.Service.RunHandlers: with this you can turn on/off the threads polling the queues for ingestion jobs.

Handlers share state via files, which can be stored on disk, Azure Blobs or MongoDB. When using disk, it's harder to share state across VMs, unless you mount the same disk on all of them.

Assuming state is shared via central storage like blobs, a DB or a mounted disk, the service can work across multiple VMs, splitting the workload. For instance, this could be one setup to scale the web service separately from the async ingestion workload:

- Web service nodes: RunWebService=true, RunHandlers=false
- Ingestion worker nodes: RunWebService=false, RunHandlers=true

In the async ingestion pipelines, each task is managed by a dedicated handler. Handlers provide another way to load balance across multiple nodes, e.g. it's possible to control which handlers to execute using the config setting array KernelMemory.Service.Handlers.
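To make that concrete, here's a minimal sketch of the split as two Kubernetes Deployments, reusing the image and env var style from the manifest above. The names (km-web, km-workers) and replica counts are illustrative assumptions, and both Deployments would also need the shared RabbitMQ / Elasticsearch / OpenAI settings, omitted here for brevity:

# Web service pods: serve the HTTP API, no ingestion handlers.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: km-web  # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: km-web
  template:
    metadata:
      labels:
        app: km-web
    spec:
      containers:
        - name: km-web
          image: kernelmemory/service:latest
          env:
            - name: KernelMemory__Service__RunWebService
              value: "true"
            - name: KernelMemory__Service__RunHandlers
              value: "false"
---
# Ingestion worker pods: poll the queues, no HTTP API.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: km-workers  # illustrative name
spec:
  replicas: 10
  selector:
    matchLabels:
      app: km-workers
  template:
    metadata:
      labels:
        app: km-workers
    spec:
      containers:
        - name: km-workers
          image: kernelmemory/service:latest
          env:
            - name: KernelMemory__Service__RunWebService
              value: "false"
            - name: KernelMemory__Service__RunHandlers
              value: "true"

Note that with SimpleFileStorage the pipeline state files live inside each pod, so a multi-node setup like this needs a content storage all pods can reach, e.g. AzureBlobs or a volume mounted by every pod.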
Context / Scenario
Hello, good morning/afternoon/evening and Happy Monday!
Initially, I must say that I am very impressed with this solution and keen to implement it as an internal service in our k8s clusters.
Question
To my question: I went through all the documentation available. As of now, is it possible to deploy it to Kubernetes?
Thanks!