microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
https://microsoft.github.io/kernel-memory
MIT License

[Bug] Postgres causes ArgumentOutOfRangeException in PostgresDbClient.ConnectAsync #789

Open roldengarm opened 1 week ago

roldengarm commented 1 week ago

Context / Scenario

We're running Kernel Memory as a docker image on Google Cloud Run, and using Google Postgres SQL as vector index. Queue & Storage is on Azure Blob Storage.

When ingesting documents, it throws an ArgumentOutOfRangeException when trying to connect to Postgres.

What happened?

When ingesting documents, it throws an ArgumentOutOfRangeException when trying to connect to Postgres.

Instead, I expect it to ingest successfully.

It worked initially when I tested ingesting about 10 documents, but now it no longer works, even after a restart. I'm ingesting the same documents again, so it first tries to delete the existing records, which fails with the error below.

Importance

I cannot use Kernel Memory

Platform, Language, Versions

Using the latest Docker image on Google Cloud Run, with Google Postgres SQL. The connection string is configured like so:

value: Host=/cloudsql/PROJECT_ID:REGION:km-postgres;Port=5432;Username=USER;Password=PASS;Database=km-default

On Postgres, I ran the following to enable vector indexing and configure an index:

CREATE EXTENSION vector;
CREATE INDEX ON "km-default" USING hnsw (embedding vector_cosine_ops);

Queue & storage are on Azure Storage.

Relevant log output

[01:29:02.748] trce: Microsoft.KernelMemory.Postgres.PostgresDbClient[0] Deleting record 'd=28906//p=6d15ea9691ba4b5b8598a6504e29a011' from table 'public."km-default"'

System.ArgumentOutOfRangeException: Exception of type 'System.ArgumentOutOfRangeException' was thrown. (Parameter 'factory') Actual value was (, ).
at Npgsql.Internal.PgTypeInfoResolverChainBuilder.Build(Action`1 configure)
at Npgsql.NpgsqlSlimDataSourceBuilder.PrepareConfiguration()
at Npgsql.NpgsqlSlimDataSourceBuilder.Build()
at Microsoft.KernelMemory.Postgres.PostgresDbClient.ConnectAsync(CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 670
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DeleteAsync(String tableName, String id, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 631
at Microsoft.KernelMemory.Handlers.SaveRecordsHandler.DeletePreviousRecordsAsync(DataPipeline pipeline, CancellationToken cancellationToken) in /src/service/Core/Handlers/SaveRecordsHandler.cs:line 319
at Microsoft.KernelMemory.Handlers.SaveRecordsHandler.InvokeAsync(DataPipeline pipeline, CancellationToken cancellationToken) in /src/service/Core/Handlers/SaveRecordsHandler.cs:line 111
at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator.RunPipelineStepAsync(DataPipeline pipeline, IPipelineStepHandler handler, CancellationToken cancellationToken) in /src/service/Core/Pipeline/DistributedPipelineOrchestrator.cs:line 226
at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator.<>c__DisplayClass5_0.<<AddHandlerAsync>b__0>d.MoveNext() in /src/service/Core/Pipeline/DistributedPipelineOrchestrator.cs:line 167
--- End of stack trace from previous location ---
at Microsoft.KernelMemory.Orchestration.AzureQueues.AzureQueuesPipeline.<>c__DisplayClass20_0.<<OnDequeue>b__0>d.MoveNext() in /src/extensions/AzureQueues/AzureQueuesPipeline.cs:line 197

roldengarm commented 1 week ago

Switching to private IP & using TCP instead of Unix sockets seems to have fixed the issue.

roldengarm commented 4 days ago

Unfortunately, after a restart of KM, the problem has come back.

See new error log here:

[22:06:48.563] fail: Microsoft.KernelMemory.Postgres.PostgresMemory[0] DB error while attempting to create index
System.ArgumentOutOfRangeException: Exception of type 'System.ArgumentOutOfRangeException' was thrown. (Parameter 'factory') Actual value was (, ).
at Npgsql.Internal.PgTypeInfoResolverChainBuilder.Build(Action`1 configure)
at Npgsql.NpgsqlSlimDataSourceBuilder.PrepareConfiguration()
at Npgsql.NpgsqlSlimDataSourceBuilder.Build()
at Microsoft.KernelMemory.Postgres.PostgresDbClient.ConnectAsync(CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 670
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 99
at Microsoft.KernelMemory.Postgres.PostgresMemory.CreateIndexAsync(String index, Int32 vectorSize, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/PostgresMemory.cs:line 65

Could it be related to this PR @dluc @marcominerva ?

It is pretty strange: it works and then randomly stops working.

marcominerva commented 4 days ago

I have just tried using a local deployment of the service and a connection string to Postgres like this:

Server=XXX.XXX.XXX.XXX;Username=db_user;Database=Memory;Port=5432;Password=XXXXX;SSLMode=Prefer

And everything works as expected, I get no exception.

Could you try running the service locally?

roldengarm commented 3 days ago

Thanks for your reply @marcominerva. In my case it works fine initially as well, but it stops working after some time. I've now reverted to an older version from before the changes, and the problem has not occurred after 16+ hours. On a side note, I'm ingesting data continuously, so KM is very busy.

So it seems to me that the connection changes introduced a regression.

marcominerva commented 3 days ago

I would expect an error of this type to occur systematically. The fact that it happens randomly is very strange. Looking at the source code of Npgsql, based on your stack trace, it seems that the ArgumentOutOfRangeException is thrown in this method of PgTypeInfoResolverChainBuilder:

static PgTypeInfoResolverFactory GetInstance((Type, object Instance) factory) => factory.Instance switch
{
    PgTypeInfoResolverFactory f => f,
    Func<PgTypeInfoResolverFactory> f => f(),
    _ => throw new ArgumentOutOfRangeException(nameof(factory), factory, null)
};

If that is the right spot, it seems that the factory is somehow "lost", which is weird.
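
The "Actual value was (, )" part of the logged message may be a useful clue. The snippet below is a minimal sketch, assuming stand-in types: ResolverFactory and VectorResolverFactory are hypothetical and are not Npgsql's real classes. It reproduces the shape of the switch above and shows that a default tuple, whose two fields are both null, renders as "(, )" and falls through to the throwing arm. If that reading applies here, GetInstance was handed an empty slot rather than a factory of the wrong type. That is a hypothesis and has not been checked against Npgsql itself.

```csharp
using System;

// Hypothetical stand-ins for Npgsql's resolver factory types; not the real classes.
public abstract class ResolverFactory { }
public sealed class VectorResolverFactory : ResolverFactory { }

public static class FactoryLookupDemo
{
    // Same shape as the GetInstance switch quoted above.
    public static ResolverFactory GetInstance((Type, object Instance) factory) => factory.Instance switch
    {
        ResolverFactory f => f,
        Func<ResolverFactory> f => f(),
        _ => throw new ArgumentOutOfRangeException(nameof(factory), factory, null)
    };

    public static void Main()
    {
        // A populated entry resolves normally.
        var populated = (typeof(VectorResolverFactory), (object)new VectorResolverFactory());
        Console.WriteLine(GetInstance(populated).GetType().Name);

        // A default (never assigned) entry has two null fields.
        (Type, object Instance) empty = default;
        Console.WriteLine(empty.ToString()); // prints "(, )"

        try
        {
            GetInstance(empty);
        }
        catch (ArgumentOutOfRangeException e)
        {
            // The message includes the parameter name and the "(, )" actual value.
            Console.WriteLine(e.Message);
        }
    }
}
```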

In the file PostgresDbClient.cs, the NpgsqlDataSource is created here:

https://github.com/microsoft/kernel-memory/blob/0205769ebd2281246db573e3952a04bf86d4eb6d/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs#L666-L675

The NpgsqlDataSourceBuilder, in turn, is created in the class constructor:

https://github.com/microsoft/kernel-memory/blob/0205769ebd2281246db573e3952a04bf86d4eb6d/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs#L55-L56

Among other things, it is responsible for adding some resolver factories to the builder itself. This object is never disposed (it doesn't implement the IDisposable interface, nor is PostgresDbClient itself disposable), but it seems that, in some way, the factories defined by the builder are lost, causing the issue.

I think we can try two solutions:

private async Task<NpgsqlConnection> ConnectAsync(CancellationToken cancellationToken = default)
{
    try
    {
        // Create a fresh builder for every connection instead of reusing the one from the constructor.
        var dataSourceBuilder = new NpgsqlDataSourceBuilder(this._connectionString);
        dataSourceBuilder.UseVector();

        var dataSource = dataSourceBuilder.Build();
        await using (dataSource.ConfigureAwait(false))
        {
            return await dataSource.OpenConnectionAsync(cancellationToken).ConfigureAwait(false);
        }
    }
    // ... existing catch blocks unchanged ...
}

@dluc, what do you think?
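
A variation on the same idea, sketched under assumptions: instead of building per connection, build the NpgsqlDataSource once and reuse it, so that Build() is never called concurrently. SharedDataSourceClient is a hypothetical class, not the actual PostgresDbClient, and the sketch assumes the Npgsql and Pgvector packages are referenced. Npgsql documents data sources as thread-safe and meant to be long-lived; whether concurrent Build() calls on a shared builder are the real cause of this issue is unverified.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Npgsql;
using Pgvector;

// Hypothetical class for illustration; not the actual PostgresDbClient.
public sealed class SharedDataSourceClient : IAsyncDisposable
{
    private readonly NpgsqlDataSource _dataSource;

    public SharedDataSourceClient(string connectionString)
    {
        // The builder is used by a single thread, once, and then discarded.
        var builder = new NpgsqlDataSourceBuilder(connectionString);
        builder.UseVector();
        this._dataSource = builder.Build();
    }

    public async Task<NpgsqlConnection> ConnectAsync(CancellationToken cancellationToken = default)
    {
        // The shared data source hands out connections; it is not rebuilt per call.
        return await this._dataSource.OpenConnectionAsync(cancellationToken).ConfigureAwait(false);
    }

    // Dispose the data source when the client itself is no longer needed.
    public ValueTask DisposeAsync() => this._dataSource.DisposeAsync();
}
```

The trade-off is that the client now owns a disposable resource, so whatever creates it would also need to dispose it.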

dluc commented 2 days ago

Looks like we keep having problems with the Postgres connector, even after cleaning and refactoring the code. There might be something fundamentally wrong with the approach, I don't know. I use Postgres a lot but only for short runs, e.g. a few hours at most, and I never faced any problem.

About upgrading Npgsql, it should be easy to test it locally without a full release.

Another thing we can do is try catching the exception and handling it explicitly. Could you include the full stack trace?
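
As a rough illustration of catching and handling the exception explicitly, here is a minimal sketch of a retry wrapper. RetryHelper and RetryOnArgumentExceptionAsync are hypothetical names, not part of Kernel Memory or Npgsql. It catches ArgumentException, which also covers ArgumentOutOfRangeException, and retries a limited number of times. This would only help if the failure is transient; if the underlying state stays broken after the first error, retrying alone will not recover.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class RetryHelper
{
    public static async Task<T> RetryOnArgumentExceptionAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts = 3,
        TimeSpan? delay = null,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1) { throw new ArgumentOutOfRangeException(nameof(maxAttempts)); }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation(cancellationToken).ConfigureAwait(false);
            }
            catch (ArgumentException e) when (attempt < maxAttempts)
            {
                // ArgumentOutOfRangeException derives from ArgumentException, so both are caught.
                // On the last attempt the filter is false and the exception propagates.
                Console.Error.WriteLine($"Attempt {attempt} failed with {e.GetType().Name}; retrying.");
                await Task.Delay(delay ?? TimeSpan.FromMilliseconds(200), cancellationToken).ConfigureAwait(false);
            }
        }
    }
}
```

A caller could wrap the connection step, for example `await RetryHelper.RetryOnArgumentExceptionAsync(ct => client.ConnectAsync(ct))`, where `client` is whatever object exposes the connect method.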

roldengarm commented 2 days ago

@marcominerva thank you very much for the insights. I had a look at the code as well, but I'm not sure why it would randomly throw an error.

@dluc the stack traces are in my comments. I've added some line breaks to make it clearer.

Today I did another test. I switched back to the latest version and ran the ingestion process. This time, after ~1.5 hours, I got ArgumentExceptions, but completely different ones. Stack trace below. I've restarted it since, and will let you know if I get another exception.

Our production instance, which is under a higher load but on the older version, still runs fine without exceptions.

fail: Microsoft.KernelMemory.Postgres.PostgresMemory[0] DB error while attempting to create index
System.ArgumentException: Invalid bitstring character 'p' at index: 0 (Parameter 'value')
at Npgsql.Internal.Converters.StringBitStringConverter.GetSize(SizeContext context, String value, Object& writeState)
at Npgsql.Internal.PgConverter`1.GetSizeAsObject(SizeContext context, Object value, Object& writeState)
at Npgsql.Internal.PgTypeInfo.BindObject(PgConverter converter, Object value, Size& size, Object& writeState, DataFormat& format, Nullable`1 formatPreference)
at Npgsql.NpgsqlParameter.Bind(DataFormat& format, Size& size, Nullable`1 requiredFormat)
at Npgsql.NpgsqlParameterCollection.ProcessParameters(PgSerializerOptions options, Boolean validateValues, CommandType commandType)
at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 123
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 136
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 140
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 140
at Microsoft.KernelMemory.Postgres.PostgresMemory.CreateIndexAsync(String index, Int32 vectorSize, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/PostgresMemory.cs:line 65

Regarding "About upgrading Npgsql, it should be easy to test it locally without a full release": I only have a Google Cloud Run setup using the Docker image, so I can't run a custom build. I'm more than happy to test if you're able to push a version to Docker Hub.

roldengarm commented 8 hours ago

After ~24 hours I'm getting ArgumentExceptions again on the latest Docker version, stack trace below. Still all good on the older Docker version. So I'm pretty sure that the recent changes in the Postgres connection have caused this regression.

fail: Microsoft.KernelMemory.Postgres.PostgresMemory[0] DB error while attempting to create index
System.ArgumentException: Invalid bitstring character 'p' at index: 0 (Parameter 'value')
at Npgsql.Internal.Converters.StringBitStringConverter.GetSize(SizeContext context, String value, Object& writeState)
at Npgsql.Internal.PgConverter`1.GetSizeAsObject(SizeContext context, Object value, Object& writeState)
at Npgsql.Internal.PgTypeInfo.BindObject(PgConverter converter, Object value, Size& size, Object& writeState, DataFormat& format, Nullable`1 formatPreference)
at Npgsql.NpgsqlParameter.Bind(DataFormat& format, Size& size, Nullable`1 requiredFormat)
at Npgsql.NpgsqlParameterCollection.ProcessParameters(PgSerializerOptions options, Boolean validateValues, CommandType commandType)
at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 123
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 136
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 140
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 140
at Microsoft.KernelMemory.Postgres.PostgresMemory.CreateIndexAsync(String index, Int32 vectorSize, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/PostgresMemory.cs:line 65