Open roldengarm opened 1 week ago
Switching to private IP & using TCP instead of Unix sockets seems to have fixed the issue.
Unfortunately, after a restart of KM, the problem has come back.
See new error log here:
[22:06:48.563] fail: Microsoft.KernelMemory.Postgres.PostgresMemory[0] DB error while attempting to create index
System.ArgumentOutOfRangeException: Exception of type 'System.ArgumentOutOfRangeException' was thrown. (Parameter 'factory') Actual value was (, ).
at Npgsql.Internal.PgTypeInfoResolverChainBuilder.Build(Action`1 configure)
at Npgsql.NpgsqlSlimDataSourceBuilder.PrepareConfiguration()
at Npgsql.NpgsqlSlimDataSourceBuilder.Build()
at Microsoft.KernelMemory.Postgres.PostgresDbClient.ConnectAsync(CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 670
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 99
at Microsoft.KernelMemory.Postgres.PostgresMemory.CreateIndexAsync(String index, Int32 vectorSize, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/PostgresMemory.cs:line 65
Could it be related to this PR @dluc @marcominerva ?
It is pretty strange, it works and then randomly stops working.
I have just tried using a local deployment of the service and a connection string to Postgres like this:
Server=XXX.XXX.XXX.XXX;Username=db_user;Database=Memory;Port=5432;Password=XXXXX;SSLMode=Prefer
And everything works as expected, I get no exception.
Could you try running the service locally?
Thanks for your reply @marcominerva. In my case it works fine initially as well, but it stops working after some time. I've now reverted to an older version from before the changes, and the problem has not occurred after 16+ hours. On a side note, I'm ingesting data continuously, so KM is very busy.
So, to me, the connection changes seem to have introduced a regression.
I would expect an error of this type to occur systematically. The fact that it happens randomly is very strange. Looking at the source code of Npgsql, based on your stack trace, it seems that ArgumentOutOfRangeException is thrown in this method of PgTypeInfoResolverChainBuilder:
static PgTypeInfoResolverFactory GetInstance((Type, object Instance) factory) => factory.Instance switch
{
PgTypeInfoResolverFactory f => f,
Func<PgTypeInfoResolverFactory> f => f(),
_ => throw new ArgumentOutOfRangeException(nameof(factory), factory, null)
};
If this is the right spot, it seems that the factory is "lost", which is weird.
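To make the "lost factory" reading concrete, here is a minimal sketch that reuses the switch shape quoted above on stand-in types. `ResolverFactory`, `VectorResolverFactory`, and `FactoryLookupDemo` are hypothetical names for illustration only, not Npgsql types. It shows that an entry whose Instance is null (for example a default tuple) falls through to the last arm, and .NET formats a tuple of two nulls as "(, )", which matches the "Actual value was (, )" text in the log. How such an entry would end up in the chain is not established in this thread.

```csharp
using System;

// Hypothetical stand-ins for Npgsql's internal types; only the shape matters here.
public abstract class ResolverFactory { }
public sealed class VectorResolverFactory : ResolverFactory { }

public static class FactoryLookupDemo
{
    // Same switch shape as the Npgsql method quoted above.
    public static ResolverFactory GetInstance((Type, object Instance) factory) => factory.Instance switch
    {
        ResolverFactory f => f,
        Func<ResolverFactory> f => f(),
        _ => throw new ArgumentOutOfRangeException(nameof(factory), factory, null)
    };

    public static void Main()
    {
        // A populated entry resolves normally.
        var ok = GetInstance((typeof(VectorResolverFactory), new VectorResolverFactory()));
        Console.WriteLine(ok.GetType().Name);

        // A default (empty) entry has a null Instance, so it reaches the discard arm.
        (Type, object Instance) empty = default;
        try
        {
            GetInstance(empty);
        }
        catch (ArgumentOutOfRangeException e)
        {
            // ActualValue is the boxed tuple that was passed in.
            Console.WriteLine($"{e.ParamName}: {e.ActualValue}");
        }
    }
}
```

If this matches what happens inside Npgsql, the exception would point at an empty slot in the builder's factory list rather than at a wrong factory type.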
In the file PostgresDbClient.cs, the NpgsqlDataSource is created here:
The NpgsqlDataSourceBuilder, in turn, is created in the class constructor:
Among other things, it is responsible for adding some resolver factories to the builder itself. This object is never disposed (it doesn't implement the IDisposable interface, and PostgresDbClient itself is not disposable either), but it seems that, in some way, the factories defined by the builder are lost, causing the issue.
I think we can try two solutions:
Create the NpgsqlDataSourceBuilder inside the ConnectAsync method, something like:
private async Task<NpgsqlConnection> ConnectAsync(CancellationToken cancellationToken = default)
{
    try
    {
        // Use a fresh builder on every call instead of the one created in the constructor.
        var dataSourceBuilder = new NpgsqlDataSourceBuilder(this._connectionString);
        dataSourceBuilder.UseVector();
        var dataSource = dataSourceBuilder.Build();
        await using (dataSource.ConfigureAwait(false))
        {
            return await dataSource.OpenConnectionAsync(cancellationToken).ConfigureAwait(false);
        }
    }
    catch (Exception)
    {
        // The existing error handling of ConnectAsync would go here.
        throw;
    }
}
@dluc, what do you think?
Looks like we keep having problems with the Postgres connector, even after cleaning and refactoring the code. There might be something fundamentally wrong in the approach, I don't know. I use Postgres a lot but only for short runs, e.g. a few hours at most, and I never faced any problem.
About upgrading Npgsql, it should be easy to test it locally without a full release.
Another thing we can do is try catching the exception and handling it explicitly. Could you include the full stack trace?
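As a rough illustration of "catching the exception and handling it explicitly", here is a minimal sketch of a retry wrapper. `TransientRetry`, `RunAsync`, and its parameters are hypothetical names, not part of Kernel Memory or Npgsql, and the attempt count and delay are arbitrary. It catches ArgumentException, which also covers ArgumentOutOfRangeException since that type derives from it. It assumes a retry can actually succeed, which the thread has not confirmed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TransientRetry
{
    // Hypothetical helper: runs `operation`, retrying when it throws ArgumentException.
    // The last failed attempt is not caught, so the original exception reaches the caller.
    public static async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts = 3,
        TimeSpan? delay = null,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation(cancellationToken).ConfigureAwait(false);
            }
            catch (ArgumentException) when (attempt < maxAttempts)
            {
                // Wait briefly before the next attempt.
                await Task.Delay(delay ?? TimeSpan.FromMilliseconds(200), cancellationToken).ConfigureAwait(false);
            }
        }
    }
}
```

A call site might look like `await TransientRetry.RunAsync(ct => client.DoesTableExistAsync(tableName, ct))`, where `client` stands for a PostgresDbClient instance. A wrapper like this would only hide the symptom; it does not explain why the exception occurs.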
@marcominerva thank you very much for the insights. I had a look at the code as well, but I'm not sure why it would randomly throw an error.
@dluc the stack traces are in my comments. I've added some line breaks to make it clearer.
Today I did another test. I reverted back to the latest version and ran the ingestion process. This time, after ~1.5 hours, I got ArgumentExceptions, but a completely different kind this time. Stack trace below.
I've restarted it there, will let you know if I get another exception.
Our production instance which is under a higher load, but on the older version, still runs fine without exceptions.
fail: Microsoft.KernelMemory.Postgres.PostgresMemory[0] DB error while attempting to create index
System.ArgumentException: Invalid bitstring character 'p' at index: 0 (Parameter 'value')
at Npgsql.Internal.Converters.StringBitStringConverter.GetSize(SizeContext context, String value, Object& writeState)
at Npgsql.Internal.PgConverter`1.GetSizeAsObject(SizeContext context, Object value, Object& writeState)
at Npgsql.Internal.PgTypeInfo.BindObject(PgConverter converter, Object value, Size& size, Object& writeState, DataFormat& format, Nullable`1 formatPreference)
at Npgsql.NpgsqlParameter.Bind(DataFormat& format, Size& size, Nullable`1 requiredFormat)
at Npgsql.NpgsqlParameterCollection.ProcessParameters(PgSerializerOptions options, Boolean validateValues, CommandType commandType)
at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 123
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 136
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 140
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 140
at Microsoft.KernelMemory.Postgres.PostgresMemory.CreateIndexAsync(String index, Int32 vectorSize, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/PostgresMemory.cs:line 65
About upgrading Npgsql, it should be easy to test it locally without a full release.
I only have a Google Cloud Run set up using the Docker image, so I can't run a custom build. More than happy to test if you're able to push a version to Docker Hub
After ~24 hours I'm getting ArgumentExceptions again on the latest Docker version, stack trace below. Still all good on the older Docker version. So I'm pretty sure that the recent changes in the Postgres connection have caused this regression.
fail: Microsoft.KernelMemory.Postgres.PostgresMemory[0] DB error while attempting to create index
System.ArgumentException: Invalid bitstring character 'p' at index: 0 (Parameter 'value')
at Npgsql.Internal.Converters.StringBitStringConverter.GetSize(SizeContext context, String value, Object& writeState)
at Npgsql.Internal.PgConverter`1.GetSizeAsObject(SizeContext context, Object value, Object& writeState)
at Npgsql.Internal.PgTypeInfo.BindObject(PgConverter converter, Object value, Size& size, Object& writeState, DataFormat& format, Nullable`1 formatPreference)
at Npgsql.NpgsqlParameter.Bind(DataFormat& format, Size& size, Nullable`1 requiredFormat)
at Npgsql.NpgsqlParameterCollection.ProcessParameters(PgSerializerOptions options, Boolean validateValues, CommandType commandType)
at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 123
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 136
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 140
at Microsoft.KernelMemory.Postgres.PostgresDbClient.DoesTableExistAsync(String tableName, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs:line 140
at Microsoft.KernelMemory.Postgres.PostgresMemory.CreateIndexAsync(String index, Int32 vectorSize, CancellationToken cancellationToken) in /src/extensions/Postgres/Postgres/PostgresMemory.cs:line 65
Context / Scenario
We're running Kernel Memory as a docker image on Google Cloud Run, and using Google Postgres SQL as vector index. Queue & Storage is on Azure Blob Storage.
When ingesting documents, it throws an ArgumentOutOfRangeException when trying to connect to Postgres.
What happened?
When ingesting documents, it throws an ArgumentOutOfRangeException when trying to connect to Postgres.
Instead, I expect it to ingest successfully.
It worked initially when I tested ingesting about 10 documents. But now it no longer works, even after a restart. I'm trying to ingest the same documents, so it tries to delete the existing record first, which then fails with the error below.
Importance
I cannot use Kernel Memory
Platform, Language, Versions
Using the latest Docker image on Google Cloud Run with Google Postgres SQL. The connection string is configured like so:
value: Host=/cloudsql/PROJECT_ID:REGIO:km-postgres;Port=5432;Username=USER;Password=PASS;Database=km-default
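Since the thread compares a Unix socket connection with a TCP one, a small sketch for telling the two apart may help when checking a deployment's configuration. It relies on Npgsql treating a Host value that starts with a slash as a Unix domain socket directory. `ConnectionStringInspector` and `DescribeTransport` are hypothetical names, and the parsing uses the generic DbConnectionStringBuilder from the .NET base library rather than Npgsql's own builder, so this is only an approximation of how Npgsql reads the string.

```csharp
using System;
using System.Data.Common;

public static class ConnectionStringInspector
{
    // Hypothetical helper: reports "unix-socket", "tcp", or "unknown" for a connection string.
    public static string DescribeTransport(string connectionString)
    {
        var builder = new DbConnectionStringBuilder { ConnectionString = connectionString };

        // Both keywords appear in this thread: "Host" in the Cloud Run setup, "Server" in the local test.
        foreach (var key in new[] { "Host", "Server" })
        {
            if (builder.TryGetValue(key, out var value) && value is string host)
            {
                // A leading slash denotes a socket directory, anything else a network host.
                return host.StartsWith("/", StringComparison.Ordinal) ? "unix-socket" : "tcp";
            }
        }

        return "unknown";
    }

    public static void Main()
    {
        Console.WriteLine(DescribeTransport(
            "Host=/cloudsql/PROJECT_ID:REGIO:km-postgres;Port=5432;Username=USER;Password=PASS;Database=km-default"));
        Console.WriteLine(DescribeTransport(
            "Server=XXX.XXX.XXX.XXX;Username=db_user;Database=Memory;Port=5432;Password=XXXXX;SSLMode=Prefer"));
    }
}
```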
On Postgres, I've run below to enable vector indexing & configured an index:
Queue & storage are on Azure Storage.
Relevant log output