microsoft / semantic-kernel

Integrate cutting-edge LLM technology quickly and easily into your apps
https://aka.ms/semantic-kernel
MIT License

Chroma .NET MemoryRecordMetadata field 'is_reference' is boolean but is saved as number in ChromaDB #2049

Closed alexminza closed 1 year ago

alexminza commented 1 year ago

Describe the bug

The boolean is_reference field is saved as a number in Chroma with the default SQLite implementation. Subsequent reads fail with an exception during MemoryRecordMetadata deserialization.

To Reproduce

Run this example: https://github.com/microsoft/semantic-kernel/blob/main/samples/notebooks/dotnet/09-memory-with-chroma.ipynb

It fails on the SearchAsync step with an exception when the numeric is_reference value is deserialized as a boolean:

Error: System.Text.Json.JsonException: The JSON value could not be converted to Microsoft.SemanticKernel.Memory.MemoryRecordMetadata. Path: $.is_reference | LineNumber: 0 | BytePositionInLine: 98.
---> System.InvalidOperationException: Cannot get the value of a token type 'Number' as a boolean.
at System.Text.Json.ThrowHelper.ThrowInvalidOperationException_ExpectedBoolean(JsonTokenType tokenType)
at System.Text.Json.Utf8JsonReader.GetBoolean()
at System.Text.Json.Serialization.JsonConverter`1.TryRead(Utf8JsonReader& reader, Type typeToConvert, JsonSerializerOptions options, ReadStack& state, T& value)
at System.Text.Json.Serialization.JsonConverter`1.TryReadAsObject(Utf8JsonReader& reader, JsonSerializerOptions options, ReadStack& state, Object& value)
at System.Text.Json.Serialization.Converters.LargeObjectWithParameterizedConstructorConverter`1.ReadAndCacheConstructorArgument(ReadStack& state, Utf8JsonReader& reader, JsonParameterInfo jsonParameterInfo)
at System.Text.Json.Serialization.Converters.ObjectWithParameterizedConstructorConverter`1.OnTryRead(Utf8JsonReader& reader, Type typeToConvert, JsonSerializerOptions options, ReadStack& state, T& value)
at System.Text.Json.Serialization.JsonConverter`1.TryRead(Utf8JsonReader& reader, Type typeToConvert, JsonSerializerOptions options, ReadStack& state, T& value)
at System.Text.Json.Serialization.JsonConverter`1.ReadCore(Utf8JsonReader& reader, JsonSerializerOptions options, ReadStack& state)
--- End of inner exception stack trace ---
at System.Text.Json.ThrowHelper.ReThrowWithPath(ReadStack& state, Utf8JsonReader& reader, Exception ex)
at System.Text.Json.Serialization.JsonConverter`1.ReadCore(Utf8JsonReader& reader, JsonSerializerOptions options, ReadStack& state)
at System.Text.Json.JsonSerializer.ReadFromSpan[TValue](ReadOnlySpan`1 utf8Json, JsonTypeInfo jsonTypeInfo, Nullable`1 actualByteCount)
at System.Text.Json.JsonSerializer.ReadFromSpan[TValue](ReadOnlySpan`1 json, JsonTypeInfo jsonTypeInfo)
at Microsoft.SemanticKernel.Memory.MemoryRecord.FromJsonMetadata(String json, Nullable`1 embedding, String key, Nullable`1 timestamp)
at Microsoft.SemanticKernel.Connectors.Memory.Chroma.ChromaMemoryStore.GetMemoryRecordFromModel(List`1 metadatas, List`1 embeddings, List`1 ids, Int32 recordIndex)
at Microsoft.SemanticKernel.Connectors.Memory.Chroma.ChromaMemoryStore.GetMemoryRecordFromQueryResultModel(ChromaQueryResultModel queryResultModel, Int32 recordIndex)
at Microsoft.SemanticKernel.Connectors.Memory.Chroma.ChromaMemoryStore.GetNearestMatchesAsync(String collectionName, Embedding`1 embedding, Int32 limit, Double minRelevanceScore, Boolean withEmbeddings, CancellationToken cancellationToken)+MoveNext()
at Microsoft.SemanticKernel.Connectors.Memory.Chroma.ChromaMemoryStore.GetNearestMatchesAsync(String collectionName, Embedding`1 embedding, Int32 limit, Double minRelevanceScore, Boolean withEmbeddings, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
at Microsoft.SemanticKernel.Memory.SemanticTextMemory.SearchAsync(String collection, String query, Int32 limit, Double minRelevanceScore, Boolean withEmbeddings, CancellationToken cancellationToken)+MoveNext()
at Microsoft.SemanticKernel.Memory.SemanticTextMemory.SearchAsync(String collection, String query, Int32 limit, Double minRelevanceScore, Boolean withEmbeddings, CancellationToken cancellationToken)+MoveNext()
at Microsoft.SemanticKernel.Memory.SemanticTextMemory.SearchAsync(String collection, String query, Int32 limit, Double minRelevanceScore, Boolean withEmbeddings, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
at System.Linq.AsyncEnumerable.<TryGetFirst>g__Core|95_0[TSource](IAsyncEnumerable`1 source, CancellationToken cancellationToken) in /_/Ix.NET/Source/System.Linq.Async/System/Linq/Operators/FirstOrDefault.cs:line 130
at System.Linq.AsyncEnumerable.<TryGetFirst>g__Core|95_0[TSource](IAsyncEnumerable`1 source, CancellationToken cancellationToken) in /_/Ix.NET/Source/System.Linq.Async/System/Linq/Operators/FirstOrDefault.cs:line 132
at System.Linq.AsyncEnumerable.<FirstOrDefaultAsync>g__Core|91_0[TSource](IAsyncEnumerable`1 source, CancellationToken cancellationToken) in /_/Ix.NET/Source/System.Linq.Async/System/Linq/Operators/FirstOrDefault.cs:line 30
at Submission#6.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)

Reading the records stored in the Chroma database:

curl --location 'http://localhost:8000/api/v1/collections/3d657269-3367-4ab6-b779-2478eda60e6b/get' \
--header 'Content-Type: application/json' \
--data '{}'
{
    "ids": [
        "info1",
        "info2",
        "info3",
        "info4",
        "info5"
    ],
    "embeddings": null,
    "metadatas": [
        {
            "additional_metadata": "",
            "description": "",
            "external_source_name": "",
            "id": "info1",
            "is_reference": 0,
            "text": "My name is Andrea"
        },
        {
            "additional_metadata": "",
            "description": "",
            "external_source_name": "",
            "id": "info2",
            "is_reference": 0,
            "text": "I currently work as a tourist operator"
        },
        {
            "additional_metadata": "",
            "description": "",
            "external_source_name": "",
            "id": "info3",
            "is_reference": 0,
            "text": "I currently live in Seattle and have been living there since 2005"
        },
        {
            "additional_metadata": "",
            "description": "",
            "external_source_name": "",
            "id": "info4",
            "is_reference": 0,
            "text": "I visited France and Italy five times since 2015"
        },
        {
            "additional_metadata": "",
            "description": "",
            "external_source_name": "",
            "id": "info5",
            "is_reference": 0,
            "text": "My family is from New York"
        }
    ],
    "documents": [
        null,
        null,
        null,
        null,
        null
    ]
}
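The response above shows is_reference coming back as 0 rather than false, which is what a strict boolean deserializer rejects. As a rough illustration only, here is a minimal Python sketch of a tolerant read that coerces 0/1 back to a boolean before further processing. The function name normalize_is_reference is hypothetical, and this is neither the connector's C# code nor the fix that was eventually merged; it only shows the idea of accepting both representations.

```python
import json


def normalize_is_reference(metadata: dict) -> dict:
    """Return a copy of one metadata record with is_reference coerced to bool.

    Hypothetical helper for illustration; not part of Semantic Kernel or Chroma.
    """
    fixed = dict(metadata)
    value = fixed.get("is_reference")
    # bool is a subclass of int in Python, so check for bool first.
    if isinstance(value, bool):
        return fixed
    # Accept the integer forms that SQLite-backed storage returns.
    if isinstance(value, int) and value in (0, 1):
        fixed["is_reference"] = bool(value)
        return fixed
    # Anything else (missing key, string, other number) is treated as an error.
    raise ValueError(f"unexpected is_reference value: {value!r}")


raw = '{"id": "info1", "is_reference": 0, "text": "My name is Andrea"}'
record = normalize_is_reference(json.loads(raw))
print(json.dumps(record))
# prints {"id": "info1", "is_reference": false, "text": "My name is Andrea"}
```

Records that already hold a real boolean pass through unchanged, so the same read path would work against both older and newer storage backends.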


alexminza commented 1 year ago

According to the SQLite documentation:

SQLite does not have a separate Boolean storage class. Instead, Boolean values are stored as integers 0 (false) and 1 (true).

https://www.sqlite.org/datatype3.html#boolean_datatype
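The behavior quoted from the SQLite documentation can be checked with a small sketch using Python's built-in sqlite3 module. The table and column names here are made up for the demonstration and are not Chroma's actual schema; the point is only that a boolean written to SQLite is read back as an integer.

```python
import sqlite3


def roundtrip_bool(value: bool):
    """Store a Python bool in an in-memory SQLite table and read it back.

    Returns the value as read plus SQLite's reported storage class.
    The 'metadata' table is a stand-in, not Chroma's real layout.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE metadata (key TEXT, value BOOLEAN)")
        conn.execute(
            "INSERT INTO metadata VALUES (?, ?)", ("is_reference", value)
        )
        stored, storage_class = conn.execute(
            "SELECT value, typeof(value) FROM metadata"
        ).fetchone()
        return stored, storage_class
    finally:
        conn.close()


stored, storage_class = roundtrip_bool(False)
print(type(stored).__name__, stored, storage_class)
# prints int 0 integer
```

Even though the column is declared BOOLEAN, the value comes back as a plain integer, so any client that expects a JSON true/false has to convert it explicitly.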

Chroma is moving away from DuckDB to SQLite:

https://github.com/chroma-core/chroma/issues/400

Migrate duckdb to sqllite

https://docs.trychroma.com/migration#migration-from-040-to-040---july-17-2023

New data layout

This version of Chroma drops duckdb and clickhouse in favor of sqlite for metadata storage. This means migrating data over. We have created a migration CLI utility to do this.

alexminza commented 1 year ago

@dmytrostruk it seems that you are the expert on this subject :)

Chroma memory store - C# implementation (#1634) https://github.com/microsoft/semantic-kernel/commit/85d420f77c5c8a59eb39deaf78aaff2aaf3b2337

dmytrostruk commented 1 year ago

@alexminza Thanks for creating this issue! I will take a look and create a PR to resolve it.

nathansolidatus commented 1 year ago

Same issue here. Is there any workaround before we get a fix?

dmytrostruk commented 1 year ago

> Same issue here. Is there any workaround before we get a fix?

@nathansolidatus If this worked for you previously, it may be worth trying an earlier version of Chroma. Meanwhile, I'm going to create a PR today, and we will release a fix in the coming days.

alexminza commented 1 year ago

The issue seems to have appeared with Chroma 0.4.0, which made SQLite the default database engine.

https://github.com/chroma-core/chroma/releases/tag/0.4.0

dmytrostruk commented 1 year ago

@alexminza Thank you for catching this! A PR is open: https://github.com/microsoft/semantic-kernel/pull/2072