microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
https://microsoft.github.io/kernel-memory
MIT License
1.63k stars 313 forks source link

[Bug] SQL restriction of 256 chars for Tags #854

Closed bossbast1 closed 1 month ago

bossbast1 commented 1 month ago

Context / Scenario

Hello,

Currently you are hardcoding 256 char limit for values in SQL tags [value] NVARCHAR(256) NOT NULL This is problematic for tags longer that this restriction, in our case hyperlink, but can be path, short description, anything really. Not sure if this restriction is present in other MemoryDb, or whether it is SQL specific.

For a quick fix, I manually adjusted the column size in SQL, and it seems it works fine after. This could work, but we need the ability to create new indexes in our app, so we need this feature to be supported directly by you (if if does not break anything else).

Would it be possible to enhance code along with SqlServerConfig and give us the ability to adjust the varchar size? Either give ability to write our own number, or a bool/enum value to select between 256 (512,1024,2048...) or MAX.

example of your code that would need adjustments:

public string PrepareCreateIndexQuery(int sqlServerVersion, string index, int vectorSize)
{
    var sql = $"""
               BEGIN TRANSACTION;

                   INSERT INTO {this.GetFullTableName(this._config.MemoryCollectionTableName)}([id])
                       VALUES (@index);

                   IF OBJECT_ID(N'{this.GetFullTableName($"{this._config.TagsTableName}_{index}")}', N'U') IS NULL
                       CREATE TABLE {this.GetFullTableName($"{this._config.TagsTableName}_{index}")}
                       (
                           [memory_id] UNIQUEIDENTIFIER NOT NULL,
                           [name] NVARCHAR(256)  NOT NULL,
                           [value] NVARCHAR(256) NOT NULL,
                           FOREIGN KEY ([memory_id]) REFERENCES {this.GetFullTableName(this._config.MemoryTableName)}([id])
                       );

                   IF OBJECT_ID(N'{this.GetFullTableName($"{this._config.EmbeddingsTableName}_{index}")}', N'U') IS NULL
                       CREATE TABLE {this.GetFullTableName($"{this._config.EmbeddingsTableName}_{index}")}
                       (
                           [memory_id] UNIQUEIDENTIFIER NOT NULL,
                           [vector_value_id] [int] NOT NULL,
                           [vector_value] [float] NOT NULL,
                           FOREIGN KEY ([memory_id]) REFERENCES {this.GetFullTableName(this._config.MemoryTableName)}([id])
                       );

                   IF OBJECT_ID(N'[{this._config.Schema}.IXC_{$"{this._config.EmbeddingsTableName}_{index}"}]', N'U') IS NULL
                       CREATE CLUSTERED COLUMNSTORE INDEX [IXC_{$"{this._config.EmbeddingsTableName}_{index}"}]
                       ON {this.GetFullTableName($"{this._config.EmbeddingsTableName}_{index}")}
                       {(sqlServerVersion >= 16 ? "ORDER ([memory_id])" : "")};

               COMMIT;
               """;

    return sql;
}

What happened?

What happens now:

If tag is longer than 256 nvarchar, the ingestion fails in unfinished state, preventing file to be saved to the SQL table, which blocks further management. Also causes KernelMemory to go trough full retry cycle, unnecessarily spending processing resources.

error: String or binary data would be truncated in table 'ADS_GENAI.dbo.KMMemoriesTags_botone', column 'value'. Truncated value: 'https://firm.sharepoint.com/sites/BontactOneCS/BontactOne%20services%20and%20applications/Complia'. The statement has been terminated.

Expected behavior:

It will properly process file with tags longer than 256 chars without any failures.

Importance

a fix would make my life easier

Platform, Language, Versions

C#, net core 8

Relevant log output

String or binary data would be truncated in table 'ADS_GENAI.dbo.KMMemoriesTags_botone', column 'value'. Truncated value: 'https://firm.sharepoint.com/sites/BontactOneCS/BontactOne%20services%20and%20applications/Complia'. The statement has been terminated.

Microsoft.Data.SqlClient.SqlException: at Microsoft.Data.SqlClient.SqlConnection.OnError (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.TdsParser.TryRun (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5) at System.Threading.Tasks.TaskFactory1.FromAsyncCoreLogic (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.ConfiguredTaskAwaitable1+ConfiguredTaskAwaiter.GetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at Microsoft.KernelMemory.MemoryDb.SQLServer.SqlServerMemory+d14.MoveNext (Microsoft.KernelMemory.MemoryDb.SQLServer, Version=0.73.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at Microsoft.KernelMemory.MemoryDb.SQLServer.SqlServerMemory+d14.MoveNext (Microsoft.KernelMemory.MemoryDb.SQLServer, Version=0.73.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore1.ThrowForFailedGetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore1.GetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at Microsoft.KernelMemory.MemoryDb.SQLServer.SqlServerMemory+d13.MoveNext (Microsoft.KernelMemory.MemoryDb.SQLServer, Version=0.73.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at Microsoft.KernelMemory.MemoryDb.SQLServer.SqlServerMemory+d13.MoveNext (Microsoft.KernelMemory.MemoryDb.SQLServer, Version=0.73.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at Microsoft.KernelMemory.Handlers.SaveRecordsHandler+d16.MoveNext (Microsoft.KernelMemory.Core, Version=0.73.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at Microsoft.KernelMemory.Handlers.SaveRecordsHandler+d13.MoveNext (Microsoft.KernelMemory.Core, Version=0.73.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator+d8.MoveNext (Microsoft.KernelMemory.Core, Version=0.73.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator+<>cDisplayClass5_0+<b0>d.MoveNext (Microsoft.KernelMemory.Core, Version=0.73.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e) at Microsoft.KernelMemory.Orchestration.AzureQueues.AzureQueuesPipeline+<>cDisplayClass20_0+<b__0>d.MoveNext (Microsoft.KernelMemory.Orchestration.AzureQueues, Version=0.73.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3)

marcominerva commented 1 month ago

I think this is somewhat related to https://github.com/microsoft/kernel-memory/discussions/781. In particular, we probably should add indexes to name and value columns in Tags table to improve query performance (this table will have at least 6 rows for each partition, plus one row for each single tag we define).

As we can read at https://learn.microsoft.com/en-us/sql/sql-server/maximum-capacity-specifications-for-sql-server#-objects, regarding index size:

900 bytes for a clustered index. 1,700 bytes for a nonclustered index. For SQL Server 2014 (12.x) and earlier, all versions supported 900 bytes for all index types.

So, if we want to make the name and value columns indexable, they can have a maximum length of 850 (given that they are NVARCHAR).

dluc commented 1 month ago

This is problematic for tags longer that this restriction, in our case hyperlink, but can be path, short description, anything really.

Could you elaborate why you're using tags to store this kind of information? It doesn't look like a typical scenario, e.g. tags are designed to store key-values for filtering.

If you are filtering by hyperlinks, paths, descriptions, you could consider storing a SHA256 hash instead (32 bytes), saving a lot of space.

bossbast1 commented 1 month ago

Hello @dluc, sure. And thank you for looking into it.

We use SQL as our data store for the front-end (FE). This is our only SQL model (no custom SQL tables). This is intentional, to avoid overcomplicating the structure and prevent duplicate entries between native and our own tables. While operations might be more costly, at the current stage, this is less of a concern for us compared to database complexity.

With your suggestion, we would need to create a separate table to store real values and map them to hashes. Additionally, not all tags are used for searching; some are simply meant as additional information that might be used for synchronization or front-end controls (e.g., the link takes you to the source URL directly, not meant for search, same for the description).

I understand that this adds clutter to the Search Service, but again, this is not a major concern for us now.

As the tool is used for managing many different indexes—each with different tags, RAG filters, control tags, and support tags—it’s beneficial to have one SQL source handling everything. Also in the tool users can define any new tag they want, fully from FE.

Finally, the reason I considered this a semi-"bug" is that neither the storage account nor the Search Service has this restriction, so it’s only SQL performance improvement that is "artificially" enforcing this length limit. That’s why I'm not suggesting we just set MAX, but rather, it would be great if we had the ability to override it ourselves to a bigger value.

PS: If SQL is being reworked as @marcominerva mentioned, a length of 850 would be sufficient for us (if that is length of one tag key-value pair, not length for the array of all values for that one key). I would say 256 should be enough for 90% of values, 512 for 98%, and 850 for 99.99%.

dluc commented 1 month ago

I appreciate the need to simplify local infrastructure, but this approach involves compromises and does not align well with the intended use cases. Tags are specifically designed for filtering, making this scenario more suitable as a feature request to include file metadata during document uploads.

An alternative approach I would consider, is using the "Request Context" to pass custom values and implementing a custom handler at the end of the pipeline to store this information in SQL storage, such as in the MemoryRecord.Payload dictionary or another table. However, this won’t address the core issue of using the index for data storage, leading to payload duplication across chunks.

For clarity, KM memory storage is designed as a search index, containing only the subset of data necessary for RAG. Storing additional data may solve one problem but introduces new challenges like increased latency, higher CPU and storage usage, and increased costs.