fix: postgres search query does not use index

roldengarm commented 3 months ago

Motivation and Context (Why the change? What's the scenario?)

I've imported about 6 million documents with PostgresDB as backend. I've created an HNSW index. When doing a search (using WebClient.Ask/Search) I keep getting timeouts. Upon further investigation, I saw that the query used in Kernel Memory is not optimisable as it's doing an inline calculation of "1 - the difference".

After I removed the "1 - " in our fork & deployed it, it was successfully using the index and I would get a response within 5 - 10 seconds. I've used pgAdmin4 to confirm that the new query is using the index.

See also discussion with @dluc here: https://github.com/microsoft/kernel-memory/discussions/663#discussioncomment-9952593

marcominerva commented 3 months ago

I think that, as this PR changes the returned similarityActualValue from PostgresSQL, also the minSimilarityusage must be adjusted.

In fact, going deep in the code, I see this:

https://github.com/microsoft/kernel-memory/blob/3d34260ae513af48030da9a56aa50b8e0162c6f8/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs#L419

https://github.com/microsoft/kernel-memory/blob/3d34260ae513af48030da9a56aa50b8e0162c6f8/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs#L432-L448

In particular, the foreach adds sqlUserValues (included similarityPlaceholder= "@__min_similarity") to query parameters, but this parameter seems not to be used.

Instead, records are filtered by similarity after the query execution:

https://github.com/microsoft/kernel-memory/blob/3d34260ae513af48030da9a56aa50b8e0162c6f8/extensions/Postgres/Postgres/Internals/PostgresDbClient.cs#L458-L469

roldengarm commented 3 months ago

Thanks for checking @marcominerva I think you're right. Perhaps it worked at my end as minSimilarity was 0, haven't checked. But yes, I guess the check would have to be flipped.

But, wouldn't it be better & easier to read if the minSimilarity check would be part of the query? Then it could also be optimised by Postgres, instead of having an in-memory search / filtering afterwards. Would need to check filterSql a bit further, as in my test case it was empty / TRUE.

marcominerva commented 3 months ago

But, wouldn't it be better & easier to read if the minSimilarity check would be part of the query? Then it could also be optimised by Postgres, instead of having an in-memory search / filtering afterwards. Would need to check filterSql a bit further, as in my test case it was empty / TRUE.

Yes, it is exactly what I mean when I said that minSimilarity parameter is currently not used in the query.

roldengarm commented 3 months ago

But, wouldn't it be better & easier to read if the minSimilarity check would be part of the query? Then it could also be optimised by Postgres, instead of having an in-memory search / filtering afterwards. Would need to check filterSql a bit further, as in my test case it was empty / TRUE.

Yes, it is exactly what I mean when I said that minSimilarity parameter is currently not used in the query.

Hi @marcominerva I've made the changes you recommended, the difference filter is now being done in the SQL query. For some reason, I could not do "WHERE {colDifference} < {maxDifference}", that was causing a "table 'km-default'" can not be found.

The query still performs well, but on my large dataset I'm getting different results now. I'll investigate further as obviously the results should be similar. If you have a chance to review and see any obvious mistakes, that would be great.

marcominerva commented 3 months ago

What you have called difference is actually the distance (the closer it is to 0, the closer the vectors themselves are). I suggest you to rename the code in this way to better specify the intent.

Moreover, Kernel Memory works with the concept of relevance (the closer it is to 1, more relevant the vectors are). All memories return records with the most relevance first. See for example:

https://github.com/microsoft/kernel-memory/blob/3d34260ae513af48030da9a56aa50b8e0162c6f8/service/Core/Search/SearchClient.cs#L227-L238

Instead, now in you're code the result list is ordered in an ascending way based on the distance (even if you calculate the similarity, the list is created reading the records that are returned by the query). So, I think you need to reverse the result list.

roldengarm commented 3 months ago

What you have called difference is actually the distance (the closer it is to 0, the closer the vectors themselves are). I suggest you to rename the code in this way to better specify the intent.

Thanks, I've changed it to distance.

Instead, now in you're code the result list is ordered in an ascending way based on the distance (even if you calculate the similarity, the list is created reading the records that are returned by the query). So, I think you need to reverse the result list.

I have already reversed the ORDER BY from DESC (on similarity) to ASC (on distance), so I think that's correct. Upon further investigation it turns out the results were not different. Can you please do a review and see if it can be merged? @marcominerva

roldengarm commented 3 months ago

@microsoft-github-policy-service agree [company="GenText"]

roldengarm commented 3 months ago

@microsoft-github-policy-service agree company="GenText"

marcominerva commented 3 months ago

I have already reversed the ORDER BY from DESC (on similarity) to ASC (on distance), so I think that's correct. Upon further investigation it turns out the results were not different. Can you please do a review and see if it can be merged? @marcominerva

I too have made a test with the old and the new approach and I have verified that now results are returned in the correct order. I would suggest only a last minor fix: rename the colDistance variable using __ as prefix, as in the original code:

string colDistance = "__distance";

Remember also to correct the comment at line 437 (you're still using colDifference).

After that, the code seems OK to me. Now we must wait for the approval by @dluc.

dluc commented 3 months ago

could you grant maintainers edit on the PR?

roldengarm commented 2 months ago

could you grant maintainers edit on the PR?

Sorry @dluc , that option isn't there, I guess because it's not a user-owned fork, but an org owned? Unless I'm missing something, but I believe the option should appear here on the right at the bottom based on Github Docs.

dluc commented 2 months ago

could you grant maintainers edit on the PR?

Sorry @dluc , that option isn't there, I guess because it's not a user-owned fork, but an org owned? Unless I'm missing something, but I believe the option should appear here on the right at the bottom based on Github Docs.

please upvote https://github.com/orgs/community/discussions/5634

microsoft / kernel-memory

fix: postgres search query does not use index #684

Motivation and Context (Why the change? What's the scenario?)