spring-projects / spring-ai

An Application Framework for AI Engineering
https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/index.html
Apache License 2.0

Similarity Search Limited to First 10 Chunks Only (10 hits) #1090

Closed zelhaddioui closed 1 month ago

zelhaddioui commented 1 month ago

Description:

I am encountering an issue with the ElasticsearchVectorStore class when performing a similarity search. Specifically, when I execute a search with a topK value set to 2, it seems to only apply the search to the first 10 chunks stored in Elasticsearch, rather than considering all the chunks.

Details:

Library Version: 1.0.0
Elasticsearch Version: 8.13.3
Code Example:
List<Document> similarDocuments = vectorStore.similaritySearch(
    SearchRequest.query(message).withTopK(2)
);

Issue Observed:

When executing the above code, I expect to retrieve the top 2 most similar documents from all available chunks in Elasticsearch. However, it appears that the search is only applied to the first 10 chunks stored in Elasticsearch, rather than considering all chunks.

Additional Information:

I suspect that the issue might be related to how Elasticsearch pagination is handled or a limitation in the current implementation of the similarity search method. I would appreciate any guidance or fixes to ensure that the search applies to all chunks stored in Elasticsearch.

zelhaddioui commented 1 month ago

I think the cause of the issue is that the similarity search method is limiting the number of candidates considered (numCandidates) in the Elasticsearch KNN query, which prevents it from evaluating all possible documents in the index.
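The effect described above can be illustrated with a toy example. This is not the library's k-NN code, just a crude stand-in in which only the first `poolSize` vectors are scored before the top-k are selected; with `k = 2` the quoted `1.5 * topK` formula yields a pool of only 3 candidates, so good matches outside that pool are never seen:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CandidatePoolDemo {

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Score only the first `poolSize` vectors, then keep the best k of those.
    // A crude stand-in for a k-NN search with a too-small candidate pool;
    // real HNSW search explores graph neighbours, not a prefix of the index.
    static List<Integer> topK(double[] query, double[][] index, int k, int poolSize) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < Math.min(poolSize, index.length); i++) ids.add(i);
        ids.sort(Comparator.comparingDouble((Integer i) -> -dot(query, index[i])));
        return ids.subList(0, k);
    }

    public static void main(String[] args) {
        double[] query = {1.0, 0.0};
        double[][] index = new double[12][];
        for (int i = 0; i < 10; i++) index[i] = new double[]{0.1, 1.0};
        index[10] = new double[]{0.9, 0.1}; // the true nearest neighbours
        index[11] = new double[]{1.0, 0.0}; // sit beyond the small pool

        int k = 2;
        int smallPool = (int) (1.5 * k); // 3, mirroring 1.5 * topK
        System.out.println("pool=3:   " + topK(query, index, k, smallPool));
        System.out.println("pool=all: " + topK(query, index, k, index.length));
    }
}
```

With the small pool, the two best documents (ids 10 and 11) are never scored; scoring the whole index finds them.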

eljid-oussama commented 1 month ago

Yup, I have the same issue. I think the Elasticsearch algorithm must be changed.

inpink commented 1 month ago

Hello, this is an interesting issue.

I have written test code in an environment with Elasticsearch version 8.13.3 that includes more than 10 documents, and I have confirmed that the tests pass.

I referred to the code in the ElasticsearchVectorStoreIT class.

I am curious about your setup. I might not have properly understood your situation, so could you please provide more details about the problem you are facing? 😄

@Testcontainers
@EnabledIfEnvironmentVariable(named = "OPENAI_API_KEY", matches = ".+")
public class ElasticsearchVectorStoreIT {

    @Container
    private static final ElasticsearchContainer elasticsearchContainer = new ElasticsearchContainer(
            "docker.elastic.co/elasticsearch/elasticsearch:8.13.3")
        .withEnv("xpack.security.enabled", "false");

    private final List<Document> documents = List.of(
            new Document("1", "Document content aa", Map.of("meta1", "value1")),
            new Document("2", "Document content aa", Map.of("meta1", "value1")),
            new Document("3", "Document content aa", Map.of("meta1", "value1")),
            new Document("4", "Document content aa", Map.of("meta1", "value1")),
            new Document("5", "Document content aaa", Map.of("meta1", "value1")),
            new Document("6", "Document content aa", Map.of("meta1", "value1")),
            new Document("7", "Document content aa", Map.of("meta1", "value1")),
            new Document("8", "Document content aa", Map.of("meta1", "value1")),
            new Document("9", "Document content aa", Map.of("meta1", "value1")),
            new Document("10", "Document content aa", Map.of("meta1", "value1")),
            new Document("11", "Document content aa", Map.of("meta1", "value1")),
            new Document("12", "Document content aa", Map.of("meta1", "value1")),
            new Document("13", "Document content aa", Map.of("meta1", "value1")),
            new Document("14", "Document content aa", Map.of("meta1", "value1")),
            new Document("15", "Document content aa", Map.of("meta1", "value1")),
            new Document("16", "Document content aaa", Map.of("meta1", "value1")),
            new Document("17", "Document content aa", Map.of("meta1", "value1")),
            new Document("18", "Document content aa", Map.of("meta1", "value1")),
            new Document("19", "Document content aa", Map.of("meta1", "value1")),
            new Document("20", "Document content aa", Map.of("meta1", "value1")),
            new Document("21", "Document content aa", Map.of("meta1", "value1")),
            new Document("22", "Document content aa", Map.of("meta1", "value1")),
            new Document("23", "Document content aa", Map.of("meta1", "value1")),
            new Document("24", "Document content aa", Map.of("meta1", "value1"))
    );

    @BeforeAll
    public static void beforeAll() {
        Awaitility.setDefaultPollInterval(2, TimeUnit.SECONDS);
        Awaitility.setDefaultPollDelay(Duration.ZERO);
        Awaitility.setDefaultTimeout(Duration.ofMinutes(1));
    }

    private String getText(String uri) {
        var resource = new DefaultResourceLoader().getResource(uri);
        try {
            return resource.getContentAsString(StandardCharsets.UTF_8);
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private ApplicationContextRunner getContextRunner() {
        return new ApplicationContextRunner().withUserConfiguration(TestApplication.class);
    }

    @BeforeEach
    void cleanDatabase() {
        getContextRunner().run(context -> {
            // deleting indices and data before following tests
            ElasticsearchClient elasticsearchClient = context.getBean(ElasticsearchClient.class);
            List<String> indices = elasticsearchClient.cat().indices().valueBody().stream().map(IndicesRecord::index).toList();
            if (!indices.isEmpty()) {
                elasticsearchClient.indices().delete(del -> del.index(indices));
            }
        });
    }

    @ParameterizedTest(name = "{0} : {displayName} ")
    @ValueSource(strings = { "cosine", "l2_norm", "dot_product" })
    void similaritySearchConsidersAllChunks(String similarityFunction) {

        getContextRunner().run(context -> {

            ElasticsearchVectorStore vectorStore = context.getBean("vectorStore_" + similarityFunction,
                    ElasticsearchVectorStore.class);

            vectorStore.add(documents);

            Awaitility.await().until(() -> vectorStore.similaritySearch(SearchRequest.query("Document content").withTopK(2)), hasSize(2));

            List<Document> results = vectorStore.similaritySearch(SearchRequest.query("Document content aaa").withTopK(2));
            assertThat(results).hasSize(2);

            // Verify that the results are from the entire set of documents, not just the first 10
            assertThat(results).extracting(Document::getId).containsExactlyInAnyOrder("5", "16");
        });
    }

 ...
zelhaddioui commented 1 month ago

I'm trying to build a Retrieval-Augmented Generation (RAG) system using Elasticsearch as the database. Here is my VectorStoreConfig:

@Configuration
public class VectorStoreConfig {

    private static final Logger logger = LoggerFactory.getLogger(VectorStoreConfig.class);

    @Bean
    public RestClient restClient() {
        return RestClient.builder(
                new HttpHost("localhost", 9200, "http")
        ).build();
    }

    @Bean
    public ElasticsearchClient elasticsearchClient(RestClient restClient) {
        return new ElasticsearchClient(new RestClientTransport(restClient, new JacksonJsonpMapper(new ObjectMapper())));
    }

    @Bean
    public VectorStore vectorStoreEs(RestClient restClient, EmbeddingModel embeddingModel) {
        ElasticsearchVectorStoreOptions options = new ElasticsearchVectorStoreOptions();
        return new ElasticsearchVectorStore(options, restClient, embeddingModel, true);
    }
}

However, when I use the similaritySearch function in Elasticsearch, it only searches through the first 10 chunks. For example, if I have 34 chunks, even though I can verify that all of them are in Elasticsearch, when I ask a question about a chunk that is not in the first 10, it doesn't find the similar chunks.

Here's the code where I call the similarity search:

    List<Document> similarDocuments = vectorStore.similaritySearch(SearchRequest.query(message).withTopK(2));
    log.info("Similar documents: {}", similarDocuments);

I think the problem is linked to this part of the similaritySearch function:

SearchResponse<Document> res = elasticsearchClient.search(
        sr -> sr.index(options.getIndexName())
            .knn(knn -> knn.queryVector(vectors)
                .similarity(finalThreshold)
                .k((long) searchRequest.getTopK())
                .field("embedding")
                .numCandidates((long) (1.5 * searchRequest.getTopK()))
                .filter(fl -> fl.queryString(
                        qs -> qs.query(getElasticsearchQueryString(searchRequest.getFilterExpression()))))),
        Document.class);

Any help on why similaritySearch isn't applied to the entire collection would be greatly appreciated.

inpink commented 1 month ago

@zelhaddioui
Thank you very much for your detailed response. It’s quite fascinating. Could you please share what kind of data you have put into Elasticsearch? I would like to test it once more myself.

zelhaddioui commented 1 month ago

@inpink I have primarily used PDF files that contain only text. These files are processed to extract the text content, which is then indexed in Elasticsearch. Each document in Elasticsearch corresponds to a chunk of text extracted from these PDFs.
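The pipeline described above (PDF → extracted text → chunks → Elasticsearch documents) can be sketched roughly as follows. The chunk size and overlap here are illustrative assumptions, not the values used in the original setup:

```java
import java.util.ArrayList;
import java.util.List;

public class TextChunker {

    // Split extracted text into overlapping fixed-size chunks.
    // chunkSize and overlap are illustrative values, not the original config.
    public static List<String> chunk(String text, int chunkSize, int overlap) {
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break;
        }
        return chunks;
    }
}
```

Each returned chunk would then be wrapped in a `Document` and passed to `vectorStore.add(...)`, as in the test code earlier in this thread.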

inpink commented 1 month ago

@zelhaddioui Thank you. Could you also share the actual query message and data that you have put into Elasticsearch (like the "Document content aa" and Map.of("meta1", "value1") in my test code)? I'd like to insert the same data into my Elasticsearch, run the code you provided, and see if I encounter the same issue. Thank you for your detailed response.

zelhaddioui commented 1 month ago

I realised today that the problem isn't with Elasticsearch: it wasn't able to detect the similar chunks because I'm using a weak embedding model.
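That conclusion can be sanity-checked directly: a reasonable embedding model should give a noticeably higher cosine similarity for related texts than for unrelated ones. Here is a minimal cosine helper; the vectors are made-up stand-ins for real embedding output (in practice they would come from something like `embeddingModel.embed(...)`):

```java
public class CosineCheck {

    // Cosine similarity between two embedding vectors.
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Stand-in vectors; real embeddings have hundreds of dimensions.
        float[] query     = {0.9f, 0.1f, 0.0f};
        float[] related   = {0.8f, 0.2f, 0.1f};
        float[] unrelated = {0.0f, 0.1f, 0.9f};
        System.out.printf("related: %.3f, unrelated: %.3f%n",
                cosine(query, related), cosine(query, unrelated));
    }
}
```

If a query and the chunk it should match score barely higher than unrelated chunks, the embedding model (not the vector store) is the weak link.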