pgvector / pgvector-php

pgvector support for PHP

Do not return results if no matches found #13

Closed sarfraznawaz2005 closed 2 months ago

sarfraznawaz2005 commented 2 months ago

Thanks for the great package, first of all.

I am using L2 distance, but the issue I am running into is that even if I search for gibberish text, results are returned. Here is my migration:


DB::statement('CREATE EXTENSION IF NOT EXISTS vector');

Schema::create('documents', function (Blueprint $table) {
  $table->id();
  $table->longText('content');
  $table->string('content_hash', 64)->unique();
  $table->string('llm')->nullable();
  $table->vector('embedding_1536', 1536)->nullable(); // OpenAI
  $table->vector('embedding_768', 768)->nullable(); // Gemini
  $table->json('metadata');
  $table->string('filename');
  $table->timestamps();
});

DB::statement('CREATE INDEX ON documents USING hnsw (embedding_768 vector_l2_ops)');
DB::statement('CREATE INDEX ON documents USING hnsw (embedding_1536 vector_l2_ops)');

Model:

class Document extends Model
{
    use HasNeighbors;

    protected $guarded = [];

    protected $connection = null;

    protected $casts = [
        'embedding_1536' => Vector::class,
        'embedding_768' => Vector::class,
        'metadata' => 'array',
    ];
......

Insert Code:

$llm = config('doctalk.llm.llm_provider', 'gemini');

foreach ($texts as $text) {

    $content = $text['text'];
    $contentHash = hash('sha256', $content);

    $values = [
        'content' => $content,
        'llm' => $llm,
        'metadata' => $text['metadata'] ?? [],
        'filename' => $text['file']
    ];

    $embeddings = new Vector($text['embeddings']);

    if ($llm === 'openai') {
        $values['embedding_1536'] = $embeddings;
    } elseif ($llm === 'gemini') {
        $values['embedding_768'] = $embeddings;
    }

    Document::query()->updateOrCreate(['content_hash' => $contentHash], $values);
}

Fetch Code:

// openai or gemini
$queryEmbeddings = new Vector($queryEmbeddings['embeddings'][0]['values'] ?? $queryEmbeddings[0]['embedding']);

// Combine with ORDER BY and LIMIT to use an index
return Document::query()
    ->select(['id', 'content', 'llm', 'metadata'])
    ->selectRaw("$field <-> ? AS score", [$queryEmbeddings])
    ->orderByRaw('score ASC') // in L2, lower is better
    ->limit(5)
    ->get()
    ->toArray();

To avoid getting results for gibberish text, I tried adding the score field above so I could filter based on score. The issue then is that even for non-existing text, I see scores similar to those for matched records. Example:

Search String: sdfsdfsdfsdfsdfdgrytytu567658yumhgj5674323rdfbfghgfhfg
Exists in db: No
Score: 0.9454763916695315

Search String: who is kumar
Exists in db: Yes
Score: 0.9628448784970052

Search String: who is sarwana
Exists in db: Yes
Score: 0.913253360841312

Notice how close the scores are for existing and non-existing texts.

I tried with cosine and other distance operators, but the scores still don't make sense.

Probably I am doing something wrong. The reason I want this is to avoid sending a query to the LLM when no records fall under a certain threshold, and to get an opportunity to search via other methods if semantic search fails; I need to know that beforehand. Right now, I don't get a chance to run fallback search methods because the fetch code above always returns records, even for text that is not in any of the documents. If I could get the scoring right, it would be helpful.
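The kind of filter I'm trying to write looks roughly like this. It's only a sketch, and the $maxDistance value of 1.0 is a placeholder since I don't know what a sensible cutoff would be:

$maxDistance = 1.0; // placeholder threshold, not a known-good value

$results = Document::query()
    ->select(['id', 'content', 'llm', 'metadata'])
    ->selectRaw("$field <-> ? AS score", [$queryEmbeddings])
    ->whereRaw("$field <-> ? < ?", [$queryEmbeddings, $maxDistance]) // keep only rows that are "close enough"
    ->orderByRaw('score ASC')
    ->limit(5)
    ->get();

if ($results->isEmpty()) {
    // nothing close enough: skip the LLM call and run the fallback search methods instead
}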

Thanks for the help.

ankane commented 2 months ago

Hi @sarfraznawaz2005, I don't see anything wrong from the info above. I'd double check that you're passing the expected text to the LLMs for both documents and queries. If the text is longer, you may want to try different chunking strategies. Also, I'd remove any vector indexes while debugging so exact search is used.

Edit: You may also want to add ->where('llm', ...) to the query to ensure you're not comparing embeddings across different models (which would produce meaningless scores).
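For example, a sketch based on the fetch code above:

Document::query()
    ->select(['id', 'content', 'llm', 'metadata'])
    ->selectRaw("$field <-> ? AS score", [$queryEmbeddings])
    ->where('llm', $llm) // only compare vectors produced by the same embedding model
    ->orderByRaw('score ASC')
    ->limit(5)
    ->get();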

sarfraznawaz2005 commented 2 months ago

@ankane thanks for the reply.

I tried different chunk sizes, same result. Yes, the results given by Document::query() below are passed to the LLM, and for the query, embeddings are generated via the LLM and passed to the same Document::query() code below. For the code below:

Document::query()
    ->select(['id', 'content', 'llm', 'metadata'])
    ->selectRaw("$field <-> ? AS score", [$queryEmbeddings])
    ->orderByRaw('score ASC') // in L2, lower is better
    ->limit(5)
    ->get()

The actual query becomes:

select "id", "content", "llm", "metadata", embedding_768 <-> '[-0.042995043,0.020061782,-0.012362629,0.037271198,0.014026179,0.052065067, until 768]' AS score from "documents" order by score ASC limit 5

In the above query I passed the gibberish text sdfsdfsdfsdfsdfdgrytytu567658yumhgj5674323rdfbfghgfhfg and it still gave results, as shown below:

[screenshot of the returned rows]

For now, llm is always gemini and that's the only LLM being used currently, so I don't think adding that to the where clause would help.

I also tried without indexing; the scores still don't make sense. I also verified that the correct embeddings are saved for each piece of text by generating embeddings for different texts and comparing them with what is saved in the db.

Is there some other way to modify the query, or some other approach, so that I can get correct scoring or filter based on score?

Here is the full code if that helps:

Migration: https://github.com/sarfraznawaz2005/docchat/blob/main/src/Migrations/2024_08_03_000000_create_tables.php#L10

Add Docs to DB: https://github.com/sarfraznawaz2005/docchat/blob/main/src/Models/Document.php#L45

Fetch: https://github.com/sarfraznawaz2005/docchat/blob/main/src/Services/LLMUtilities.php#L103

ankane commented 2 months ago

The query will return the correct score based on the vectors and distance function.

It looks like you're using the SEMANTIC_SIMILARITY task type instead of RETRIEVAL_DOCUMENT and RETRIEVAL_QUERY for Gemini, which could be the issue. https://ai.google.dev/gemini-api/tutorials/document_search
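For reference, a rough sketch of passing the task type when requesting embeddings; the endpoint, model name, and config key below are assumptions based on the linked Gemini docs, not part of pgvector-php:

use Illuminate\Support\Facades\Http;

// Hypothetical helper: RETRIEVAL_DOCUMENT when indexing, RETRIEVAL_QUERY when searching.
function geminiEmbed(string $text, string $taskType): array
{
    $response = Http::post(
        'https://generativelanguage.googleapis.com/v1beta/models/text-embedding-004:embedContent?key=' . config('services.gemini.key'),
        [
            'content' => ['parts' => [['text' => $text]]],
            'taskType' => $taskType,
        ]
    );

    return $response->json('embedding.values');
}

// At index time:
$values['embedding_768'] = new Vector(geminiEmbed($content, 'RETRIEVAL_DOCUMENT'));

// At query time:
$queryEmbeddings = new Vector(geminiEmbed($question, 'RETRIEVAL_QUERY'));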

Please use other resources for additional help (as this isn't an issue with pgvector).

sarfraznawaz2005 commented 2 months ago

@ankane thanks for the reply. It seems the issue is something else; I will figure it out. Thanks for the help.