run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: The ts_rank score on Postgres is relatively low compared to the Embedding Distance score from PgVector, which causes problems in Hybrid Search #10576

Closed rendyfebry closed 5 months ago

rendyfebry commented 6 months ago

Bug Description

There are two problems with Postgres Hybrid Search.

  1. The score from ts_rank used by Keyword Search is relatively low compared to the distance used by Vector Similarity Search. As a result, Keyword Search results always rank below Vector Similarity Search results, or worse, they are removed completely if you use a Minimum Similarity PostProcessor.
  2. The score range from ts_rank is pretty wide; please see my example below.

Need to find a way to normalize both of them.
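
To illustrate the first problem, here is a minimal Python sketch. The keyword scores are taken from the ts_rank results in this report; the vector similarity scores and the 0.7 cutoff are made-up values for illustration, and `apply_cutoff` is a hypothetical helper, not a LlamaIndex API.

```python
# Keyword scores come from the ts_rank output in this report.
# Vector scores and the cutoff are assumed values, for illustration only.
keyword_hits = {"doc_fox": 0.06079271, "doc_suleyman": 0.09910322}
vector_hits = {"doc_a": 0.82, "doc_b": 0.79}


def apply_cutoff(hits: dict[str, float], cutoff: float) -> dict[str, float]:
    """Keep only hits whose score reaches the cutoff (hypothetical helper)."""
    return {doc: score for doc, score in hits.items() if score >= cutoff}


# Merging raw scores from both searches and then filtering by one threshold
# drops every keyword hit, because ts_rank values sit on a much lower scale.
merged = {**keyword_hits, **vector_hits}
survivors = apply_cutoff(merged, cutoff=0.7)
print(sorted(survivors))
```

With these numbers only the two vector hits survive, even though both keyword hits are genuine matches.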

Version

latest

Steps to Reproduce

Run this query on your Postgres DB, and compare the scores with your regular PgVector query results.

SELECT
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy dog'), 
    to_tsquery('fox')
  ) AS mentioned_once_short_sentence,
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy fox'), 
    to_tsquery('fox')
  ) AS mentioned_twice_short_sentence,
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy dog. OpenAI is not the only firm in its industry with an odd structure. Anthropic, created by rebels from OpenAI, and Inflection AI (whose co-founder, Mustafa Suleyman, is a board member of The Economist’s parent company)'), 
    to_tsquery('anthropic')
  ) AS mentioned_once_long_sentence_1,
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy dog. OpenAI is not the only firm in its industry with an odd structure. Anthropic, created by rebels from OpenAI, and Inflection AI (whose co-founder, Mustafa Suleyman, is a board member of The Economist’s parent company)'), 
    plainto_tsquery('english', 'Mustafa')
  ) AS mentioned_once_long_sentence_2a,
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy dog. OpenAI is not the only firm in its industry with an odd structure. Anthropic, created by rebels from OpenAI, and Inflection AI (whose co-founder, Mustafa Suleyman, is a board member of The Economist’s parent company)'), 
    plainto_tsquery('english', 'Suleyman')
  ) AS mentioned_once_long_sentence_2b,
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy dog. OpenAI is not the only firm in its industry with an odd structure. Anthropic, created by rebels from OpenAI, and Inflection AI (whose co-founder, Mustafa Suleyman, is a board member of The Economist’s parent company)'), 
    plainto_tsquery('english', 'Mustafa Suleyman')
  ) AS mentioned_once_long_sentence_2c,
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy dog. OpenAI is not the only firm in its industry with an odd structure. Anthropic, created by rebels from OpenAI, and Inflection AI (whose co-founder, Mustafa Suleyman, is a board member of The Economist’s parent company)'), 
    to_tsquery('OpenAI')
  ) AS mentioned_twice_long_sentence,
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy dog. OpenAI is not the only firm in its industry with an odd structure. Anthropic, created by rebels from OpenAI, and Inflection AI (whose co-founder, Mustafa Suleyman, is a board member of The Economist’s parent company)'), 
    plainto_tsquery('english', 'Altman')
  ) AS never_mentioned_1,
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy dog. OpenAI is not the only firm in its industry with an odd structure. Anthropic, created by rebels from OpenAI, and Inflection AI (whose co-founder, Mustafa Suleyman, is a board member of The Economist’s parent company)'), 
    plainto_tsquery('english', 'satya nadella')
  ) AS never_mentioned_2,
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy dog. OpenAI is not the only firm in its industry with an odd structure. Anthropic, created by rebels from OpenAI, and Inflection AI (whose co-founder, Mustafa Suleyman, is a board member of The Economist’s parent company)'), 
    to_tsquery('Fox | Altman')
  ) AS never_mentioned_combined,
  ts_rank(
    to_tsvector('dog dog dog dog dog'), 
    to_tsquery('dog')
  ) AS all_5_repeated_word,
  ts_rank(
    to_tsvector('dog dog dog dog dog dog dog dog dog dog'), 
    to_tsquery('dog')
  ) AS all_10_repeated_word,
  ts_rank(
    to_tsvector('dog dog dog dog dog dog dog dog dog fox'), 
    to_tsquery('dog')
  ) AS all_9_repeated_word_anomaly,
  ts_rank(
    to_tsvector('dog'), 
    to_tsquery('dog')
  ) AS perfect_match_1,
  ts_rank(
    to_tsvector('The quick brown fox jumped over the lazy fox'), 
    plainto_tsquery('english', 'The quick brown fox jumped over the lazy fox')
  ) AS perfect_match_2,
   ts_rank(
    to_tsvector('Editor’s note (November 22nd 2023): OpenAI said it had agreed “in principle” that Sam Altman would rejoin the artificial-intelligence firm as its chief executive under a new board. 
    “The mission continues,” tweeted Sam Altman, the co-founder of , the startup behind ChatGPT, on November 19th. But precisely where it will continue remains unclear. Mr Altman’s tweet was part of an announcement that he was joining Microsoft. Two days earlier, to the astonishment of Silicon Valley,  from Openai for not being “consistently candid in his communications with the board”. Then Satya Nadella, Microsoft’s boss, announced that Mr Altman would “lead a new advanced AI [artificial intelligence] research team” within the tech giant. At first it looked like Mr Altman would be accompanied by just a few former colleagues. Many more may follow. The vast majority of OpenAI’s 770 staff have signed a letter threatening to resign if the board fails to reinstate Mr Altman. 
    The shenanigans involving the world’s hottest startup are not over. The Verge, a tech-focused online publication, has reported that Mr Altman may be willing to return to OpenAI, if the board members responsible for his dismissal themselves resign. Mr Nadella also seems to allow for that possibility. His manoeuvring could look shrewd either way. If Mr Altman returns, then Microsoft, Openai’s biggest investor, would have supported him at a time of crisis, strengthening an important corporate relationship. If Mr Altman and friends do join Microsoft, Mr Nadella could look even smarter. He would have brought in house the talent and technology that the world’s second-most valuable company is betting its future on. 
    Microsoft has long invested in various forms of AI. It first announced it was working with OpenAI in 2016, and has since invested $13bn in the startup for what is reported to be a 49% stake. The deal means that Openai’s technology has to run on Azure, Microsoft’s cloud-computing arm. In exchange OpenAI has access to enormous amounts of Microsoft’s processing power, which it needs to “train” its powerful models. 
    The investment became crucial to Microsoft one year ago with the launch of ChatGPT. The chatbot became the fastest-growing consumer software application in history, reaching 100m users in two months. Since then Microsoft has been busy working out how to infuse the startup’s technology into its software. It has launched ChatGPT-like bots to run alongside many of its offerings, including its productivity tools, such as Word and Excel; Bing, its search engine; and even its Windows operating system. 
    Bringing parts of OpenAI in-house would be a smart move. The technology is central to Microsoft’s future. Having direct control over it eliminates the risk that OpenAI could take its technology in a different direction. And such influence would have been attained for a bargain. Before he was fired, Mr Altman was hoping to raise fresh funds for OpenAI that would value the firm at around $86bn. Hiring OpenAI’s boffins this way is something antitrust regulators would find harder to challenge than a straightforward acquisition. Investors appear keen. Microsoft’s share price fell slightly on the news of Mr Altman’s firing. That loss was reversed when his new gig was announced. 
    Yet the move would also entail risks. One is reputational. A pillar of Microsoft’s AI strategy has been to keep the technology at arm’s length, thus insulating the company from any embarrassment caused when ChatGPT goes awry. When Meta, Facebook’s parent company, released Galactica, its science AI chatbot, the tool started to fabricate research. The public response was critical enough for Meta to take it down.'), 
    plainto_tsquery('english', 'Why was Sam Altman ousted by the OpenAI board?')
  ) AS full_article_nlp_question,
  ts_rank(
    to_tsvector('Editor’s note (November 22nd 2023): OpenAI said it had agreed “in principle” that Sam Altman would rejoin the artificial-intelligence firm as its chief executive under a new board. 
    “The mission continues,” tweeted Sam Altman, the co-founder of , the startup behind ChatGPT, on November 19th. But precisely where it will continue remains unclear. Mr Altman’s tweet was part of an announcement that he was joining Microsoft. Two days earlier, to the astonishment of Silicon Valley,  from Openai for not being “consistently candid in his communications with the board”. Then Satya Nadella, Microsoft’s boss, announced that Mr Altman would “lead a new advanced AI [artificial intelligence] research team” within the tech giant. At first it looked like Mr Altman would be accompanied by just a few former colleagues. Many more may follow. The vast majority of OpenAI’s 770 staff have signed a letter threatening to resign if the board fails to reinstate Mr Altman. 
    The shenanigans involving the world’s hottest startup are not over. The Verge, a tech-focused online publication, has reported that Mr Altman may be willing to return to OpenAI, if the board members responsible for his dismissal themselves resign. Mr Nadella also seems to allow for that possibility. His manoeuvring could look shrewd either way. If Mr Altman returns, then Microsoft, Openai’s biggest investor, would have supported him at a time of crisis, strengthening an important corporate relationship. If Mr Altman and friends do join Microsoft, Mr Nadella could look even smarter. He would have brought in house the talent and technology that the world’s second-most valuable company is betting its future on. 
    Microsoft has long invested in various forms of AI. It first announced it was working with OpenAI in 2016, and has since invested $13bn in the startup for what is reported to be a 49% stake. The deal means that Openai’s technology has to run on Azure, Microsoft’s cloud-computing arm. In exchange OpenAI has access to enormous amounts of Microsoft’s processing power, which it needs to “train” its powerful models. 
    The investment became crucial to Microsoft one year ago with the launch of ChatGPT. The chatbot became the fastest-growing consumer software application in history, reaching 100m users in two months. Since then Microsoft has been busy working out how to infuse the startup’s technology into its software. It has launched ChatGPT-like bots to run alongside many of its offerings, including its productivity tools, such as Word and Excel; Bing, its search engine; and even its Windows operating system. 
    Bringing parts of OpenAI in-house would be a smart move. The technology is central to Microsoft’s future. Having direct control over it eliminates the risk that OpenAI could take its technology in a different direction. And such influence would have been attained for a bargain. Before he was fired, Mr Altman was hoping to raise fresh funds for OpenAI that would value the firm at around $86bn. Hiring OpenAI’s boffins this way is something antitrust regulators would find harder to challenge than a straightforward acquisition. Investors appear keen. Microsoft’s share price fell slightly on the news of Mr Altman’s firing. That loss was reversed when his new gig was announced. 
    Yet the move would also entail risks. One is reputational. A pillar of Microsoft’s AI strategy has been to keep the technology at arm’s length, thus insulating the company from any embarrassment caused when ChatGPT goes awry. When Meta, Facebook’s parent company, released Galactica, its science AI chatbot, the tool started to fabricate research. The public response was critical enough for Meta to take it down.'), 
    plainto_tsquery('english', 'Sam Altman')
  ) AS full_article_simple_keyword;

type                            | keyword(s)        | score       |
--------------------------------+-------------------+-------------+
mentioned_once_short_sentence   | fox               | 0.06079271  |
mentioned_twice_short_sentence  | fox               | 0.075990885 |
mentioned_once_long_sentence_1  | anthropic         | 0.06079271  |
mentioned_once_long_sentence_2a | Mustafa           | 0.06079271  |
mentioned_once_long_sentence_2b | Suleyman          | 0.06079271  |
mentioned_once_long_sentence_2c | Mustafa Suleyman  | 0.09910322  |
mentioned_once_long_sentence_2d | Suleyman Mustafa  | 0.09910322  |
mentioned_twice_long_sentence   | OpenAI            | 0.075990885 |
never_mentioned_1               | Altman            | 0.0         |
never_mentioned_2               | satya nadella     | 0           |
never_mentioned_combined        | Fox or Altman     | 0.030396355 |
all_5_repeated_word             | dog               | 0.08897689  |
all_10_repeated_word            | dog               | 0.09421459  |
all_9_repeated_word_anomaly     | dog               | 0.09360665  |
perfect_match_1                 | dog               | 0.06079271  |
perfect_match_2                 | full sentence     | 0.38170186  |
full_article_nlp_question       | nlp question      | 0.6724103   |
full_article_simple_keyword     | Sam Altman        | 0.19074538  |

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 6 months ago

🤖

Hey @rendyfebry, nice to see you diving into the depths of Postgres again! How's the coding journey treating you?

Based on the context provided, it seems you're trying to normalize the scores from ts_rank in Postgres Hybrid Search. One option to try is the ts_rank_cd function instead of ts_rank. ts_rank_cd uses cover density ranking, which takes the proximity of matching lexemes into account. Note that neither function bounds its score to the range 0 to 1 by default; both accept an optional integer normalization argument, and the value 32 rescales the rank to rank / (rank + 1), which always falls between 0 and 1.

Here's how you can modify the _build_sparse_query method in the PGVectorStore class to use ts_rank_cd:

def _build_sparse_query(
    self,
    query_str: Optional[str],
    limit: int,
    metadata_filters: Optional[MetadataFilters] = None,
) -> Any:
    from sqlalchemy import select, type_coerce
    from sqlalchemy.sql import func, text
    from sqlalchemy.types import UserDefinedType

    class REGCONFIG(UserDefinedType):
        def get_col_spec(self, **kw: Any) -> str:
            return "regconfig"

    if query_str is None:
        raise ValueError("query_str must be specified for a sparse vector query.")

    ts_query = func.plainto_tsquery(
        type_coerce(self.text_search_config, REGCONFIG), query_str
    )
    stmt = (
        select(  # type: ignore
            self._table_class.id,
            self._table_class.node_id,
            self._table_class.text,
            self._table_class.metadata_,
            func.ts_rank_cd(self._table_class.text_search_tsv, ts_query).label("rank"),  # use ts_rank_cd here
        )
        .where(self._table_class.text_search_tsv.op("@@")(ts_query))
        .order_by(text("rank desc"))
    )

    # type: ignore
    return self._apply_filters_and_limit(stmt, limit, metadata_filters)

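On its own, this change switches the ranking method but does not bound the scores. To scale them into the 0 to 1 range, the normalization argument (for example 32) would also need to be passed as a third argument to ts_rank_cd, and even then the scores may not line up with those from Vector Similarity Search.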

Currently, the LlamaIndex codebase uses the ts_rank function for scoring in Postgres Hybrid Search. This is evident from the _build_sparse_query method in the PGVectorStore class. Here, the ts_rank function is used to rank the results of a text search query. However, there is no usage of the ts_rank_cd function in the provided context.

Please note that this change should be tested thoroughly to ensure it doesn't introduce any unexpected behavior. If you encounter any issues or need further assistance, feel free to ask.

You can find more information about the ts_rank and ts_rank_cd functions in the PostgreSQL documentation.

logan-markewich commented 6 months ago

I think ideally, relative rank fusion is applied (either in the vector store class, or as a node postprocessor?)

This blog post details the two most common fusion algorithms: https://weaviate.io/blog/hybrid-search-fusion-algorithms
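
As a rough illustration of score-based fusion, here is a minimal Python sketch that min-max normalizes each result list and then combines them with a weight. The function names, the `alpha` weight, and the sample scores are assumptions for illustration; this is not the LlamaIndex or Weaviate implementation.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def relative_score_fusion(
    keyword: dict[str, float], vector: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores; alpha is the vector-side weight."""
    kw, vec = min_max(keyword), min_max(vector)
    fused = {
        doc: (1 - alpha) * kw.get(doc, 0.0) + alpha * vec.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores on very different scales, similar to ts_rank vs. similarity.
keyword_scores = {"a": 0.06, "b": 0.10, "c": 0.38}
vector_scores = {"a": 0.82, "c": 0.80, "d": 0.79}
fused_results = relative_score_fusion(keyword_scores, vector_scores)
print(fused_results)
```

Because each list is rescaled before combining, the low absolute values of the keyword scores no longer push those results to the bottom.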

rendyfebry commented 6 months ago

Exactly @logan-markewich, I was about to share that blog post.

BTW, I think this case is not exclusive to the Postgres VectorStore then. Do you have a list of which VectorStores do hybrid search at the DB level and which do it at the LlamaIndex level? And from there, how many already implement fusion/normalization?

rendyfebry commented 6 months ago

Oh BTW, I just read the Postgres Text Search documentation, and apparently, they also have a normalization option. https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING

I have tried it, but the result is nowhere close to the usual Vector Similarity Score range. Maybe you will have better luck.
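
For reference, one of the normalization options in the linked documentation (flag 32) divides the rank by itself plus one. The Python sketch below applies that formula to a few ts_rank values from the table above; the function name is made up for illustration, and it is an assumption that this option is among the ones tried here.

```python
def normalize_rank_32(rank: float) -> float:
    """Mimic PostgreSQL normalization option 32: rank / (rank + 1)."""
    return rank / (rank + 1.0)


# Raw ts_rank values copied from the results table in this issue.
raw_scores = {
    "mentioned_once_short_sentence": 0.06079271,
    "perfect_match_2": 0.38170186,
    "full_article_nlp_question": 0.6724103,
}

normalized = {name: normalize_rank_32(rank) for name, rank in raw_scores.items()}
for name, rank in raw_scores.items():
    print(f"{name}: {rank:.4f} -> {normalized[name]:.4f}")
```

The mapping keeps every score inside 0 to 1 and preserves the ordering, but it makes each positive score smaller, so low ts_rank values stay low.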

One more thing: PgVector actually has an example of Reciprocal Rank Fusion.
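
A minimal Python sketch of Reciprocal Rank Fusion follows. It ignores the raw scores and uses only each document's position in every result list. The function name and sample rankings are illustrative, and k=60 is a commonly used constant that is assumed here, not taken from the PgVector example.

```python
def reciprocal_rank_fusion(
    result_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) over every list in which a document appears."""
    fused: dict[str, float] = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Each list is ordered best-first; the underlying scores are not needed.
keyword_ranking = ["c", "b", "a"]
vector_ranking = ["a", "c", "d"]
rrf_results = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(rrf_results)
```

Since only positions matter, the scale mismatch between ts_rank and vector distance does not affect the fused order.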

logan-markewich commented 6 months ago

Reciprocal rank fusion is ok-ish, I think I personally like relative-rank a bit more. Just based off personal experience though haha.

There is a list here of all vector db features https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html#vector-store-options-feature-support

juleskuehn commented 5 months ago

I wouldn't say this is a bug, more of a feature request. I am working on a PR to add relative score fusion, which will solve your problem.