qdrant / qdrant

Qdrant - High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/
https://qdrant.tech
Apache License 2.0

Full-text matching does not seem to have any effect #5045

Open SevenMpp opened 2 months ago

SevenMpp commented 2 months ago

When using full-text match, I expect points containing the matching string to be hit and ranked at the top, but the match currently does not seem to have any effect. Version: Qdrant v1.9.5

Current Behavior

Use the following requests to observe that points containing the string are not returned:

POST /collections/sha/points/scroll
{
  "must": [
    { "key": "content", "match": { "text": "Focus" } }
  ],
  "limit": 2,
  "with_payload": true
}

or

POST /collections/sha/points/search
{
  "vector": [-0.21554970741271973, 0.16919100284576416, -0.7354516983032227],
  "must": [
    { "key": "content", "match": { "text": "common stop words" } }
  ],
  "limit": 4,
  "with_payload": true
}

Steps to Reproduce

Collection schema: [image]

Search result: [image]

coszio commented 2 months ago

Hi @SevenMpp, I tried to reproduce your case, but it worked correctly for me on both the latest version and v1.9.5.

In my MRE, I ran the following requests in order:

PUT collections/sha
{
  "vectors":{}
}

PUT collections/sha/index 
{
  "field_name": "content",
  "field_schema": {
    "type": "text",
    "tokenizer": "word",
    "min_token_len": 2,
    "max_token_len": 15,
    "lowercase": true
  }
}

PUT collections/sha/points
{
  "points": [
    { "id": 1, "vectors": {}, "payload": {"content": "London is very rainy" } },
    { "id": 2, "vectors": {}, "payload": {"content": "Mexico City is very crowded" } },
    { "id": 3, "vectors": {}, "payload": {"content": "London is in the UK" } },
    { "id": 4, "vectors": {}, "payload": {"content": "Berlin is the capital of techno music" } }
  ]
}

POST collections/sha/points/query
{
  "filter": {"must": { "key": "content", "match": {"text": "is the"}}},
  "with_payload": true
}

Result:

{
  "result": [
    {
      "id": 3,
      "version": 0,
      "score": 0,
      "payload": {
        "content": "London is in the UK"
      },
      "vector": null,
      "order_value": null
    },
    {
      "id": 4,
      "version": 0,
      "score": 0,
      "payload": {
        "content": "Berlin is the capital of techno music"
      },
      "vector": null,
      "order_value": null
    }
  ],
  "status": "ok",
  "time": 0.003149917
}
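The result above is consistent with the text condition requiring every query token to appear in the indexed field. The sketch below is a rough emulation of that behaviour in plain Python, not Qdrant's actual tokenizer: it assumes the "word" tokenizer can be approximated by splitting on non-word characters, and the helper names `tokenize` and `text_match` are hypothetical.

```python
import re


def tokenize(text, min_token_len=2, max_token_len=15, lowercase=True):
    """Approximate the index settings from the request above (an assumption,
    not Qdrant's real implementation): split on non-word characters,
    optionally lowercase, and drop tokens outside the length bounds."""
    if lowercase:
        text = text.lower()
    tokens = re.split(r"\W+", text)
    return {t for t in tokens if min_token_len <= len(t) <= max_token_len}


def text_match(query, content):
    """A point matches only if every query token is present in its content."""
    return tokenize(query) <= tokenize(content)


# Same payloads as in the MRE.
points = {
    1: "London is very rainy",
    2: "Mexico City is very crowded",
    3: "London is in the UK",
    4: "Berlin is the capital of techno music",
}

matching_ids = [pid for pid, content in points.items() if text_match("is the", content)]
print(matching_ids)  # [3, 4]
```

Under this approximation, points 1 and 2 are excluded because they contain "is" but not "the", which agrees with the two hits returned above.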

Could you please provide a reproducible example?

SevenMpp commented 2 months ago

Thank you very much for your reply. What I expect is to combine vector search, full-text matching, and filter conditions. I hoped that points matching the full text would be ranked at the top, but currently only the points that match the full text are returned; all non-matching points are filtered out. Can these be combined and used directly?

PUT /collections/sha
{
  "vectors": { "size": 3, "distance": "Cosine" }
}

PUT collections/sha/index
{
  "field_name": "content",
  "field_schema": {
    "type": "text",
    "tokenizer": "word",
    "min_token_len": 2,
    "max_token_len": 15,
    "lowercase": true
  }
}

PUT collections/sha/points
{
  "points": [
    {
      "id": 1,
      "vectors": [-0.21554970741271973, 0.16919100284576416, -0.7354516983032227],
      "payload": { "content": "Goal" }
    },
    {
      "id": 2,
      "vectors": [1.0798169374465942, -0.24099303781986237, -0.005861682817339897],
      "payload": { "content": "Summarize the text given and extract keywords from the summary.  Identify the input text's language(For example\n\nChinese, English)\t\tSummarize the text" }
    },
    {
      "id": 3,
      "vectors": [0.7270970940589905, -0.43327197432518005, -0.6609529256820679],
      "payload": { "content": "Extract keywords from the summary (Note: do not extract keywords directly from the original text). Keywords should represent the main topics or themes discussed in the text." }
    },
    {
      "id": 4,
      "vectors": [-0.9851164817810059, 0.5659856200218201, -0.2668682336807251],
      "payload": { "content": ". Ignore common stop words (e.g., the, is, and, of). Focus on nouns, noun phrases, and verbs that carry the main ideas" }
    }
  ]
}

POST /collections/sha/points/query
{
  "vector": [1.0798169374465942, -0.24099303781986237, -0.005861682817339897],
  "filter": {
    "should": { "key": "content", "match": { "text": "common stop words" } }
  },
  "limit": 5,
  "with_payload": true
}

Result:

{
  "result": [
    {
      "id": 4,
      "version": 0,
      "score": 0,
      "payload": {
        "content": ". Ignore common stop words (e.g., the, is, and, of). Focus on nouns, noun phrases, and verbs that carry the main ideas"
      },
      "vector": null,
      "order_value": null
    }
  ],
  "status": "ok",
  "time": 0.000490212
}

generall commented 2 months ago

> but currently only "full text match" content is matched. All non-matching content is filtered out. Can it be directly integrated and used?

If you are looking for hybrid search, I would recommend starting here: https://qdrant.tech/documentation/concepts/hybrid-queries/
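As a minimal sketch of what such a request might look like for the collection in this thread, the snippet below builds a body for `POST /collections/sha/points/query` with two prefetches fused by reciprocal rank fusion. This is one possible arrangement and not something confirmed in the thread: it assumes a Qdrant version that provides the Query API with `prefetch` and `fusion`, and the helper name `build_boosted_query` is hypothetical. The idea is that points satisfying the text condition appear in both candidate lists, so fusion tends to rank them higher, while non-matching points can still be returned from the unfiltered list.

```python
import json


def build_boosted_query(vector, text, field="content", limit=5):
    """Build a query body with two candidate lists:
    one restricted by the full-text condition, one unrestricted."""
    text_filter = {"must": [{"key": field, "match": {"text": text}}]}
    return {
        "prefetch": [
            # Nearest neighbours among points that match the text.
            {"query": vector, "filter": text_filter, "limit": limit},
            # Nearest neighbours among all points.
            {"query": vector, "limit": limit},
        ],
        # Fuse both lists with reciprocal rank fusion.
        "query": {"fusion": "rrf"},
        "limit": limit,
        "with_payload": True,
    }


body = build_boosted_query(
    [1.0798169374465942, -0.24099303781986237, -0.005861682817339897],
    "common stop words",
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the request body; the exact ordering of results depends on how the server fuses the two lists, so it is worth checking against the linked documentation.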