milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: The performance of text match is not significantly better than that of like expressions, and in some cases, it may even be worse. #36173

Open zhuwenxing opened 1 week ago

zhuwenxing commented 1 week ago

Is there an existing issue for this?

Environment

- Milvus version: longjiquan-text-match-ca1ac6b-20240910
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2024-09-11 11:56:15 - INFO - ci_test]: count 500000 (test_query.py:4422)
Analyze document cost time: 0.16799402236938477
Analyze document cost time: 1.8109688758850098
[2024-09-11 11:56:40 - INFO - ci_test]: expr: TextMatch(title, 'anarchism') (test_query.py:4439)
[2024-09-11 11:56:40 - INFO - ci_test]: text match query cost 0.40004992485046387 res len 18 res data: ["{'id': '105859', 'title': 'anarchism and violence'}", "{'id': '1063286', 'title': 'history of anarchism'}", "{'id': '12', 'title': 'anarchism'}", "{'id': '1211469', 'title': 'lifestyle anarchism'}", "{'id': '1249918', 'title': 'anarchism and capitalism'}", "{'id': '1325940', 'title': 'anarchism and the arts'}", "{'id': '1332770', 'title': 'anarchism and religion'}", "{'id': '1433310', 'title': 'anarchism in africa'}", "{'id': '14936', 'title': 'individualist anarchism'}", "{'id': '1596739', 'title': 'anarchism in south africa'}"] ... (test_query.py:4446)
[2024-09-11 11:56:40 - INFO - ci_test]: expr: title like '%anarchism%' (test_query.py:4448)
[2024-09-11 11:56:40 - INFO - ci_test]: like match query cost 0.28357505798339844 res len 18 res data: ["{'id': '105859', 'title': 'anarchism and violence'}", "{'id': '1063286', 'title': 'history of anarchism'}", "{'id': '12', 'title': 'anarchism'}", "{'id': '1211469', 'title': 'lifestyle anarchism'}", "{'id': '1249918', 'title': 'anarchism and capitalism'}", "{'id': '1325940', 'title': 'anarchism and the arts'}", "{'id': '1332770', 'title': 'anarchism and religion'}", "{'id': '1433310', 'title': 'anarchism in africa'}", "{'id': '14936', 'title': 'individualist anarchism'}", "{'id': '1596739', 'title': 'anarchism in south africa'}"] ... (test_query.py:4455)
[2024-09-11 11:56:40 - INFO - ci_test]: text match cost 0.40004992485046387, like match cost 0.28357505798339844 (test_query.py:4457)
[2024-09-11 11:56:40 - INFO - ci_test]: expr: TextMatch(text, 'anarchism') (test_query.py:4439)
[2024-09-11 11:56:41 - INFO - ci_test]: text match query cost 0.4805600643157959 res len 87 res data: ['{\'id\': \'105859\', \'text\': \'anarchism and violence have become closely connected in popular thought, in part because of a concept of "propaganda of the deed". propaganda of the deed, or "attentát", was espoused by leading anarchists in the late nineteenth century, and was associated with a number of incidents of violence. anarchist thought, however, is quite diverse on the question of violence. in the name of coherence some anarchists have opposed coercion, while others have supported it, particularly in the form of violent revolution on the path to anarchy. anarchism includes a school of thought which rejects all violence (anarcho-pacifism).\'}', '{\'id\': \'1062947\', \'text\': \'loompanics unlimited was an american book seller and publisher specializing in nonfiction on generally unconventional or controversial topics. the topics in their title list included drugs, weapons, anarchism, sex, conspiracy theories, and so on. many of their titles describe some kind of illicit or extralegal actions, such as "counterfeit i.d. made easy", while others are purely informative, like "opium for the masses". loompanics was in business for nearly 30 years. the publisher and editor was michael hoy.\'}', "{'id': '1063286', 'text': 'anarchism is a political philosophy that advocates stateless societies often defined as self-governed voluntary institutions, but that several authors have defined as more specific institutions based on non-hierarchical free associations. anarchism holds the state to be undesirable, unnecessary, or harmful. 
while anti-statism is central, anarchism entails opposing authority or hierarchical organisation in the conduct of human relations, including, but not limited to, the state system.'}", '{\'id\': \'1072099\', \'text\': "paul avrich (august 4, 1931 – february 16, 2006) was a historian of the 19th and early 20th century anarchist movement in russia and the united states. he taught at queens college, city university of new york, for his entire career, from 1961 to his retirement as distinguished professor of history in 1999. he wrote ten books, mostly about anarchism, including topics such as the 1886 haymarket riot, 1921 sacco and vanzetti case, 1921 kronstadt naval base rebellion, and an oral history of the movement. as an ally of the movement\'s major figures, he sought to challenge the portrayal of anarchists as amoral and violent, and collected papers from these figures that he donated as a 20,000-item collection to the library of congress."}', "{'id': '11054', 'text': 'fascism is a form of radical authoritarian nationalism, characterized by dictatorial power, forcible suppression of opposition, and control of industry and commerce, that came to prominence in early 20th-century europe. the first fascist movements emerged in italy during world war i, before it spread to other european countries. opposed to liberalism, marxism, and anarchism, fascism is usually placed on the far-right within the traditional left–right spectrum.'}", "{'id': '111901', 'text': 'stephen pearl andrews (march 22, 1812 – may 21, 1886) was an american individualist anarchist, linguist, political philosopher, outspoken abolitionist, and author of several books on the labor movement and individualist anarchism.'}", '{\'id\': \'112282\', \'text\': \'johann kaspar schmidt (october 25, 1806 – june 26, 1856), better known as max stirner, was a german philosopher. 
he is often seen as one of the forerunners of nihilism, existentialism, psychoanalytic theory, postmodernism, and individualist anarchism. stirner\\\'s main work is "the ego and its own", also known as "the ego and his own" ("der einzige und sein eigentum" in german, which translates literally as "the individual and his property"). this work was first published in 1845 in leipzig, and has since appeared in numerous editions and translations.\'}', '{\'id\': \'1192060\', \'text\': \'anarky is a fictional character, appearing in comic books published by dc comics. co-created by alan grant and norm breyfogle, he first appeared in "detective comics" no. 608 (november 1989), as an adversary of batman. introduced as lonnie machin, a child prodigy with knowledge of radical philosophy and driven to overthrow governments to improve social conditions, stories revolving around anarky often focus on political and philosophical themes. the character, who is named after the philosophy of anarchism, primarily espouses anti-statism; however, multiple social issues have been addressed through the character, including environmentalism, antimilitarism, economic inequality, and political corruption. inspired by multiple sources, early stories featuring the character often included homages to political and philosophical books, and referenced anarchist philosophers and theorists. the inspiration for the creation of the character and its early development was based in grant\\\'s personal interest in anti-authoritarian philosophy and politics.\'}', "{'id': '12', 'text': 'anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. these are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. 
anarchism holds the state to be undesirable, unnecessary and harmful.'}", "{'id': '1200252', 'text': 'the first red scare was a period during the early 20th-century history of the united states marked by a widespread fear of bolshevism and anarchism, due to real and imagined events; real events included those such as the russian revolution and anarchist bombings. at its height in 1919–1920, concerns over the effects of radical political agitation in american society and the alleged spread of communism and anarchism in the american labor movement fueled a general sense of concern if not paranoia.'}"] ... (test_query.py:4446)
[2024-09-11 11:56:41 - INFO - ci_test]: expr: text like '%anarchism%' (test_query.py:4448)
[2024-09-11 11:56:42 - INFO - ci_test]: like match query cost 0.691673755645752 res len 87 res data: ['{\'text\': \'anarchism and violence have become closely connected in popular thought, in part because of a concept of "propaganda of the deed". propaganda of the deed, or "attentát", was espoused by leading anarchists in the late nineteenth century, and was associated with a number of incidents of violence. anarchist thought, however, is quite diverse on the question of violence. in the name of coherence some anarchists have opposed coercion, while others have supported it, particularly in the form of violent revolution on the path to anarchy. anarchism includes a school of thought which rejects all violence (anarcho-pacifism).\', \'id\': \'105859\'}', '{\'text\': \'loompanics unlimited was an american book seller and publisher specializing in nonfiction on generally unconventional or controversial topics. the topics in their title list included drugs, weapons, anarchism, sex, conspiracy theories, and so on. many of their titles describe some kind of illicit or extralegal actions, such as "counterfeit i.d. made easy", while others are purely informative, like "opium for the masses". loompanics was in business for nearly 30 years. the publisher and editor was michael hoy.\', \'id\': \'1062947\'}', "{'text': 'anarchism is a political philosophy that advocates stateless societies often defined as self-governed voluntary institutions, but that several authors have defined as more specific institutions based on non-hierarchical free associations. anarchism holds the state to be undesirable, unnecessary, or harmful. 
while anti-statism is central, anarchism entails opposing authority or hierarchical organisation in the conduct of human relations, including, but not limited to, the state system.', 'id': '1063286'}", '{\'text\': "paul avrich (august 4, 1931 – february 16, 2006) was a historian of the 19th and early 20th century anarchist movement in russia and the united states. he taught at queens college, city university of new york, for his entire career, from 1961 to his retirement as distinguished professor of history in 1999. he wrote ten books, mostly about anarchism, including topics such as the 1886 haymarket riot, 1921 sacco and vanzetti case, 1921 kronstadt naval base rebellion, and an oral history of the movement. as an ally of the movement\'s major figures, he sought to challenge the portrayal of anarchists as amoral and violent, and collected papers from these figures that he donated as a 20,000-item collection to the library of congress.", \'id\': \'1072099\'}', "{'text': 'fascism is a form of radical authoritarian nationalism, characterized by dictatorial power, forcible suppression of opposition, and control of industry and commerce, that came to prominence in early 20th-century europe. the first fascist movements emerged in italy during world war i, before it spread to other european countries. opposed to liberalism, marxism, and anarchism, fascism is usually placed on the far-right within the traditional left–right spectrum.', 'id': '11054'}", "{'text': 'stephen pearl andrews (march 22, 1812 – may 21, 1886) was an american individualist anarchist, linguist, political philosopher, outspoken abolitionist, and author of several books on the labor movement and individualist anarchism.', 'id': '111901'}", '{\'text\': \'johann kaspar schmidt (october 25, 1806 – june 26, 1856), better known as max stirner, was a german philosopher. he is often seen as one of the forerunners of nihilism, existentialism, psychoanalytic theory, postmodernism, and individualist anarchism. 
stirner\\\'s main work is "the ego and its own", also known as "the ego and his own" ("der einzige und sein eigentum" in german, which translates literally as "the individual and his property"). this work was first published in 1845 in leipzig, and has since appeared in numerous editions and translations.\', \'id\': \'112282\'}', '{\'text\': \'anarky is a fictional character, appearing in comic books published by dc comics. co-created by alan grant and norm breyfogle, he first appeared in "detective comics" no. 608 (november 1989), as an adversary of batman. introduced as lonnie machin, a child prodigy with knowledge of radical philosophy and driven to overthrow governments to improve social conditions, stories revolving around anarky often focus on political and philosophical themes. the character, who is named after the philosophy of anarchism, primarily espouses anti-statism; however, multiple social issues have been addressed through the character, including environmentalism, antimilitarism, economic inequality, and political corruption. inspired by multiple sources, early stories featuring the character often included homages to political and philosophical books, and referenced anarchist philosophers and theorists. the inspiration for the creation of the character and its early development was based in grant\\\'s personal interest in anti-authoritarian philosophy and politics.\', \'id\': \'1192060\'}', "{'text': 'anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. these are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. 
anarchism holds the state to be undesirable, unnecessary and harmful.', 'id': '12'}", "{'text': 'the first red scare was a period during the early 20th-century history of the united states marked by a widespread fear of bolshevism and anarchism, due to real and imagined events; real events included those such as the russian revolution and anarchist bombings. at its height in 1919–1920, concerns over the effects of radical political agitation in american society and the alleged spread of communism and anarchism in the american labor movement fueled a general sense of concern if not paranoia.', 'id': '1200252'}"] ... (test_query.py:4455)
[2024-09-11 11:56:42 - INFO - ci_test]: text match cost 0.4805600643157959, like match cost 0.691673755645752 (test_query.py:4457)

test code

    @pytest.mark.tags(CaseLabel.L2)
    def test_query_text_match_vs_like_use_hotpotqa_dataset(self):
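        """
        target: compare the latency of TextMatch and like expressions on the hotpotqa corpus
        method: 1. insert 500,000 documents with match-enabled title and text fields
                2. query each field with TextMatch and with like, using the same word
        expected: both queries succeed and their latencies are logged
        """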
        # 1. initialize with data
        analyzer_params = {
            "tokenizer": "default",
        }
        dim = 32
        default_fields = [
            FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=65535, is_primary=True),
            FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=65535, enable_match=True, analyzer_params=analyzer_params),
            FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535, enable_match=True, analyzer_params=analyzer_params),
            FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim)
        ]
        default_schema = CollectionSchema(fields=default_fields, description="test collection")

        print("\nCreate collection for hotpotqa dataset...")

        collection_w = self.init_collection_wrap(name=cf.gen_unique_str(prefix), schema=default_schema)
        dataset_name = "hotpotqa"
        docs = load_dataset("Cohere/beir-embed-english-v3", f"{dataset_name}-corpus", split="train")
        batch_size = 5000
        cnt = 0
        data = []
        for doc in docs:
            data.append(
                {
                    "id": doc['_id'],
                    "title": doc['title'].lower(),
                    "text": doc['text'].lower(),
                    "emb": cf.gen_vectors(1, dim)[0]
                }
            )
            if len(data) >= batch_size:
                collection_w.insert(data)
                log.info(f"batch insert finished")
                # collection_w.flush()
                cnt += len(data)
                data = []
            if cnt >= 100*batch_size:
                break
        log.info(f"count {cnt}")
        df = pd.DataFrame(docs[:5*batch_size])
        collection_w.flush()
        collection_w.create_index("emb", {"index_type": "IVF_SQ8", "metric_type": "L2", "params": {"nlist": 64}})
        collection_w.load()
        # analyze the corpus to get the word frequencies, then use them to create the expr and ground truth
        title_word_freq = cf.analyze_documents(df["title"].tolist())
        text_word_freq = cf.analyze_documents(df["text"].tolist())
        wf_map = {
            "title": title_word_freq,
            "text": text_word_freq,
        }
        text_fields = ["title", "text"]
        # query single field for one word
        for field in text_fields:
            text_match_expr = f"TextMatch({field}, '{list(wf_map[field].keys())[0]}')"
            like_match_expr = f"{field} like '%{list(wf_map[field].keys())[0]}%'"
            log.info(f"expr: {text_match_expr}")
            t0 = time.time()
            res, _ = collection_w.query(
                expr=text_match_expr,
                output_fields=["id", field]
            )
            tt = time.time() - t0
            log.info(f"text match query cost {tt} res len {len(res)} res {res}")
            text_match_tt = tt
            log.info(f"expr: {like_match_expr}")
            t0 = time.time()
            res, _ = collection_w.query(
                expr=like_match_expr,
                output_fields=["id", field]
            )
            tt = time.time() - t0
            log.info(f"like match query cost {tt} res len {len(res)} res {res}")
            like_match_tt = tt
            log.info(f"text match cost {text_match_tt}, like match cost {like_match_tt}")
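
The test above times each expression once with `time.time()`, so a single slow call can dominate the comparison. A minimal sketch of a steadier measurement follows, assuming that repeating the query is acceptable in this test. The helper name `time_call` and its `repeats` and `warmup` parameters are hypothetical and not part of the Milvus test framework.

```python
import statistics
import time


def time_call(fn, repeats=5, warmup=1):
    """Return the median wall-clock time in seconds of calling fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()  # monotonic clock, better suited to short intervals
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)


# Hypothetical usage with the names from the test above:
# text_match_tt = time_call(
#     lambda: collection_w.query(expr=text_match_expr, output_fields=["id", field])
# )
```

Reporting the median of several runs makes the TextMatch versus like comparison less sensitive to one-off effects such as a cold cache.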

Expected Behavior

Text match should perform much better than like.

Steps To Reproduce

No response

Milvus Log

log.log

Anything else?

No response

zhuwenxing commented 1 week ago

/assign @longjiquan /assign @sunby

PTAL

zhuwenxing commented 1 week ago

[2024-09-11 19:20:35 - INFO - ci_test]: expr: TextMatch(title, 'fight') (test_query.py:4532)
[2024-09-11 19:20:37 - INFO - ci_test]: text match query cost 1.6304931640625 res len 11293 res data: ["{'id': '0', 'title': 'fight break free set shoulder last it. fire management economic they quite most camera total.\\nmrs rise citizen owner. hope recognize stand his book money.'}", "{'id': '100024', 'title': 'rate color age time order. pm law interesting itself laugh.\\nresult stage manage must fight whatever head thus. hand network white color community almost owner top.'}", "{'id': '10005', 'title': 'pass produce enough may answer. allow turn fight just and whole.\\nprepare tell heavy garden newspaper month. seven music lawyer thus those amount glass. particular while color seem.'}", "{'id': '100077', 'title': 'baby relate hand herself many. improve event feeling. nothing right fight truth ask fine.\\nfilm north i appear. current as door yet work degree town. finally individual expect forward member.'}", "{'id': '100106', 'title': 'front international answer so fight military. unit political knowledge become name town. reason six ok never half less bad.'}", "{'id': '100204', 'title': 'protect state like performance. similar finish claim character maybe. situation chair throw fight care draw international factor.'}", "{'id': '100222', 'title': 'account physical finally only republican drive firm. place site drug hear anything measure.\\nstage view keep step candidate similar. fight letter blood think hundred particular individual.'}", "{'id': '100326', 'title': 'where face the pattern. each team while wait idea can worry executive.\\nmeeting about themselves old economic. sense food choice fight will.'}", "{'id': '100348', 'title': 'him thing beyond exactly. fight main key recent recently. different surface material development value rather state school.'}", "{'id': '100372', 'title': 'story network agreement page body social. fight modern or respond doctor just development.\\nfigure look picture small down. just speech believe expect may some. 
describe defense body key.'}"] ... (test_query.py:4539)
[2024-09-11 19:20:37 - INFO - ci_test]: expr: title like '%fight%' (test_query.py:4541)
[2024-09-11 19:20:39 - INFO - ci_test]: like match query cost 1.6840407848358154 res len 11293 res data: ["{'id': '0', 'title': 'fight break free set shoulder last it. fire management economic they quite most camera total.\\nmrs rise citizen owner. hope recognize stand his book money.'}", "{'id': '100024', 'title': 'rate color age time order. pm law interesting itself laugh.\\nresult stage manage must fight whatever head thus. hand network white color community almost owner top.'}", "{'id': '10005', 'title': 'pass produce enough may answer. allow turn fight just and whole.\\nprepare tell heavy garden newspaper month. seven music lawyer thus those amount glass. particular while color seem.'}", "{'id': '100077', 'title': 'baby relate hand herself many. improve event feeling. nothing right fight truth ask fine.\\nfilm north i appear. current as door yet work degree town. finally individual expect forward member.'}", "{'id': '100106', 'title': 'front international answer so fight military. unit political knowledge become name town. reason six ok never half less bad.'}", "{'id': '100204', 'title': 'protect state like performance. similar finish claim character maybe. situation chair throw fight care draw international factor.'}", "{'id': '100222', 'title': 'account physical finally only republican drive firm. place site drug hear anything measure.\\nstage view keep step candidate similar. fight letter blood think hundred particular individual.'}", "{'id': '100326', 'title': 'where face the pattern. each team while wait idea can worry executive.\\nmeeting about themselves old economic. sense food choice fight will.'}", "{'id': '100348', 'title': 'him thing beyond exactly. fight main key recent recently. different surface material development value rather state school.'}", "{'id': '100372', 'title': 'story network agreement page body social. fight modern or respond doctor just development.\\nfigure look picture small down. just speech believe expect may some. 
describe defense body key.'}"] ... (test_query.py:4548)
[2024-09-11 19:20:39 - INFO - ci_test]: text match cost 1.6304931640625, like match cost 1.6840407848358154 (test_query.py:4550)
[2024-09-11 19:20:39 - INFO - ci_test]: expr: TextMatch(text, 'sure') (test_query.py:4532)
[2024-09-11 19:21:25 - INFO - ci_test]: text match query cost 46.57398080825806 res len 0 res data: []  (test_query.py:4539)
[2024-09-11 19:21:25 - INFO - ci_test]: expr: text like '%sure%' (test_query.py:4541)
[2024-09-11 19:21:26 - INFO - ci_test]: like match query cost 0.5035851001739502 res len 0 res data: []  (test_query.py:4548)
[2024-09-11 19:21:26 - INFO - ci_test]: text match cost 46.57398080825806, like match cost 0.5035851001739502 (test_query.py:4550)

When no result is hit, the performance of text match is very bad: about 46.6 s versus 0.5 s for like. Log: querynode.log
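
The gap can be read straight from the summary lines in the log. The sketch below parses one such line and computes the ratio of the two costs. The regular expression and the helper name `cost_ratio` are assumptions written for this log format, not part of the test suite.

```python
import re

# Matches the summary line emitted at the end of each comparison.
SUMMARY_RE = re.compile(r"text match cost ([0-9.]+), like match cost ([0-9.]+)")


def cost_ratio(line):
    """Return text-match cost divided by like cost, or None if the line has no summary."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    text_cost, like_cost = float(m.group(1)), float(m.group(2))
    return text_cost / like_cost


line = (
    "[2024-09-11 19:21:26 - INFO - ci_test]: text match cost 46.57398080825806, "
    "like match cost 0.5035851001739502 (test_query.py:4550)"
)
ratio = cost_ratio(line)
print(f"text match / like = {ratio:.1f}x")
```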

You can also get the logs from Loki. Cluster: 4am, ns: chaos-testing, pods:

text-match-test-v12-etcd-0                                  1/1     Running       0               172m    10.104.21.228   4am-node24   <none>           <none>
text-match-test-v12-etcd-1                                  1/1     Running       0               172m    10.104.23.29    4am-node27   <none>           <none>
text-match-test-v12-etcd-2                                  1/1     Running       0               172m    10.104.30.161   4am-node38   <none>           <none>
text-match-test-v12-kafka-0                                 2/2     Running       2 (172m ago)    172m    10.104.21.231   4am-node24   <none>           <none>
text-match-test-v12-kafka-1                                 2/2     Running       2 (171m ago)    172m    10.104.23.32    4am-node27   <none>           <none>
text-match-test-v12-kafka-2                                 2/2     Running       2 (171m ago)    172m    10.104.19.3     4am-node28   <none>           <none>
text-match-test-v12-kafka-exporter-7bf5767c58-pqbsc         1/1     Running       5 (171m ago)    172m    10.104.14.65    4am-node18   <none>           <none>
text-match-test-v12-kafka-zookeeper-0                       1/1     Running       0               172m    10.104.21.230   4am-node24   <none>           <none>
text-match-test-v12-kafka-zookeeper-1                       1/1     Running       0               172m    10.104.23.33    4am-node27   <none>           <none>
text-match-test-v12-kafka-zookeeper-2                       1/1     Running       0               172m    10.104.19.4     4am-node28   <none>           <none>
text-match-test-v12-milvus-datanode-7654d47bf5-v86jq        1/1     Running       0               170m    10.104.14.66    4am-node18   <none>           <none>
text-match-test-v12-milvus-indexnode-76b6886fd5-j4wxl       1/1     Running       0               150m    10.104.14.92    4am-node18   <none>           <none>
text-match-test-v12-milvus-mixcoord-5568d97cf-87f9x         1/1     Running       0               170m    10.104.13.180   4am-node16   <none>           <none>
text-match-test-v12-milvus-proxy-658db7f88c-cdwx9           1/1     Running       0               170m    10.104.1.10     4am-node10   <none>           <none>
text-match-test-v12-milvus-querynode-0-965b9cc44-rn2cr      1/1     Running       1 (4m49s ago)   170m    10.104.4.25     4am-node11   <none>           <none>
text-match-test-v12-minio-0                                 1/1     Running       0               172m    10.104.21.229   4am-node24   <none>           <none>
text-match-test-v12-minio-1                                 1/1     Running       0               172m    10.104.30.163   4am-node38   <none>           <none>
text-match-test-v12-minio-2                                 1/1     Running       0               172m    10.104.19.253   4am-node28   <none>           <none>
text-match-test-v12-minio-3                                 1/1     Running       0               172m    10.104.23.30    4am-node27   <none>           <none>

longjiquan commented 4 days ago

We should use a lower consistency level to test the performance.

/unassign /assign @zhuwenxing
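
A rerun with a lower consistency level could look like the sketch below. It assumes a pymilvus `Collection`-style object whose `query` method accepts a `consistency_level` keyword and returns the result list directly. The `collection_w` wrapper in the test returns a tuple, so the call would need adapting there. The helper name `timed_query` is hypothetical.

```python
import time


def timed_query(collection, expr, output_fields, consistency_level="Eventually"):
    """Run one query at the given consistency level and return (result, seconds)."""
    t0 = time.perf_counter()
    res = collection.query(
        expr=expr,
        output_fields=output_fields,
        # Assumption: the collection object forwards this keyword to Milvus.
        consistency_level=consistency_level,
    )
    return res, time.perf_counter() - t0


# Hypothetical usage:
# res, tt = timed_query(collection, "TextMatch(text, 'sure')", ["id", "text"])
```

With a lower level such as `Eventually`, the measured time should reflect the filter evaluation rather than any wait for the query node to catch up with recent writes.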