milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Creating a collection still succeeds even when the schema uses an unsupported tokenizer, and the Jieba tokenizer is not working. #36751

Open zhuwenxing opened 2 weeks ago

zhuwenxing commented 2 weeks ago

Is there an existing issue for this?

Environment

- Milvus version: zhengbuqian-doc-in-restful-d174d05-20241010
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


-------------------------------- live log call ---------------------------------
[2024-10-10 19:28:13 - DEBUG - ci_test]: (api_request)  : [Connections.has_connection] args: ['default'], kwargs: {} (api_request.py:62)
[2024-10-10 19:28:13 - DEBUG - ci_test]: (api_response) : False  (api_request.py:37)
[2024-10-10 19:28:13 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': '10.104.4.62', 'port': 19530} (api_request.py:62)
[2024-10-10 19:28:13 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)
[2024-10-10 19:28:13 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['full_text_search_collection_smwEFsbH', {'auto_id': False, 'description': 'test collection', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'word', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': ......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)
[2024-10-10 19:28:14 - DEBUG - ci_test]: (api_response) : <Collection>:
-------------
<name>: full_text_search_collection_smwEFsbH
<description>: test collection
<schema>: {'auto_id': False, 'description': 'test collection', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'word', 'de......  (api_request.py:37)
[2024-10-10 19:28:14 - DEBUG - ci_test]: (api_request)  : [Collection.describe] args: [180], kwargs: {} (api_request.py:62)
[2024-10-10 19:28:14 - DEBUG - ci_test]: (api_response) : {'collection_name': 'full_text_search_collection_smwEFsbH', 'auto_id': False, 'num_shards': 1, 'description': 'test collection', 'fields': [{'field_id': 100, 'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}, 'is_primary': True}, {'field_id': 101, 'name': 'word', 'descriptio......  (api_request.py:37)
[2024-10-10 19:28:14 - INFO - ci_test]: collection describe {'collection_name': 'full_text_search_collection_smwEFsbH', 'auto_id': False, 'num_shards': 1, 'description': 'test collection', 'fields': [{'field_id': 100, 'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}, 'is_primary': True}, {'field_id': 101, 'name': 'word', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535, 'enable_tokenizer': 'true', 'tokenizer_params': '{"tokenizer":"unsupported"}'}, 'is_partition_key': True}, {'field_id': 102, 'name': 'sentence', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535, 'enable_tokenizer': 'true', 'tokenizer_params': '{"tokenizer":"unsupported"}'}}, {'field_id': 103, 'name': 'paragraph', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535, 'enable_tokenizer': 'true', 'tokenizer_params': '{"tokenizer":"unsupported"}'}}, {'field_id': 104, 'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535, 'enable_tokenizer': 'true', 'tokenizer_params': '{"tokenizer":"unsupported"}'}}, {'field_id': 105, 'name': 'emb', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 128}}, {'field_id': 106, 'name': 'text_sparse_emb', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>, 'params': {}, 'is_function_output': True}, {'field_id': 107, 'name': 'paragraph_sparse_emb', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>, 'params': {}, 'is_function_output': True}], 'functions': [{'name': 'text_bm25_emb', 'id': 100, 'description': '', 'type': <FunctionType.BM25: 1>, 'params': {}, 'input_field_names': ['text'], 'input_field_ids': [104], 'output_field_names': ['text_sparse_emb'], 'output_field_ids': [106]}, {'name': 'paragraph_bm25_emb', 'id': 101, 'description': '', 'type': <FunctionType.BM25: 1>, 'params': {}, 'input_field_names': ['paragraph'], 'input_field_ids': [103], 'output_field_names': ['paragraph_sparse_emb'], 'output_field_ids': [107]}], 'aliases': [], 'collection_id': 453124184567726198, 'consistency_level': 0, 'properties': {}, 'num_partitions': 16, 'enable_dynamic_field': False} (test_full_text_search.py:158)
FAILED
testcases/test_full_text_search.py:100 (TestCreateCollectionWithFullTextSearchNegative.test_create_collection_for_full_text_search_with_unsupported_tokenizer[unsupported])
self = <test_full_text_search.TestCreateCollectionWithFullTextSearchNegative object at 0x13275c940>
tokenizer = 'unsupported'

    @pytest.mark.tags(CaseLabel.L0)
    @pytest.mark.parametrize("tokenizer", ["unsupported"])
    def test_create_collection_for_full_text_search_with_unsupported_tokenizer(self, tokenizer):
        tokenizer_params = {
            "tokenizer": tokenizer,
        }
        dim = 128
        fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
            FieldSchema(
                name="word",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
                is_partition_key=True,
            ),
            FieldSchema(
                name="sentence",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="paragraph",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="text",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="text_sparse_emb", dtype=DataType.SPARSE_FLOAT_VECTOR),
            FieldSchema(name="paragraph_sparse_emb", dtype=DataType.SPARSE_FLOAT_VECTOR),
        ]
        schema = CollectionSchema(fields=fields, description="test collection")
        text_fields = ["text", "paragraph"]
        for field in text_fields:
            bm25_function = Function(
                name=f"{field}_bm25_emb",
                function_type=FunctionType.BM25,
                input_field_names=[field],
                output_field_names=[f"{field}_sparse_emb"],
                params={},
            )
            schema.add_function(bm25_function)
        collection_w = self.init_collection_wrap(
            name=cf.gen_unique_str(prefix), schema=schema
        )
        res, result = collection_w.describe()
        log.info(f"collection describe {res}")
>       assert not result, "create collection with unsupported tokenizer should be failed"
E       AssertionError: create collection with unsupported tokenizer should be failed
E       assert not True

Expected Behavior

Creating a collection with an unsupported tokenizer should fail.
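
For illustration, here is a minimal standalone sketch of the repro, distilled from the test above. It mirrors the test's schema (the `enable_tokenizer` / `tokenizer_params` field kwargs come straight from the failing test); the host, port, and collection name are placeholders for a local standalone deployment. Expected: the `Collection` call raises a `MilvusException` about the tokenizer. Actual: creation succeeds and `describe()` echoes the bad params back.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(
        name="text",
        dtype=DataType.VARCHAR,
        max_length=65535,
        enable_tokenizer=True,
        tokenizer_params={"tokenizer": "unsupported"},  # not a real tokenizer
    ),
    FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=128),
]
schema = CollectionSchema(fields=fields, description="tokenizer repro")

# Should fail here with an unsupported-tokenizer error; currently it does not.
collection = Collection(name="tokenizer_repro", schema=schema)
print(collection.describe())  # shows tokenizer_params persisted unchecked
```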

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

zhuwenxing commented 2 weeks ago

/assign @zhengbuqian

zhengbuqian commented 2 weeks ago

@aoiasd is working on this

zhengbuqian commented 2 weeks ago

/assign @aoiasd

zhuwenxing commented 6 days ago

This issue is caused by the tokenizer params not being correctly validated and used, so I'm setting it as a critical issue.
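
For what it's worth, here is a sketch of the kind of check that seems to be missing, written in Python rather than Milvus's actual Go internals (the function name and the supported-tokenizer set are assumptions, not the real implementation): validate the tokenizer name up front and reject unknown values before the collection is created.

```python
import json

# Assumed supported set for illustration; the authoritative list lives in
# the server's tokenizer registry, not here.
SUPPORTED_TOKENIZERS = {"standard", "jieba"}

def validate_tokenizer_params(raw_params: str) -> None:
    """Raise ValueError if tokenizer_params names an unsupported tokenizer."""
    params = json.loads(raw_params)
    tokenizer = params.get("tokenizer", "standard")
    if tokenizer not in SUPPORTED_TOKENIZERS:
        raise ValueError(
            f"unsupported tokenizer {tokenizer!r}, "
            f"expected one of {sorted(SUPPORTED_TOKENIZERS)}"
        )

# With a check like this wired into collection creation, the test above
# would get the rejection it asserts on instead of a successful describe().
validate_tokenizer_params('{"tokenizer":"unsupported"}')  # raises ValueError
```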