milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [bulk load]Bulk load raises error `Numpy parse: illegal data type <f8 for field float_scalar` when data type matches field schema #19696

Closed by zhuwenxing 1 year ago

zhuwenxing commented 2 years ago


Environment

- Milvus version: master-20221010-85e04d84
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.0.dev36
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

'failed_reason': 'Numpy parse: illegal data type <f8 for field float_scalar'

[2022-10-11 11:55:28 - DEBUG - ci_test]: (api_request)  : [get_bulk_load_state] args: [436573894077518555, 90, 'default'], kwargs: {} (api_request.py:56)
[2022-10-11 11:55:28 - DEBUG - ci_test]: (api_response) : <Bulk load state:
    - taskID          : 436573894077518555,
    - state           : Failed,
    - row_count       : 0,
    - infos           : {'files': 'uid.npy,vectors.npy,float_scalar.npy', 'collection': 'test_19QsHiG6', 'partition': '_default', 'failed_reason': 'Numpy parse: illegal data type ......  (api_request.py:31)
[2022-10-11 11:55:28 - INFO - ci_test]: after waiting, there are 0 pending tasks (utility_wrapper.py:144)
[2022-10-11 11:55:28 - INFO - ci_test]: task state distribution: {'success': set(), 'failed': {436573894077518555}, 'in_progress': set()} (utility_wrapper.py:145)
[2022-10-11 11:55:28 - DEBUG - ci_test]: {436573894077518555: <Bulk load state:
    - taskID          : 436573894077518555,
    - state           : Failed,
    - row_count       : 0,
    - infos           : {'files': 'uid.npy,vectors.npy,float_scalar.npy', 'collection': 'test_19QsHiG6', 'partition': '_default', 'failed_reason': 'Numpy parse: illegal data type <f8 for field float_scalar'},
    - id_ranges       : [],
    - create_ts       : 2022-10-11 11:55:27
>} (utility_wrapper.py:146)
[2022-10-11 11:55:28 - INFO - ci_test]: wait for bulk load tasks completed failed, cost time: 2.065891981124878 (utility_wrapper.py:151)
[2022-10-11 11:55:28 - INFO - ci_test]: bulk load state:False in 3.241753101348877 (test_bulk_load.py:1156)

Expected Behavior

all test cases passed

Steps To Reproduce

    @pytest.mark.tags(CaseLabel.L3)
    @pytest.mark.parametrize("auto_id", [True])
    @pytest.mark.parametrize("dim", [128])  # 128
    @pytest.mark.parametrize("entities", [1000])  # 1000
    @pytest.mark.parametrize("with_int_field", [True, False])
    def test_with_uid_n_int_numpy(self, auto_id, dim, entities, with_int_field):
        """
        collection schema 1: [pk, float_vector]
        data file: vectors.npy and uid.npy
        Steps:
        1. create collection
        2. import data
        3. verify failed with errors
        """
        data_fields = [df.pk_field, df.vec_field, df.float_field]
        fields = [
            cf.gen_int64_field(name=df.pk_field, is_primary=True),
            cf.gen_float_field(name=df.float_field),
            cf.gen_float_vec_field(name=df.vec_field, dim=dim),
        ]
        # if not auto_id:
        #     data_fields.append(df.pk_field)
        if with_int_field:
            data_fields.append(df.int_field)
            fields.append(cf.gen_int64_field(name=df.int_field))
        files = prepare_bulk_load_numpy_files(
            minio_endpoint=self.minio_endpoint,
            bucket_name=self.bucket_name,
            rows=entities,
            dim=dim,
            data_fields=data_fields,
            force=True,
        )
        self._connect()
        c_name = cf.gen_unique_str()
        schema = cf.gen_collection_schema(fields=fields, auto_id=auto_id)
        self.collection_wrap.init_collection(c_name, schema=schema)

        # import data
        t0 = time.time()
        task_ids, _ = self.utility_wrap.bulk_load(
            collection_name=c_name, is_row_based=False, files=files
        )
        logging.info(f"bulk load task ids:{task_ids}")
        success, states = self.utility_wrap.wait_for_bulk_load_tasks_completed(
            task_ids=task_ids, timeout=90
        )
        tt = time.time() - t0
        log.info(f"bulk load state:{success} in {tt}")
        assert success
        num_entities = self.collection_wrap.num_entities
        log.info(f" collection entities: {num_entities}")
        assert num_entities == entities

        # verify imported data is available for search
        self.collection_wrap.load()
        # log.info(f"query seg info: {self.utility_wrap.get_query_segment_info(c_name)[0]}")
        search_data = cf.gen_vectors(1, dim)
        search_params = {"metric_type": "L2", "params": {"nprobe": 2}}
        res, _ = self.collection_wrap.search(
            search_data,
            df.vec_field,
            param=search_params,
            limit=1,
            check_task=CheckTasks.check_search_results,
            check_items={"nq": 1, "limit": 1},
        )


### Milvus Log

client log:
[client.log](https://github.com/milvus-io/milvus/files/9751028/client.log)

server log:

[test-milvus-bulk-load-10-11-12-11.zip](https://github.com/milvus-io/milvus/files/9751078/test-milvus-bulk-load-10-11-12-11.zip)

### Anything else?

If the schema only has int and float_vector fields, this test case passes.

The error is `illegal data type <f8 for field float_scalar`, but I checked the data type of float_scalar.npy: it is float64, so it should match.
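
As background, `<f8` is NumPy's dtype string for a little-endian 8-byte float, i.e. float64, which is NumPy's default dtype for Python floats. A quick sketch (the file name is just the one from the report) to check what dtype a .npy file actually carries:

```python
import numpy as np

# NumPy's default float dtype is float64, whose dtype string is '<f8'
# ('<' = little-endian, 'f' = float, '8' = 8 bytes).
arr = np.array([1.0, 2.0, 3.0])
print(arr.dtype)      # float64
print(arr.dtype.str)  # '<f8' on little-endian machines

# The same check on a saved file, e.g. the float_scalar.npy used above:
np.save("float_scalar.npy", arr)
print(np.load("float_scalar.npy").dtype)
```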
zhuwenxing commented 2 years ago

/assign @yhmo /unassign @yanliang567

zhuwenxing commented 2 years ago

The data type for the float scalar is float, but from the error message it seems to be detected as the `<f8` data type.

zhuwenxing commented 2 years ago

After all, it was still caused by a datatype mismatch. In Milvus, DataType.FLOAT means float32, but I set it as float64 because that is NumPy's default data type for floats.

@yhmo But I think the error message is not very readable:

  1. the message just says the data type is illegal, but does not point out the expected data type.
  2. `<f8` is not human readable; it would be better to render it as float64
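
For anyone hitting the same error, a minimal workaround sketch (assuming, per the comment above, that Milvus DataType.FLOAT maps to float32) is to cast the array to `np.float32` before saving the .npy file:

```python
import numpy as np

rows = 1000
# np.random.rand returns float64 ('<f8'), which the bulk load
# rejects for a field declared as DataType.FLOAT.
data = np.random.rand(rows)

# Cast explicitly to float32 ('<f4') so the .npy dtype matches the schema.
np.save("float_scalar.npy", data.astype(np.float32))

print(np.load("float_scalar.npy").dtype)  # float32
```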
yhmo commented 2 years ago

Now the failed_reason becomes: `illegal data type Double of numpy file for scalar field 'xxx' with type Float`

yhmo commented 2 years ago

#20176 has been merged into the main branch.