[Bug]: Can't add to collection due to error "The data in the same column must be of the same type"

NasonZ commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version: 2.2.12
- Deployment mode(standalone or cluster): Stand alone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): tried both [pymilvus-2.2.14, pymilvus-2.2.13]
- OS(Ubuntu or CentOS): Ubuntu 22.04
- CPU/Memory: Ryzen 9 3900X / 32GB 
- GPU: GTX 3080
- Others:

Current Behavior

When I try to add data to my collection I get this error:

RPC error: [insert_rows], <DataNotMatchException: (code=1, message=The data in the same column must be of the same type.)>

Expected Behavior

I expect my data to be added to my collection as shown in documentation - https://milvus.io/docs/insert_data.md

Steps To Reproduce

#script to reproduce the error

import numpy as np 

dummy_data = [{'title': 'Varied jokes',
   'url': 'https://jokesRus.com',
   'snippets': "Why don't scientists trust atoms? Because they make up everything!",
   'embedding': np.random.rand(768).tolist()},
{'title': 'Varied jokes',
 'url': 'https://jokesRus.com',
 'snippets': "I'm reading a book about anti-gravity. It's impossible to put down!",
 'embedding': np.random.rand(768).tolist()},
 {'title': 'General jokes',
 'url': 'https://Fuknee.com',
 'snippets': "Did you hear about the mathematician who's afraid of negative numbers? He'll stop at nothing to avoid them!",
 'embedding': np.random.rand(768).tolist()}]

#dummy_data

from pymilvus import connections
connections.connect(
  alias="default",
  user='username',
  password='password',
  host='localhost',
  port='19530'
)

from pymilvus import Collection, DataType, FieldSchema, CollectionSchema, connections

#create fields
_id = FieldSchema(
  name="id",
  dtype=DataType.INT64,
  is_primary=True,
)
page_title = FieldSchema(
  name="title",
  dtype=DataType.VARCHAR,  #STRING gives SchemaNotReadyException error
  max_length=200
)
page_url = FieldSchema(
    name="url", 
    dtype=DataType.VARCHAR,   
    max_length=200
)
snippets = FieldSchema(name="snippets",
                       dtype=DataType.VARCHAR,
                       max_length=200 
)  
embedding = FieldSchema(name="embedding", 
                        dtype=DataType.FLOAT_VECTOR, 
                        dim=768)

# Create the collection schema
schema = CollectionSchema(
  fields=[_id, page_title, page_url, snippets, embedding], #[_id, page_title, page_url, p_type, app, id_in_app, parent_id_in_app, date_created, date_last_edit, snippets, embedding],[page_title, page_url, embedding]
  description="Dummy Dataset Collectio",
  enable_dynamic_field=True
)

# Create the collection
collection_name = "dummy_dataset"

#create collection
dummy_collection = Collection(
    name=collection_name,
    schema=schema,
    using='default',
    shards_num=2
    )

# Define the fields to test
fields_to_test = ['id', 'title', 'url', 'snippets', 'embedding']

# Initialize an empty string to store the error messages
error_log = ""

# Loop through each field and try inserting data one field at a time
for field in fields_to_test:
    try:
        data = {field: [record[field] for record in dummy_data]}
        inserted_ids = dummy_collection.insert(data)
    except Exception as e:
        error_log += f"Error for field '{field}': {str(e)}\n"

print(error_log)


### Milvus Log

RPC error: [insert_rows], <DataNotMatchException: (code=1, message=The data in the same column must be of the same type.)>, <Time:{'RPC start': '2023-07-25 21:13:58.810130', 'RPC error': '2023-07-25 21:13:58.810288'}>
RPC error: [insert_rows], <DataNotMatchException: (code=1, message=The data in the same column must be of the same type.)>, <Time:{'RPC start': '2023-07-25 21:13:58.811117', 'RPC error': '2023-07-25 21:13:58.811193'}>
RPC error: [insert_rows], <DataNotMatchException: (code=1, message=The data in the same column must be of the same type.)>, <Time:{'RPC start': '2023-07-25 21:13:58.811776', 'RPC error': '2023-07-25 21:13:58.811878'}>
RPC error: [insert_rows], <DataNotMatchException: (code=1, message=The data in the same column must be of the same type.)>, <Time:{'RPC start': '2023-07-25 21:13:58.812349', 'RPC error': '2023-07-25 21:13:58.813201'}>



### Anything else?

I'm running Milvus via Docker from the terminal and using a Jupyter notebook in VS Code to execute the SDK code given above.

yanliang567 commented 1 year ago

collection.insert() supports column-based list data; it does not support inserting the data one field at a time.
The code snippet above also has no id (INT64) values in dummy_data, even though the schema declares id as the primary key.
Please refer to https://milvus.io/api-reference/pymilvus/v2.2.x/Collection/insert().md for details.

I updated the code as below, and it works well:


import numpy as np

#dummy_data

from pymilvus import connections
connections.connect(
  alias="default",
  user='username',
  password='password',
  host='1xx.xx.x.x',
  port='19530'
)

from pymilvus import Collection, DataType, FieldSchema, CollectionSchema, connections

#create fields
_id = FieldSchema(
  name="id",
  dtype=DataType.INT64,
  is_primary=True,
)
page_title = FieldSchema(
  name="title",
  dtype=DataType.VARCHAR,  #STRING gives SchemaNotReadyException error
  max_length=200
)
page_url = FieldSchema(
    name="url",
    dtype=DataType.VARCHAR,
    max_length=200
)
snippets = FieldSchema(name="snippets",
                       dtype=DataType.VARCHAR,
                       max_length=200
)
embedding = FieldSchema(name="embedding",
                        dtype=DataType.FLOAT_VECTOR,
                        dim=768)

# Create the collection schema
schema = CollectionSchema(
  fields=[_id, page_title, page_url, snippets, embedding], #[_id, page_title, page_url, p_type, app, id_in_app, parent_id_in_app, date_created, date_last_edit, snippets, embedding],[page_title, page_url, embedding]
  description="Dummy Dataset Collectio",
  enable_dynamic_field=True
)

# Create the collection
collection_name = "dummy_dataset"

#create collection
dummy_collection = Collection(
    name=collection_name,
    schema=schema,
    using='default',
    shards_num=2
    )

# Column-based data: one inner list per field, in the same order as the schema
dummy_data2 = [
    [1, 2, 3],  # id
    ['Varied jokes', 'Varied jokes', 'General jokes'],  # title
    ['https://jokesRus.com', 'https://jokesRus.com', 'https://Fuknee.com'],  # url
    ["Why don't scientists trust atoms?", "I'm reading a book about anti-gravity.", "Did you hear about the mathematician"],  # snippets
    [np.random.rand(768).tolist(), np.random.rand(768).tolist(), np.random.rand(768).tolist()]  # embedding
]

inserted_ids = dummy_collection.insert(dummy_data2)
print(inserted_ids)
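
If the data starts out as a list of row dicts, as in the original dummy_data, it can be pivoted into this column layout first. The following is a minimal sketch, not part of pymilvus: rows_to_columns, FIELD_ORDER and the sequential ID scheme are hypothetical names and choices made for illustration, and random.random() stands in for real embeddings.

```python
import random

# Field order must match the order of fields in the collection schema.
FIELD_ORDER = ["id", "title", "url", "snippets", "embedding"]


def rows_to_columns(rows, field_order, start_id=1):
    """Pivot a list of row dicts into one list per field (hypothetical helper).

    The "id" column is generated as sequential integers because the
    original row dicts carry no id values.
    """
    columns = []
    for name in field_order:
        if name == "id":
            columns.append(list(range(start_id, start_id + len(rows))))
        else:
            columns.append([row[name] for row in rows])
    return columns


rows = [
    {"title": "Varied jokes",
     "url": "https://jokesRus.com",
     "snippets": "Why don't scientists trust atoms?",
     "embedding": [random.random() for _ in range(768)]},
    {"title": "General jokes",
     "url": "https://Fuknee.com",
     "snippets": "Did you hear about the mathematician",
     "embedding": [random.random() for _ in range(768)]},
]

columns = rows_to_columns(rows, FIELD_ORDER)
# With a live connection, the result would be passed on as:
# dummy_collection.insert(columns)
```

Each inner list then holds the values of exactly one field, which is the shape used by dummy_data2 above.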

yanliang567 commented 1 year ago

/assign @NasonZ /unassign

NasonZ commented 1 year ago

Thanks for the response. I want to embed each snippet in a JSON document and store it in Milvus along with its metadata. To check that I'm on the right path: based on your example, would the correct way to do this be the following?

Source:

{ 
    "title": "Varied jokes",
    "url": "https://jokesRus.com",
    "type": "page",
    "app": "safari",
    "id_in_app": "14680104",
    "parent_id_in_app": "Root",
    "date_created": "2023-03-18T18:57:49.635Z",
    "date_last_edit": "2023-03-20T17:48:29.821Z",
    "snippets": [
      {
        "topic": "-",
        "content": "Why don't scientists trust atoms? Because they make up everything!",
        "references": []
      },
      {
        "topic": "-",
        "content": "I'm reading a book about anti-gravity. It's impossible to put down!",
        "references": []
      },
      {
        "topic": "-",
        "content": "Why don't skeletons fight each other? They don't have the guts!",
        "references": []
      },
      {
        "topic": "-",
        "content": "Why did the scarecrow win an award? Because he was outstanding in his field!",
        "references": []
      },
      {
        "topic": "-",
        "content": "Did you hear about the mathematician who's afraid of negative numbers? He'll stop at nothing to avoid them!",
        "references": []
      },...

Milvus entry:

[
    [1, 2, 3, 4, 5],  # unique ID for each snippet
    ['Varied jokes', 'Varied jokes', 'Varied jokes', 'Varied jokes', 'Varied jokes'],  # metadata from the source JSON, repeated once per snippet
    ['https://jokesRus.com', 'https://jokesRus.com', 'https://jokesRus.com', 'https://jokesRus.com', 'https://jokesRus.com'],
    ["Why don't scientists trust atoms?", "I'm reading a book about anti-gravity.", "Why don't skeletons fight each other?", "Why did the scarecrow win an award?", "Did you hear about the mathematician"],
    [np.random.rand(768).tolist(), np.random.rand(768).tolist(), np.random.rand(768).tolist(), np.random.rand(768).tolist(), np.random.rand(768).tolist()]
]

I want to be sure I understand how to insert document chunks and their metadata.
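
To make the question concrete, here is a rough sketch of the flattening step I have in mind. It assumes the source JSON is already loaded as a Python dict. document_to_columns and fake_embed are hypothetical names of my own (fake_embed is only a placeholder for a real embedding model), and the sequential IDs are an assumption.

```python
def fake_embed(text):
    """Placeholder embedding (assumption): a real model would go here."""
    return [0.0] * 768


def document_to_columns(doc, embed, start_id=1):
    """Flatten one source document into column lists, one row per snippet.

    Page-level metadata (title, url) is repeated for every snippet so that
    all columns end up with the same length.
    """
    contents = [snippet["content"] for snippet in doc["snippets"]]
    n = len(contents)
    return [
        list(range(start_id, start_id + n)),   # id
        [doc["title"]] * n,                    # title
        [doc["url"]] * n,                      # url
        contents,                              # snippets
        [embed(text) for text in contents],    # embedding
    ]


doc = {
    "title": "Varied jokes",
    "url": "https://jokesRus.com",
    "snippets": [
        {"topic": "-",
         "content": "Why don't scientists trust atoms? Because they make up everything!",
         "references": []},
        {"topic": "-",
         "content": "I'm reading a book about anti-gravity. It's impossible to put down!",
         "references": []},
    ],
}

columns = document_to_columns(doc, fake_embed)
# With a live connection: dummy_collection.insert(columns)
```

Since the snippets field is declared as VARCHAR with max_length=200, I assume longer snippet text would need to be truncated or the limit raised before inserting.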

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
