milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: inverted index does not support string longer than 65530 #37855

Open sunby opened 3 days ago

sunby commented 3 days ago

Is there an existing issue for this?

Environment

- Milvus version: master
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

If you insert strings longer than 65530, Milvus does not return any warning or error, but tantivy's log prints a warning.
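One way to notice affected values on the client side is to check lengths before inserting. The following is a minimal sketch, not a Milvus or pymilvus API: `find_oversized` and `MAX_TOKEN_LEN` are hypothetical names, the limit of 65530 is the one quoted in this report, and it is assumed here to be counted in UTF-8 bytes.

```python
# Hypothetical client-side guard; not part of Milvus or pymilvus.
# Assumption: the limit from this report (65530) is counted in UTF-8 bytes.
MAX_TOKEN_LEN = 65530


def find_oversized(values, limit=MAX_TOKEN_LEN):
    """Return (index, byte_length) for each string whose UTF-8 encoding exceeds limit."""
    oversized = []
    for i, value in enumerate(values):
        n = len(value.encode("utf-8"))
        if n > limit:
            oversized.append((i, n))
    return oversized


rows = ["short", "a" * 65530, "b" * 65535]
for index, size in find_oversized(rows):
    print(f"row {index}: {size} bytes exceeds {MAX_TOKEN_LEN}")
```

Only the third row is reported; a string of exactly 65530 bytes is within the limit.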

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

zhuwenxing commented 3 days ago

> insert strings longer than 65530

Is this referring to the length of the entire string or the length of a single word?

sunby commented 3 days ago

> > insert strings longer than 65530
>
> Is this referring to the length of the entire string or the length of a single word?

It's the constant MAX_TOKEN_LEN in tantivy, so I think it refers to a single word (token).

sunby commented 3 days ago

I will run some tests to check how this affects queries that use the inverted index.

sunby commented 3 days ago

/assign

xiaofan-luan commented 3 days ago

How can a token be that long?

We do need to tune the max length of the varchar field. Is there anything stopping us from increasing varchar to 256K or 1M?

yanliang567 commented 3 days ago

/assign @sunby /unassign

sunby commented 3 days ago

> How can a token be that long?
>
> We do need to tune the max length of the varchar field. Is there anything stopping us from increasing varchar to 256K or 1M?

We tested the inverted index with a string of length 65535 and this warning occurred.

sunby commented 2 days ago

I wrote a unit test to verify it. Strings longer than 65530 cannot be searched because tantivy drops them.

https://github.com/quickwit-oss/tantivy/blob/c71ea7b2effe6494c1a2b9e234db71f89255c06a/src/postings/postings_writer.rs#L139-L148
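To make the consequence concrete, here is a toy model of the behavior described above. It is a sketch, not tantivy's code: `ToyInvertedIndex` is a hypothetical class, the limit is the 65530 quoted in this thread, and measuring it in UTF-8 bytes is an assumption.

```python
from collections import defaultdict

MAX_TOKEN_LEN = 65530  # value quoted in this thread; assumed to be in bytes


class ToyInvertedIndex:
    """Toy inverted index over raw (untokenized) strings. Hypothetical model."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> list of doc ids
        self.dropped = []                  # doc ids whose term was skipped

    def add(self, doc_id, text):
        # The whole string is treated as the only token.
        if len(text.encode("utf-8")) > MAX_TOKEN_LEN:
            # Modeled behavior: the token is skipped and nothing is raised.
            self.dropped.append(doc_id)
            return
        self.postings[text].append(doc_id)

    def search(self, text):
        return self.postings.get(text, [])


index = ToyInvertedIndex()
ok, too_long = "x" * 65530, "x" * 65531
index.add(1, ok)
index.add(2, too_long)
print(index.search(ok))        # [1]
print(index.search(too_long))  # []
print(index.dropped)           # [2]
```

The insert of document 2 appears to succeed, yet an exact-match lookup for its value returns nothing, which matches the silent failure reported here.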

sunby commented 2 days ago

> How can a token be that long?
>
> We do need to tune the max length of the varchar field. Is there anything stopping us from increasing varchar to 256K or 1M?

We use the "raw" tokenizer, which means tantivy does no tokenization, so the whole string is a single token.
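A small sketch of why the token gets that long. The two functions below are illustrative stand-ins written for this comment, not tantivy's tokenizer implementations: one emits the input unchanged as a single token, the other splits on whitespace.

```python
MAX_TOKEN_LEN = 65530  # value quoted in this thread


def raw_tokenize(text):
    """Stand-in for a raw tokenizer: the whole input is one token."""
    return [text]


def whitespace_tokenize(text):
    """Stand-in for a word-splitting tokenizer."""
    return text.split()


# 20000 four-letter words joined by single spaces: 99999 characters in total.
doc = " ".join(["word"] * 20000)

longest_raw = max(len(t) for t in raw_tokenize(doc))
longest_split = max(len(t) for t in whitespace_tokenize(doc))

print(longest_raw, longest_raw > MAX_TOKEN_LEN)      # 99999 True
print(longest_split, longest_split > MAX_TOKEN_LEN)  # 4 False
```

With word splitting, no token comes close to the limit; with raw tokenization, the token length equals the string length, so any long varchar value can exceed it.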

xiaofan-luan commented 2 days ago

> > How can a token be that long? We do need to tune the max length of the varchar field. Is there anything stopping us from increasing varchar to 256K or 1M?
>
> We use the "raw" tokenizer, which means no tokenizer in tantivy.

This seems to be a non-blocking issue.

Is there anything blocking us from growing the varchar size to 256K? For example, do we store the size somewhere using a small number of bits?
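For reference, the arithmetic behind the question about a narrow size field. This thread does not establish whether any Milvus structure stores string lengths in 16 bits; the sketch only shows what a field of a given width could hold. `bits_needed` is a hypothetical helper.

```python
def bits_needed(max_len):
    """Smallest number of bits that can represent every length in 0..max_len."""
    return max(1, max_len.bit_length())


for label, max_len in [("65535", 65535), ("256K", 256 * 1024), ("1M", 1024 * 1024)]:
    print(f"max length {label}: {bits_needed(max_len)} bits")
```

A 16-bit field tops out at 65535, and the 65530 limit in this issue sits five below that. Lengths up to 256K need 19 bits and lengths up to 1M need 21, so any length stored in 16 bits would have to be widened first.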