opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] DocValuesField "my_field" appears more than once #7145

Open lucafrost opened 1 year ago

lucafrost commented 1 year ago

Describe the bug: When attempting to use the bulk API to update documents, the update is rejected with an illegal argument exception stating: DocValuesField "description.vector" appears more than once in this document (only one value is allowed per field).

To Reproduce: See the Python script below.

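import json

# client (an OpenSearch client), doc_id, and embeddings are assumed to be defined earlier.
action = { "update": { "_index": "my_index", "_id": doc_id }}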
source = {
    "doc": {
        "title.vector": embeddings[0].tolist(),
        "description.vector": embeddings[1].tolist(),
        "scrape.rawText.vector": embeddings[2].tolist()
    }
}

payload = ""

payload += json.dumps(action) + "\n"
payload += json.dumps(source) + "\n"

client.bulk(body=payload)
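
For readers unfamiliar with the bulk body format used above, here is a minimal sketch of how such a payload can be assembled for several partial updates at once. The helper name `build_bulk_update_body` and the sample IDs and vectors are hypothetical and not part of the original report; the sketch only builds the string and sends nothing to a cluster.

```python
import json


def build_bulk_update_body(index, updates):
    """Serialize {doc_id: partial_doc} into a newline-delimited bulk body.

    Each update contributes two lines: the action line, then the
    {"doc": ...} line holding the fields to merge into the document.
    """
    lines = []
    for doc_id, partial_doc in updates.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": partial_doc}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample data: short vectors stand in for real embeddings.
body = build_bulk_update_body(
    "my_index",
    {
        "doc-1": {"title.vector": [0.1, 0.2, 0.3]},
        "doc-2": {"title.vector": [0.4, 0.5, 0.6]},
    },
)
```

The resulting `body` could then be passed to `client.bulk(body=body)` as in the script above.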

Plugins: All of 'em; see AWS OSS Plugins. Full output from _cat/plugins is here.

Host/Environment (please complete the following information):

dtaivpp commented 1 year ago

Could you give a sample of what is in the embeddings[1-3]? That may help us work out what is going on.

lucafrost commented 1 year ago

hey @dtaivpp — appreciate the quick response!

embeddings[0-2] are 768-dimensional embeddings generated for semantic search to be used with the kNN plugin.

updated example below...

emb1 = [0, 1, 2, ... 768]
emb2 = [0, 1, 2, ... 768]

action = { "update": { "_index": "my_index", "_id": doc_id }}
source = {
    "doc": {
        "title.vector": emb1,
        "description.vector": emb2
    }
}

payload = ""
payload += json.dumps(action) + "\n"
payload += json.dumps(source) + "\n"

client.bulk(body=payload)

let me know if that helps at all? 🙇🏼‍♂️

dblock commented 1 year ago

Without looking at the problem, if anyone has time, it would be helpful to try to get to a smaller/simpler end-to-end repro and turn it into a failing unit test.
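
As a possible starting point for such a smaller repro, the sketch below shrinks the vectors to three dimensions and prepares the three request bodies a test would need: an index definition, a seed document, and the bulk partial update. Everything here is an assumption made for illustration. The thread does not show the real mapping of my_index, nor whether the existing documents store the vector in nested form (`{"title": {"vector": ...}}`) or under a dotted key, so the index name `repro_index`, the mapping, and the seed document are hypothetical.

```python
import json

DIM = 3  # assumption: tiny vectors instead of the 768 dimensions in the report

# Hypothetical index definition using the k-NN plugin's knn_vector field type.
# The real mapping of "my_index" is not shown in the thread.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "title": {
                "properties": {
                    "vector": {"type": "knn_vector", "dimension": DIM}
                }
            }
        }
    },
}

# Assumption: the existing document stores the vector in nested object form.
seed_doc = {"title": {"vector": [0.0] * DIM}}

# The partial update uses the dotted key, as in the original script.
partial_doc = {"title.vector": [1.0] * DIM}


def bulk_update_body(index, doc_id, doc):
    """Build the two-line bulk body for one partial update."""
    action = {"update": {"_index": index, "_id": doc_id}}
    return json.dumps(action) + "\n" + json.dumps({"doc": doc}) + "\n"


payload = bulk_update_body("repro_index", "1", partial_doc)
```

A test could create the index from `index_body`, index `seed_doc` under ID 1, send `payload` to the bulk API, and check the per-item response for the DocValuesField error. Swapping `seed_doc` for a dotted-key variant would show whether the document shape matters.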