vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Incorrect response while querying array in vespa #22321

Closed 107dipan closed 2 years ago

107dipan commented 2 years ago

Describe the bug We have added an array-of-struct field to our Vespa schema. The struct has a string key and a string value. We want to write a query in which the key and the value both match within the same array element. Please let us know if the behavior below is expected.

{ "yql": "select * from sources * where weakAnd(field_s contains sameElement(key contains \"wildCardKeyp_s\", value contains \"wildCardFieldValueg\"),field_s contains sameElement(key contains \"wildCardKeyp_s\", value contains \"wildCardFieldValuev\"));", "timeout": "120s", "hits": 50 }

But the response payload contains documents where the key and value are not part of the same struct. Example: "field_s": { "wildCardKeym_s": "wildCardFieldValuel", "wildCardKeyb_s": "wildCardFieldValuea", "wildCardKeyp_s": "wildCardFieldValuer", "wildCardKeys_s": "wildCardFieldValued", "wildCardKeyw_s": "wildCardFieldValuet", "wildCardKeyy_s": "wildCardFieldValuel", "wildCardKeyj_s": "wildCardFieldValuep" }

Expected behavior We want to retrieve documents that have the array element {key: wildCardKeyp_s, value: wildCardFieldValueg} or {key: wildCardKeyp_s, value: wildCardFieldValuev}.

Environment (please complete the following information):

jobergum commented 2 years ago

I'm not sure what you expect from using weakAnd here. Can you simplify this report by answering:

A) What does a single sameElement operator do, and does it match your expectations? B) What does sameElement() or sameElement() do?

And how is the array of struct field defined?

107dipan commented 2 years ago

    struct wildcard_string {
        field key type string { }
        field value type string { }
    }

    field field_s type array<wildcard_string> {
        indexing: summary
        struct-field key {
            indexing: attribute
            attribute: fast-search
        }
        struct-field value {
            indexing: attribute
            attribute: fast-search
        }
    }

A) I want to retrieve documents where the key and value are in the same struct. For example, if I search with field_s contains sameElement(key contains x, value contains y), it should only match documents that have an array element {key: x, value: y}.

B) I want to retrieve documents that contain either {key: x, value: y} or {key: x, value: z}.
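The difference between the two expectations can be sketched in plain Python. This is only an illustration of the intended semantics, not Vespa's implementation: the function names are hypothetical, and `contains` is simplified to exact string equality.

```python
from typing import Dict, List

Struct = Dict[str, str]


def same_element_match(elements: List[Struct], key: str, value: str) -> bool:
    """True only if one single element carries both the key and the value."""
    return any(e.get("key") == key and e.get("value") == value for e in elements)


def independent_match(elements: List[Struct], key: str, value: str) -> bool:
    """True if the key and the value each occur somewhere, possibly in different elements."""
    has_key = any(e.get("key") == key for e in elements)
    has_value = any(e.get("value") == value for e in elements)
    return has_key and has_value


# Hypothetical document: key "x" and value "y" exist, but in different elements.
doc = [{"key": "x", "value": "a"}, {"key": "w", "value": "y"}]

print(same_element_match(doc, "x", "y"))  # False
print(independent_match(doc, "x", "y"))   # True
```

Case B is then the logical or of two `same_element_match` calls, one per wanted pair.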

jobergum commented 2 years ago

Yes, that is what sameElement does. Please try, as I said:

1) Using one sameElement and checking whether that matches your expectations. 2) Using two sameElement operators connected with a logical or.

I'm not sure weakAnd is what you want here. weakAnd is a very special query operator, described in https://docs.vespa.ai/en/using-wand-with-vespa.html, intended for text search matching.
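A small Python sketch of step 2, building a request body that joins several sameElement conditions with a logical or. The helper names are hypothetical, and the sketch assumes the keys and values contain no double quotes (no escaping is done).

```python
import json
from typing import Dict, List, Tuple


def same_element(field: str, key: str, value: str) -> str:
    """Build one sameElement condition for an array-of-struct field."""
    return f'{field} contains sameElement(key contains "{key}", value contains "{value}")'


def build_or_query(field: str, pairs: List[Tuple[str, str]], hits: int = 50) -> Dict[str, object]:
    """Join one sameElement condition per (key, value) pair with a logical or."""
    where = " or ".join(same_element(field, k, v) for k, v in pairs)
    return {"yql": f"select * from sources * where {where}", "hits": hits}


query = build_or_query("field_s", [("x", "y"), ("x", "z")])
print(json.dumps(query, indent=2))
```

Each pair gets its own sameElement, so a document matches if at least one of its array elements equals one of the requested pairs.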

107dipan commented 2 years ago

{ "yql": "select * from sources * where field_s contains sameElement(key contains \"wildCardKeyb_s\", value contains \"wildCardFieldValuev\") OR field_s contains sameElement(key contains \"wildCardKeyb_s\", value contains \"wildCardFieldValueo\");", "timeout": "120s", "hits": 50 }

The response had this document: "field_s": { "wildCardKeyl_s": "wildCardFieldValues", "wildCardKeyb_s": "wildCardFieldValuen", "wildCardKeyo_s": "wildCardFieldValuee", "wildCardKeyj_s": "wildCardFieldValueo", "wildCardKeyy_s": "wildCardFieldValues", "wildCardKeyi_s": "wildCardFieldValuey", "wildCardKeye_s": "wildCardFieldValuem", "wildCardKeyg_s": "wildCardFieldValuer" },

{ "yql": "select * from sources * where field_s contains sameElement(key contains \"wildCardKeyb_s\", value contains \"wildCardFieldValuev\");", "timeout": "120s", "hits": 50 }

Documents in the response: "field_s": { "wildCardKeyl_s": "wildCardFieldValues", "wildCardKeyb_s": "wildCardFieldValuen", "wildCardKeyo_s": "wildCardFieldValuee", "wildCardKeyj_s": "wildCardFieldValueo", "wildCardKeyy_s": "wildCardFieldValues", "wildCardKeyi_s": "wildCardFieldValuey", "wildCardKeye_s": "wildCardFieldValuem", "wildCardKeyg_s": "wildCardFieldValuer" }

jobergum commented 2 years ago

Is it possible to avoid example names that look like spelling mistakes, such as wildCardKeyy_s? Formatting the JSON would also help.

Here is the system test for searching in array of struct using the sameElement query operator.

If you are able to create a reproducible setup it would be great.

107dipan commented 2 years ago

Hey Joe,

We used random field names to populate the arrays. The steps to reproduce would be:

  1. Deploy a Vespa application with an array of structs.

    struct wildcard_string {
        field key type string { }
        field value type string { }
    }

    field field_s type array<wildcard_string> {
        indexing: summary
        struct-field key {
            indexing: attribute
            attribute: fast-search
        }
        struct-field value {
            indexing: attribute
            attribute: fast-search
        }
    }

  2. Try searching using YQL queries similar to the ones mentioned above.

jobergum commented 2 years ago

Again, our tests demonstrate that this works as intended, so it's up to you to demonstrate how to reproduce what you observe.

Here is the system test for searching in array of struct using the sameElement query operator.

Feed file: https://github.com/vespa-engine/system-test/blob/master/tests/search/struct_and_map_types/docs_search.json

Tests: https://github.com/vespa-engine/system-test/blob/master/tests/search/struct_and_map_types/struct_and_map_types.rb#L54

nehajatav commented 2 years ago

@jobergum So we realised that sameElement works perfectly. The confusion was caused by the search response garbling the struct.

Ingest payload

my_favourite_food_array_field: [
{ "key":"mango", 
"value":"ladyfinger"},
{ "key":"grapes", 
"value":"chickpea"},
{ "key":"apple", 
"value":"beans"} ]

Search result: Note how the struct field names are removed altogether and the mapping between keys and values is messed up. Are key and value special keywords?

my_favourite_food_array_field: 
{ 
"mango":"beans",
"grapes":"ladyfinger",
"apple":"chickpea"}

jobergum commented 2 years ago

Note that matched-elements-only will display only the elements that matched the query: https://docs.vespa.ai/en/reference/schema-reference.html#summary

nehajatav commented 2 years ago

@jobergum We are not using matched-elements-only, if that's what you mean in the comment above.

jobergum commented 2 years ago

Yes. Feel free to produce a simple schema, a sample document, and a query that reproduces the behavior.

107dipan commented 2 years ago

@jobergum Schema defined ->

    struct wildcard_string {
        field key type string { }
        field value type string { }
    }

    field field_s type array<wildcard_string> {
        indexing: summary
        struct-field key {
             indexing: attribute

       }

       struct-field value {
            indexing: attribute

       }
    }

Doc ingestion payload - { "fields": { "field_s": [ { "key": "fruit", "value": "apple" }, { "key": "fruit", "value": "banana" }, { "key": "fruit", "value": "orange" }, { "key": "food", "value": "burger" }, { "key": "food", "value": "pizza" }, { "key": "food", "value": "pasta" } ], "isNewWildCardFieldAdded": true, "author": [ "person" ], "language": "English", "table": "tableName" } }

Search with docId - { "pathId": "/document/v1/namespace/documentType/docid/701683489", "id": "id:namespace:documentType::701683489", "fields": { "language": "English", "table": "tableName", "eventualIndex": false, "field_s": [ { "value": "apple", "key": "fruit" }, { "value": "banana", "key": "fruit" }, { "value": "orange", "key": "fruit" }, { "value": "burger", "key": "food" }, { "value": "pizza", "key": "food" }, { "value": "pasta", "key": "food" } ], "isNewWildCardFieldAdded": true, "author": [ "person" ] } }

yql search with one sameElement { "root": { "id": "toplevel", "relevance": 1.0, "fields": { "totalCount": 1 }, "coverage": { "coverage": 100, "documents": 6414070, "full": true, "nodes": 228, "results": 38, "resultsFull": 38 }, "children": [ { "id": "id:namespace:documentType::701683489", "relevance": 0.0, "source": "lexdoc", "fields": { "sddocname": "documentType", "documentid": "id:namespace:documentType::701683489", "author": [ "person" ], "language": "English", "table": "tableName", "field_s": { "fruit": "orange", "food": "pasta" }, "eventualIndex": false, "isNewWildCardFieldAdded": true } } ] } }

kkraune commented 2 years ago

I have reproduced the issue, using https://github.com/vespa-engine/sample-apps/tree/master/album-recommendation and added to music.sd:

        struct wildcard_string {
            field key type string { }
            field value type string { }
        }

        field field_s type array<wildcard_string> {
            indexing: summary
            struct-field key {
                 indexing: attribute
            }
            struct-field value {
                indexing: attribute
            }
        }

I ingested this file:

{
  "put": "id:mynamespace:music::fruits",
  "fields": {
    "field_s": [
      {
        "key": "fruit",
        "value": "apple"
      },
      {
        "key": "fruit",
        "value": "banana"
      },
      {
        "key": "fruit",
        "value": "orange"
      },
      {
        "key": "food",
        "value": "burger"
      },
      {
        "key": "food",
        "value": "pizza"
      },
      {
        "key": "food",
        "value": "pasta"
      }
    ]
  }
}

Query:

$ vespa query "select * from music where true"
{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 1
        },
        "coverage": {
            "coverage": 100,
            "documents": 1,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "id:mynamespace:music::fruits",
                "relevance": 0.0,
                "source": "music",
                "fields": {
                    "sddocname": "music",
                    "documentid": "id:mynamespace:music::fruits",
                    "field_s": {
                        "fruit": "orange",
                        "food": "pasta"
                    }
                }
            }
        ]
    }
}

Note that for field_s we get only the last element for fruit and the last element for food: it behaves like a map.
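The observed output is what you get when an array of key/value structs is folded into a map where later entries overwrite earlier ones. The Python sketch below only mimics that effect on the data from this experiment; `collapse_to_map` is a hypothetical name and this is not Vespa's rendering code.

```python
from typing import Dict, List


def collapse_to_map(elements: List[Dict[str, str]]) -> Dict[str, str]:
    """Fold an array of {key, value} structs into a map; duplicate keys are overwritten."""
    rendered: Dict[str, str] = {}
    for element in elements:
        rendered[element["key"]] = element["value"]  # a later duplicate replaces the earlier value
    return rendered


field_s = [
    {"key": "fruit", "value": "apple"},
    {"key": "fruit", "value": "banana"},
    {"key": "fruit", "value": "orange"},
    {"key": "food", "value": "burger"},
    {"key": "food", "value": "pizza"},
    {"key": "food", "value": "pasta"},
]

print(collapse_to_map(field_s))  # {'fruit': 'orange', 'food': 'pasta'}
```

Six array elements shrink to two map entries, which matches the field_s value in the response above.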

I then repeated the experiment, but this time I used kkey and vvalue:

        struct wildcard_string {
            field kkey type string { }
            field vvalue type string { }
        }

        field field_s type array<wildcard_string> {
            indexing: summary
            struct-field kkey {
                 indexing: attribute
            }
            struct-field vvalue {
                indexing: attribute
            }
        }

{
  "put": "id:mynamespace:music::fruits",
  "fields": {
    "field_s": [
      {
        "kkey": "fruit",
        "vvalue": "apple"
      },
      {
        "kkey": "fruit",
        "vvalue": "banana"
      },
      {
        "kkey": "fruit",
        "vvalue": "orange"
      },
      {
        "kkey": "food",
        "vvalue": "burger"
      },
      {
        "kkey": "food",
        "vvalue": "pizza"
      },
      {
        "kkey": "food",
        "vvalue": "pasta"
      }
    ]
  }
}

$ vespa query "select * from music where true"
{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 1
        },
        "coverage": {
            "coverage": 100,
            "documents": 1,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "id:mynamespace:music::fruits",
                "relevance": 0.0,
                "source": "music",
                "fields": {
                    "sddocname": "music",
                    "documentid": "id:mynamespace:music::fruits",
                    "field_s": [
                        {
                            "kkey": "fruit",
                            "vvalue": "apple"
                        },
                        {
                            "kkey": "fruit",
                            "vvalue": "banana"
                        },
                        {
                            "kkey": "fruit",
                            "vvalue": "orange"
                        },
                        {
                            "kkey": "food",
                            "vvalue": "burger"
                        },
                        {
                            "kkey": "food",
                            "vvalue": "pizza"
                        },
                        {
                            "kkey": "food",
                            "vvalue": "pasta"
                        }
                    ]
                }
            }
        ]
    }
}

This time, we get all 6 elements in the field_s field.

Are key and value special keywords?

I think you are spot on, @nehajatav . I think this is due to the way https://docs.vespa.ai/en/reference/schema-reference.html#type:map is implemented. I will have @geirst comment on this - it could be that using an array of key,value is a restriction and should be documented.

I will update once @geirst has chimed in - but you now know that not using key/value as names is a workaround, unless you want this to behave as a map ...
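If existing feed files already use key and value, the workaround means renaming the struct fields in the feed as well as in the schema. A minimal Python sketch, assuming the kkey/vvalue names from the experiment above (the helper name `rename_struct_fields` is hypothetical):

```python
from typing import Dict, List


def rename_struct_fields(elements: List[Dict[str, str]], mapping: Dict[str, str]) -> List[Dict[str, str]]:
    """Return a copy of the array with struct field names replaced according to mapping."""
    return [
        {mapping.get(name, name): val for name, val in element.items()}
        for element in elements
    ]


feed = [{"key": "fruit", "value": "apple"}, {"key": "food", "value": "pizza"}]
renamed = rename_struct_fields(feed, {"key": "kkey", "value": "vvalue"})
print(renamed)
# [{'kkey': 'fruit', 'vvalue': 'apple'}, {'kkey': 'food', 'vvalue': 'pizza'}]
```

The schema's struct definition and its struct-field entries have to use the same new names, and queries using sameElement must reference them too.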

kkraune commented 2 years ago

@arnej27959 I think this behavior changes with Vespa 8. Maybe we can configure Vespa for the new behavior now? Or are there other things we should know?

arnej27959 commented 2 years ago

To render maps as JSON maps, use renderer.json.jsonMaps as a query property.

kkraune commented 2 years ago

Thanks! Here we use an array of struct, and I think we want the response to contain the array with all its elements, not have it interpreted as a map with only unique keys.

arnej27959 commented 2 years ago

Aha, it’s the opposite problem. Indeed, an array of objects with «key» and «value» fields is recognized as a map. So key/value are reserved words here.

107dipan commented 2 years ago

Thanks! Will define the schema with different key and value names.

kkraune commented 2 years ago

I have added this to documentation in https://github.com/vespa-engine/documentation/pull/2041 - thanks for finding this!