vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Question: How does Vespa search across different schema types with different fields defined in each schema? #24267

Closed 107dipan closed 1 year ago

107dipan commented 1 year ago

We are trying to understand how Vespa performs search across multiple schemas with different fields defined in them. We have seen that Vespa throws a 404 exception whenever we search on a field that does not exist in a schema. But when we use a query like "select from where...", the field we are searching on can be absent from some of the schemas.

We wanted to confirm that we will get the 404 exception only when the field is not part of any of the schemas. Also, can we use searchers and document processors to find out whether a field is part of a defined schema?

kkraune commented 1 year ago

Hi,

In my experiments I get:

$ vespa query "select * from sources * where albums contains 'query in nonexist field'"
Error: invalid query: 400 Bad Request
{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 0
        },
        "errors": [
            {
                "code": 4,
                "summary": "Invalid query parameter",
                "message": "Could not create query from YQL: Field 'albums' does not exist.",
                "stackTrace": "java.lang.IllegalArgumentException: Field 'albums' does not exist.

This is what I get when the albums field is defined in one schema but not the other.
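For client code that needs to react to this error, a minimal sketch of pulling the offending field name out of the response could look like the following. The helper name `missing_fields` is hypothetical, and the regular expression assumes the message wording shown in the response above, which may differ between Vespa versions.

```python
import json
import re

# Assumption: the error message has the form "Field '<name>' does not exist",
# as in the response shown above. This is not a documented contract.
MISSING_FIELD = re.compile(r"Field '([^']+)' does not exist")


def missing_fields(response_body):
    """Return the field names reported as non-existent in a query response body."""
    root = json.loads(response_body).get("root", {})
    names = []
    for error in root.get("errors", []):
        match = MISSING_FIELD.search(error.get("message", ""))
        if match:
            names.append(match.group(1))
    return names


# A trimmed copy of the response above (stack trace omitted).
sample = json.dumps({
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {"totalCount": 0},
        "errors": [{
            "code": 4,
            "summary": "Invalid query parameter",
            "message": "Could not create query from YQL: Field 'albums' does not exist.",
        }],
    }
})

print(missing_fields(sample))
```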

https://docs.vespa.ai/en/federation.html is a great source for multi-schema documentation.

There is no easy API to inspect the config afaik, other than parsing the schemas yourself. I think the real question is that a query over multiple schemas kind of requires the schemas to have common elements to make sense. Maybe https://docs.vespa.ai/en/reference/schema-reference.html#alias can be useful for your app.
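As a rough illustration of "parsing the schemas yourself", the sketch below collects field names per schema with regular expressions. It is deliberately naive: it assumes each schema file starts with `schema <name>` and declares fields as `field <name> type ...`, and it does not distinguish document fields from struct members or fields declared outside the document block. The schema contents (`music`, `books`) and the function name `fields_by_schema` are made up for the example.

```python
import re

# Naive patterns, assuming "schema <name> {" and "field <name> type <type>" declarations.
SCHEMA = re.compile(r"^\s*schema\s+(\w+)", re.MULTILINE)
FIELD = re.compile(r"^\s*field\s+(\w+)\s+type\s+", re.MULTILINE)


def fields_by_schema(schema_texts):
    """Map each schema name to the set of field names declared in its text."""
    result = {}
    for text in schema_texts:
        name = SCHEMA.search(text)
        if name is None:
            continue  # not a schema file this sketch understands
        result[name.group(1)] = set(FIELD.findall(text))
    return result


# Hypothetical schema files, for illustration only.
music = """
schema music {
    document music {
        field artist type string {
            indexing: summary | index
        }
        field albums type string {
            indexing: summary | index
        }
    }
}
"""

books = """
schema books {
    document books {
        field author type string {
            indexing: summary | index
        }
    }
}
"""

index = fields_by_schema([music, books])
# Which schemas define the 'albums' field?
print(sorted(name for name, fields in index.items() if "albums" in fields))
```

In a real application the texts would be read from the `.sd` files in the application package; a proper parser would be more robust than these two patterns.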

107dipan commented 1 year ago

Hi,

We have deployed multiple schemas where a field can be part of one or more schemas. But we got this exception only when the field was not part of any of the schemas. I have added the response for that case.

{ "root": { "id": "toplevel", "relevance": 1.0, "fields": { "totalCount": 0 }, "errors": [ { "code": 4, "summary": "Invalid query parameter", "message": "Could not create query from YQL: Field 'verypropertyaddress_s' does not exist.", "stackTrace": "java.lang.IllegalArgumentException: Field 'verypropertyaddress_s' does not exist.\n\tat com.google.common.base.Preconditions.checkArgument(Preconditions.java:191)\n\tat com.yahoo.search.yql.YqlParser.getIndex(YqlParser.java:1762)\n\tat com.yahoo.search.yql.YqlParser.buildTermSearch(YqlParser.java:1172)\n\tat com.yahoo.search.yql.YqlParser.convertExpression(YqlParser.java:352)\n\tat com.yahoo.search.yql.YqlParser.buildTree(YqlParser.java:288)\n\tat com.yahoo.search.yql.YqlParser.parse(YqlParser.java:267)\n\tat com.yahoo.search.yql.MinimalQueryInserter.insertQuery(MinimalQueryInserter.java:95)\n\tat com.yahoo.search.yql.MinimalQueryInserter.search(MinimalQueryInserter.java:80)\n\tat com.yahoo.search.Searcher.process(Searcher.java:134)\n\tat com.yahoo.processing.execution.Execution.process(Execution.java:112)\n\tat com.yahoo.search.searchchain.Execution.search(Execution.java:514)\n\tat com.yahoo.prelude.searcher.FieldCollapsingSearcher.search(FieldCollapsingSearcher.java:101)\n\tat com.yahoo.search.Searcher.process(Searcher.java:134)\n\tat com.yahoo.processing.execution.Execution.process(Execution.java:112)\n\tat com.yahoo.search.searchchain.Execution.search(Execution.java:514)\n\tat com.yahoo.prelude.querytransform.PhrasingSearcher.search(PhrasingSearcher.java:60)\n\tat com.yahoo.search.Searcher.process(Searcher.java:134)\n\tat com.yahoo.processing.execution.Execution.process(Execution.java:112)\n\tat com.yahoo.search.searchchain.Execution.search(Execution.java:514)\n\tat com.yahoo.prelude.statistics.StatisticsSearcher.search(StatisticsSearcher.java:228)\n\tat com.yahoo.search.Searcher.process(Searcher.java:134)\n\tat com.yahoo.processing.execution.Execution.process(Execution.java:112)\n\tat 
com.yahoo.search.searchchain.Execution.search(Execution.java:514)\n\tat com.yahoo.search.querytransform.WeakAndReplacementSearcher.search(WeakAndReplacementSearcher.java:22)\n\tat com.yahoo.search.Searcher.process(Searcher.java:134)\n\tat com.yahoo.processing.execution.Execution.process(Execution.java:112)\n\tat com.yahoo.search.searchchain.Execution.search(Execution.java:514)\n\tat com.yahoo.search.handler.SearchHandler.searchAndFill(SearchHandler.java:457)\n\tat com.yahoo.search.handler.SearchHandler.search(SearchHandler.java:502)\n\tat com.yahoo.search.handler.SearchHandler.handleBody(SearchHandler.java:376)\n\tat com.yahoo.search.handler.SearchHandler.handle(SearchHandler.java:289)\n\tat com.yahoo.container.jdisc.ThreadedHttpRequestHandler.handle(ThreadedHttpRequestHandler.java:78)\n\tat com.yahoo.container.jdisc.ThreadedHttpRequestHandler.handleRequest(ThreadedHttpRequestHandler.java:89)\n\tat com.yahoo.container.jdisc.ThreadedRequestHandler$RequestTask.processRequest(ThreadedRequestHandler.java:191)\n\tat com.yahoo.container.jdisc.ThreadedRequestHandler$RequestTask.run(ThreadedRequestHandler.java:185)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n" } ] } }

kkraune commented 1 year ago

I am not sure what the problem is?

"Field 'verypropertyaddress_s' does not exist." is correct, yes?

107dipan commented 1 year ago

The field verypropertyaddress_s is not part of any of the schemas, and we are getting the exception, which is expected. But the behavior I wanted to confirm is what happens when we search on a field which is part of some schemas and not present in others. The behavior I have seen in these scenarios is that we don't get any exception. I just wanted to confirm this behavior with the Vespa team.

kkraune commented 1 year ago

"But the behavior I wanted to confirm was what happens when we search on field which is part of some schema and not present in some."

In my simple experiment above, I got "Error: invalid query: 400 Bad Request" - I think this is the correct behavior. Please provide an example we can reproduce if you think this is incorrect - thanks :-)

107dipan commented 1 year ago

Reproduction scenario: Define fields newFieldA_s and newFieldB_s as part of schema 1, and newFieldC_s and newFieldD_s as part of schema 2. Then use: select from sources where newFieldA_s contains x. This does not throw any exception.

baldersheim commented 1 year ago

Which Vespa version? We enabled stricter checking in Vespa 8, where we require the query to be valid for all document types. Earlier on we did best effort. Now we fail faster.
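To make the difference concrete, here is a toy model of the two checking modes as they are described in this thread. It is not Vespa's implementation: `check_query` and the schema names are hypothetical, and the best-effort branch only encodes the behavior reported above (an error only when no schema defines the field).

```python
def check_query(query_fields, schemas, strict):
    """Return the set of fields that would make the query invalid (empty means valid).

    schemas maps a schema name to the set of field names it defines.
    strict=True  models the stricter check: every field must exist in every schema.
    strict=False models best effort: a field is rejected only if no schema defines it.
    """
    wanted = set(query_fields)
    field_sets = list(schemas.values())
    if strict:
        return {f for f in wanted if not all(f in fields for fields in field_sets)}
    return {f for f in wanted if not any(f in fields for fields in field_sets)}


# Field layout from the reproduction scenario above; schema names are placeholders.
schemas = {
    "schema1": {"newFieldA_s", "newFieldB_s"},
    "schema2": {"newFieldC_s", "newFieldD_s"},
}

# A field present in only one schema: accepted by best effort, rejected by the strict check.
print(check_query(["newFieldA_s"], schemas, strict=False))
print(check_query(["newFieldA_s"], schemas, strict=True))

# A field present in no schema is rejected in both modes.
print(check_query(["verypropertyaddress_s"], schemas, strict=False))
```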

nehajatav commented 1 year ago

So for a query like below which expects ranking of documents across schemas, vespa 8 no longer allows querying across fields that are mutually exclusive across schemas? What is the rationale behind disabling this?

select * from sources schemaA, schemaB where field1SchemaA=apple or field2SchemaB=banana

kkraune commented 1 year ago

Thanks Henning for making the correction - that explains the difference

Rationale: It is more correct to fail a query to a source that cannot succeed correctly (querying a non-existing field), than silently ignoring it.

Personally, I think it makes more sense to align the different data sources in the schemas, so users can make meaningful queries to them, and meaningful query responses can be made. In cases where such alignment is hard to do, one can always add data to a more generic text content field (the same in all sources) for generic query recall. Also consider query rewrites: if the query uses specific fields, exclude the sources that don't have them (I guess that is why you asked "can we use the searchers and document processors to find if a field is part of a schema defined?" in the first place).

I hope this makes sense, and I understand that in some cases the stricter behavior can be more difficult. But for most users, we found that failing is better than best effort, which had other odd side effects.

Also, support on the most recent major Vespa version is much easier than on Vespa 7; it is hard to remember everything :-)
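The query-rewrite idea above (exclude the sources that lack the queried fields) could be sketched on the client side as follows. The function names, the schema names, and the field-to-schema mapping are assumptions for illustration; the `select * from sources a, b where ...` form is the one used in the queries quoted in this thread.

```python
def sources_with_fields(schemas, query_fields):
    """Return the sorted names of schemas that define every queried field."""
    needed = set(query_fields)
    return sorted(name for name, fields in schemas.items() if needed <= fields)


def restrict_sources(where_clause, schemas, query_fields):
    """Build a query string that targets only the schemas able to answer it."""
    sources = sources_with_fields(schemas, query_fields)
    if not sources:
        missing = ", ".join(sorted(set(query_fields)))
        raise ValueError("no schema defines all of: " + missing)
    return "select * from sources " + ", ".join(sources) + " where " + where_clause


# Hypothetical mapping, e.g. produced by parsing the deployed schema files.
schemas = {
    "schema1": {"newFieldA_s", "newFieldB_s"},
    "schema2": {"newFieldC_s", "newFieldD_s"},
}

print(restrict_sources("newFieldA_s contains 'x'", schemas, ["newFieldA_s"]))
# select * from sources schema1 where newFieldA_s contains 'x'
```

The same logic could live in a custom searcher inside the container instead of the client; this sketch only shows the source-selection step.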

107dipan commented 1 year ago

Hi, we started our Vespa evaluation with version 7 and planned to update the version once our initial POC was done. But we will also start using Vespa version 8 to understand all the differences.

I didn't quite understand whether this type of query will work: select * from sources schemaA, schemaB where field1SchemaA=apple or field2SchemaB=banana

My personal take is that since a user can specify the schemas they want to search on, any user who requires that can name the schemas in the search query, and users who need to search and rank over multiple schemas can use "*" or a comma-separated list. This would enable a lot of business use cases while keeping it simple and configurable for different users.

kkraune commented 1 year ago

On 8, the fields in the query must be in the sources; it will fail otherwise. On 7, best effort is used, and you can test it by using https://github.com/vespa-engine/sample-apps/tree/master/album-recommendation and two schemas.

OK, thanks all, I think this resolves the question, closing