jayaddison opened 1 year ago
Naively it seems like it might be possible to boost
the most significant ten query ingredients, while continuing to support an arbitrary number of additional ingredient terms. The challenge is that we rely on the scoring to sort (rank) the results, and particularly in the cases of sort-by-matches and sort-by-fewest-ingredients, accuracy -- meaning distinct boost
values for each query ingredient -- is required to keep the result ordering consistent.
(I think that support for multiple scores per document might provide a cleaner way to address those limitations, but that ability remains open as a pending feature request, and is a non-blocker)
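Roughly, the top-ten boosting idea above could look something like the following sketch (illustrative only: the significance() ranking function is a placeholder for something like term rarity, and the clause shapes are simplified, not our actual query-construction code):

def build_include_clauses(products, significance, distinct_slots=10):
    # Rank the query ingredients by some significance measure (placeholder),
    # give the top ten distinct power-of-ten boosts, and let the remainder
    # share a single boost value.
    ranked = sorted(products, key=significance, reverse=True)
    distinct, remainder = ranked[:distinct_slots], ranked[distinct_slots:]
    clauses = []
    for position, product in enumerate(distinct, start=1):
        clauses.append({"match": {"contents": {"query": product, "boost": 10.0 ** position}}})
    for product in remainder:
        # Shared boost: these terms still affect matching, but their individual
        # contributions can no longer be distinguished from the combined score --
        # which is exactly the ranking-consistency problem described above.
        clauses.append({"match": {"contents": {"query": product, "boost": 1.0}}})
    return clauses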
Although the number of ingredients in recipes shouldn't directly determine the choice of an upper number of ingredients that people can search for, it could be useful information to have as supporting context.
The most frequent ingredient counts in our database seem to fall between six and ten unique named ingredients per recipe:
api=> with ingredient_counts (recipe_id, ingredient_count) as (select recipe_id, count(distinct product_name_id) from recipe_ingredients group by recipe_id) select ingredient_count, count(*) from ingredient_counts group by ingredient_count order by count(*) desc limit 10;
ingredient_count | count
------------------+-------
8 | 7585
7 | 7273
9 | 7235
6 | 6786
10 | 6547
11 | 5679
5 | 5661
12 | 4656
4 | 4041
13 | 3461
(10 rows)
And in terms of recipes that contain the most unique-named ingredients: our upper limit -- and this probably includes a few outliers -- at the moment seems to be around thirty ingredients; these recipes generally involve a few sauces/condiments and therefore require a sizable list of herbs and spices:
api=> with ingredient_counts (recipe_id, ingredient_count) as (select recipe_id, count(distinct product_name_id) from recipe_ingredients group by recipe_id) select ingredient_count, count(*) from ingredient_counts group by ingredient_count order by ingredient_count desc limit 10;
ingredient_count | count
------------------+-------
30 | 1
29 | 1
28 | 3
27 | 3
26 | 6
25 | 14
24 | 18
23 | 36
22 | 45
21 | 93
(10 rows)
We can't guarantee that when a user searches for thirty ingredients they'll find a recipe that matches all of those, but as long as at least a few of the ingredients are popular ones, we should be able to return some results.
Next up I'll have a think about how we could analyze our search result logs -- that include a results_total
count to indicate how many results were found -- to see whether there are any clues in the correlation between number-of-ingredients-searched and number-of-recipes found.
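For example, something along these lines could work, assuming -- hypothetically -- that each search log entry is a JSON line containing an include ingredient list and a results_total count (the file name and field layout here are guesses, not our actual log format):

import json
from collections import defaultdict
from statistics import median

# Group results_total values by the number of ingredients searched for.
results_by_ingredient_count = defaultdict(list)
with open("search-results.log") as log:
    for line in log:
        entry = json.loads(line)
        results_by_ingredient_count[len(entry["include"])].append(entry["results_total"])

# Print the number of searches and the median result count for each query size.
for count in sorted(results_by_ingredient_count):
    totals = results_by_ingredient_count[count]
    print(count, len(totals), median(totals))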
I think I'd propose a five-step approach to resolving this:

1. Deploy a refinement translation resource so that an explanatory message can be displayed when fewer than all of the include query terms could be handled.
2. Add an initial upper-bound limit of 30 on the number of include ingredients handled by the API, to prevent errors from occurring and to improve the user experience. This should use the refinement message deployed in the previous step (a sketch of this guard follows the list below).
3. Ensure that the client's include selectbox does not accept more than the upper-bound limit number of ingredients. We use select2 on the client, and it looks like the maximumSelectionLength option should provide what we need there.
4. Rework the boost scores API-side, and reconfigure the include limit to 50.
5. Update the client's include limit to 50 correspondingly.

I doubt many people will search for fifty ingredients. Currently the client interface isn't really optimal for entering that many items. But it may be valid in some situations, and it's good to provide functionality above-and-beyond what we think is usually expected.
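As a sketch of the guard in step two (names invented here -- INCLUDE_LIMIT, the truncation behaviour and the refinement identifier are assumptions, not existing code):

INCLUDE_LIMIT = 30

def limit_include_terms(include_terms):
    # Accept queries within the limit unchanged.
    if len(include_terms) <= INCLUDE_LIMIT:
        return include_terms, []
    # Otherwise truncate rather than erroring out, and return a refinement
    # identifier so that the client can display the translated message
    # explaining that only the first INCLUDE_LIMIT ingredients were used.
    return include_terms[:INCLUDE_LIMIT], ["include-limit-exceeded"]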
Longer-term resolutions
It'd be nice to remove this upper-limit entirely, because it's arbitrary. At its core, the problem is that we're using a float
value as something like a bitset - a set of flags to indicate how each recipe (document) was scored during the search.
Perhaps we could file a request for a boost
datatype of Java double
in opensearch
, to provide a larger bitset size and increase the upper-bound. This could have performance implications (both CPU/mathematical operation performance, and memory usage for boost
scores in-memory on large document sets) though, so it may not be appealing as a search engine change.
Alternatively, we could refactor our code to avoid the bitset pattern entirely. I wasn't able to think of a way to do that using the existing functionality of either Elasticsearch or OpenSearch, however it is possible that something like opensearch-project/opensearch#3715 could help in future (there is an equivalent ticket open for Elasticsearch, too).
Edit: add refinement
translation resource deployment step (total: five steps, was: four).
Add an initial upper-bound limit of 30 on the number of include ingredients handled by the API ... I don't think that this should require any changes to the client code
Mmm.. not quite. We would need to deploy updated translation resources for the refinement
message, and currently that's not possible without redeploying the frontend
application.
The reason is that we construct an ingredient match
boost
based on a power-of-ten and the numbered position of each ingredient within the query (and in fact we double that during exact-match searching).
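For reference, the current scheme amounts to something like this (a minimal sketch of the behaviour described above, not the actual query-construction code):

def ingredient_boosts(products):
    clauses = []
    for position, product in enumerate(products, start=1):
        base_boost = 10.0 ** position  # power-of-ten keyed to query position
        # Partial match against the recipe contents.
        clauses.append({"match": {"contents": {"query": product, "boost": base_boost}}})
        # Exact match against the named ingredient: the boost is doubled.
        clauses.append({"match": {"ingredients.product.singular": {"query": product, "boost": base_boost * 2}}})
    return clauses

# Around the 39th position the boost exceeds the maximum single-precision
# float value (~3.4 * 10**38), which is where the failure appears.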
Instead of multiplexing multiple results into a single float-valued score, I think an alternative approach may be possible: request a dynamic script-based field output value from the search engine at runtime -- an ordered multi-valued integer field, with 0 representing 'not found', 1 representing 'found, inexact', and 2 representing 'found, exact'.
This means that the query would adjust dynamically based on the input query terms (the list of ingredients to include). That's OK, and that's already the case - the boolean match clauses in each query already vary based on the user's query terms, for example.
This approach should also maintain the property that the sort-method script is static -- it shouldn't require a dynamic script for each query. I think that's likely to be a performance benefit, because it may mean that the JVM that evaluates the scoring can re-use the same already-compiled code module when evaluating each query's results.
This may allow simplifying the constant_score
to simply return a fixed value for each document -- essentially indicating that no scoring is required (todo: does that provide a performance boost? or is it redundant to do so?).
Request a dynamic script-based field output value from the search engine at runtime: it should be an ordered multi-valued integer field, with values 0 representing 'not found', 1 representing 'found, inexact', and 2 representing 'found, exact'.
The following derived
field in OpenSearch should more-or-less achieve this:
"derived": {
"_found": {
"type": "long",
"script": {
"source": """
for (product in params.products) {
long score = 0 ;
if (params._source['contents'].contains(product)) {
score = 1;
for (ingredient in params._source['ingredients']) {
if (ingredient.product.singular == product) score = 2;
}
}
emit(score);
}
""",
"params": {"products": ["garlic", "onion", "tomato"]}
}
}
},
"fields": ["_found"],
It's not performance-optimized, and even with tuning it's possible that it will perform worse than the existing constant_score
-based approach (since most of the values required for that are produced ahead-of-time and can be evaluated against index-level comparisons by the search engine).
The results of the above appear as:
curl -s -H 'Content-Type: application/json' -XPOST http://192.168.100.1:9200/recipes/_search --data @query.json | json_pp | grep found -C 5
"src" : "https://www.cookincanuck.com/easy-braised-pork-chop-recipe-with-tomatoes-olives/",
"time" : 25,
"title" : "Braised Pork Chops with Tomatoes & Olives"
},
"fields" : {
"_found" : [
1,
2,
1
]
}
--
"src" : "http://tastykitchen.com/recipes/main-courses/indian-style-lentils/",
"time" : 60,
"title" : "Indian-style Lentils"
},
"fields" : {
"_found" : [
2,
2,
2
]
}
--
"src" : "http://www.bbcgoodfood.com/recipes/420632/onepan-prawn-and-tomato-curry",
"time" : 30,
"title" : "One-pan prawn & tomato curry"
},
"fields" : {
"_found" : [
1,
2,
1
]
}
--
"src" : "http://tastykitchen.com/recipes/main-courses/mediterranean-zucchini-tomato-and-chicken-skillet/",
"time" : 40,
"title" : "Mediterranean Zucchini, Tomato and Chicken Skillet"
},
"fields" : {
"_found" : [
2,
1,
1
]
}
--
"src" : "https://prepareandnourish.com/easy-chicken-scalloped-potatoes/",
"time" : 80,
"title" : "Chicken Potato Casserole (Easy Chicken Potato Bake)"
},
"fields" : {
"_found" : [
1,
1,
2
]
}
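For illustration, a sort over those _found arrays would compute something like the following -- shown here client-side in a sketch, although the intention is for the equivalent logic to live in a (static) sort script on the search engine side:

def sort_key(found):
    # 0 = not found, 1 = inexact match, 2 = exact match (per query ingredient).
    matches = sum(1 for value in found if value > 0)
    exact = sum(1 for value in found if value == 2)
    return (matches, exact)

# _found values taken from the sample results above.
documents = {
    "Indian-style Lentils": [2, 2, 2],
    "Braised Pork Chops with Tomatoes & Olives": [1, 2, 1],
    "Chicken Potato Casserole (Easy Chicken Potato Bake)": [1, 1, 2],
}
for title in sorted(documents, key=lambda title: sort_key(documents[title]), reverse=True):
    print(title, documents[title])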
Describe the bug Searching using the recipe search APIs (either search or explore) for more than thirty-eight ingredients currently fails.
The reason is that we construct an ingredient match boost based on a power-of-ten and the numbered position of each ingredient within the query (and in fact we double that during exact-match searching).
The search engine (currently opensearch, was elasticsearch) is written in Java and expects the boost to be a floating point value, with a supported value range approximately up to 10^38.
Strictly speaking we don't need to use base-10 here. We need enough numeral-distance between each boost value to disambiguate between exact and partial matches as documented here, but I think we could achieve that in base-4 or something. The question is whether the (already complex) logic to implement that is worthwhile.
If my calculations are correct, using a base-4 representation for the boost value would allow us up to 64 ingredients, and base-3 would allow up to 80.
Alternatively we could simply decide that some arbitrary number of ingredients is a sensible upper limit. Thirty or so seems like it may be reasonable.
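A back-of-envelope check of those figures, assuming the boost for the n-th ingredient is 2 * base**n (the doubling covering exact matches) and that the ceiling is the single-precision float maximum:

import math

FLOAT_MAX = 3.4028235e38  # approximate java.lang.Float.MAX_VALUE

for base in (10, 4, 3):
    max_position = math.floor(math.log(FLOAT_MAX / 2, base))
    print(f"base {base}: up to {max_position} ingredients")

# Prints roughly 38, 63 and 80 -- within one of the figures quoted above,
# depending on exactly where the starting index and the doubling are applied.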
To Reproduce Steps to reproduce the behavior:
Expected behavior Results should be returned based on all query ingredients -- or with an explanatory refinement message if the query omitted some of the ingredients.
Screenshots N/A