openculinary / api

The RecipeRadar API provides data services to the frontend application
GNU Affero General Public License v3.0
10 stars 3 forks source link

Unable to search for more than 38 ingredients #114

Open jayaddison opened 1 year ago

jayaddison commented 1 year ago

Describe the bug Searching for more than thirty-eight ingredients using the recipe search APIs (either search or explore) for more than thirty-eight ingredients currently fails.

The reason is that we construct an ingredient match boost based on a power-of-ten and the numbered position of each ingredient within the query (and in fact we double that during exact-match searching).

The search engine (currently opensearch, was elasticsearch) is written in Java and expects the boost to be an floating point value, with a supported value range approximately up to 10e38.

Strictly speaking we don't need to use base-10 here. We need enough numeral-distance between each boost value to disambiguate between exact and partial matches as documented here, but I think we could achieve that in base-4 or something. The question is whether the (already complex) logic to implement that is worthwhile.

If my calculations are correct, using a base-4 representation for the boost value would allow us up to 64 ingredients, and base-3 would allow up to 80.

Alternatively we could simply decide that some arbitrary number of ingredients is a sensible upper limit. Thirty or so seems like it may be reasonable.

To Reproduce Steps to reproduce the behavior:

  1. Visit the homepage.
  2. Add thirty eight or more ingredients to include.
  3. Click the search button.
  4. Find no results / notice an HTTP 500 response from the server.

Expected behavior Results should be returned based on all query ingredients -- or with an explanatory refinement message if the query omitted some of the ingredients.

Screenshots N/A

jayaddison commented 1 year ago

Naively it seems like it might be possible to boost the most significant ten query ingredients, while continuing to support an arbitrary number of additional ingredient terms. The challenge is that we rely on the scoring to sort (rank) the results, and particularly in the cases of sort-by-matches and sort-by-fewest-ingredients, accuracy -- meaning distinct boost values for each query ingredient -- is required to keep the result ordering consistent.

jayaddison commented 1 year ago

Naively it seems like it might be possible to boost the most significant ten query ingredients, while continuing to support an arbitrary number of additional ingredient terms. The challenge is that we rely on the scoring to sort (rank) the results, and particularly in the cases of sort-by-matches and sort-by-fewest-ingredients, accuracy -- meaning distinct boost values for each query ingredient -- is required to keep the result ordering consistent.

(I think that support for multiple scores per document might provide a cleaner way to address those limitations, but that ability remains open as a pending feature request, and is a non-blocker)

jayaddison commented 1 year ago

Although the number of ingredients in recipes shouldn't directly determine the choice of an upper number of ingredients that people can search for, it could be useful information to have as supporting context.

The most frequent ingredient counts in our database seem to be for between 6-10 unique named ingredients per recipe:

api=> with ingredient_counts (recipe_id, ingredient_count) as (select recipe_id, count(distinct product_name_id) from recipe_ingredients group by recipe_id) select ingredient_count, count(*) from ingredient_counts group by ingredient_count order by count(*) desc limit 10;
 ingredient_count | count 
------------------+-------
                8 |  7585
                7 |  7273
                9 |  7235
                6 |  6786
               10 |  6547
               11 |  5679
                5 |  5661
               12 |  4656
                4 |  4041
               13 |  3461
(10 rows)

And in terms of recipes that contain the most unique-named ingredients: our upper limit -- and this probably includes a few outliers -- at the moment seems to be around thirty ingredients; these recipes generally involve a few sauces/condiments and therefore require a sizable list of herbs and spices:

api=> with ingredient_counts (recipe_id, ingredient_count) as (select recipe_id, count(distinct product_name_id) from recipe_ingredients group by recipe_id) select ingredient_count, count(*) from ingredient_counts group by ingredient_count order by ingredient_count desc limit 10;
 ingredient_count | count 
------------------+-------
               30 |     1
               29 |     1
               28 |     3
               27 |     3
               26 |     6
               25 |    14
               24 |    18
               23 |    36
               22 |    45
               21 |    93
(10 rows)

We can't guarantee that when a user searches for thirty ingredients that they'll find a recipe that matches all of those, but as long as least a few of the ingredients are popular ones, we should be able to return some results.

Next up I'll have a think about how we could analyze our search result logs -- that include a results_total count to indicate how many results were found -- to see whether there are any clues in the correlation between number-of-ingredients-searched and number-of-recipes found.

jayaddison commented 1 year ago

I think I'd propose a five-step approach to resolving this:

  1. Deploy an updated version of the client that includes a refinement message to indicate that not all include query terms could be handled.
  2. Add an initial upper-bound limit of 30 on the number of include ingredients handled by the API, to prevent errors from occurring and improving the user experience. This should use the refinement message deployed in the previous step.
  3. Adjust the client so that the include selectbox does not accept more than the upper-bound limit number of ingredients. We use select2 on the client, and it looks like the maximumSelectionLength option should provide what we need there.
  4. Switch to a base-4 representation of boost scores API-side, and reconfigure the include limit to 50.
  5. Adjust the client to boost the include limit to 50 correspondingly.

I doubt many people will search for fifty ingredients. Currently the client interface isn't really optimal for entering that many items. But it may be valid in some situations, and it's good to provide functionality above-and-beyond what we think is usually expected.

Longer-term resolutions It'd be nice to remove this upper-limit entirely, because it's arbitrary. At its core, the problem is that we're using a float value as something like a bitset - a set of flags to indicate how each recipe (document) was scored during the search.

Perhaps we could file a request for a boost datatype of Java double in opensearch, to provide a larger bitset size and increase the upper-bound. This could have performance implications (both CPU/mathematical operation performance, and memory usage for boost scores in-memory on large document sets) though, so it may not be appealing as a search engine change.

Alternatively, we could refactor our code to avoid the bitset pattern entirely. I wasn't able to think of a way to do that using the existing functionality of either Elasticsearch or OpenSearch, however it is possible that something like opensearch-project/opensearch#3715 could help in future (there is an equivalent ticket open for Elasticsearch, too).

Edit: add refinement translation resource deployment step (total: five steps, was: four).

jayaddison commented 1 year ago

Add an initial upper-bound limit of 30 on the number of include ingredients handled by the API ... I don't think that this should require any changes to the client code

Mmm.. not quite. We would need to deploy updated translation resources for the refinement message, and currently that's not possible without redeploying the frontend application.

jayaddison commented 2 months ago

The reason is that we construct an ingredient match boost based on a power-of-ten and the numbered position of each ingredient within the query (and in fact we double that during exact-match searching).

Instead of multiplexing multiple results into a single float-valued score, I think an alternative approach may be possible:

This means that the query would adjust dynamically based on the input query terms (the list of ingredients to include). That's OK, and that's already the case - the boolean match clauses in each query already vary based on the user's query terms, for example.

This approach should also maintain the property that the sort-method script is static -- it shoudln't require a dynamic script for each query. I think that's likely to be a performance benefit, because it may mean that the JVM that evaluates the scoring can re-use the same already-compiled code module when evaluating each query's results.

This may allow simplifying the constant_score to simply return a fixed value for each document -- essentially indicating that no scoring is required (todo: does that provide a performance boost? or is it redundant to do so?).

jayaddison commented 2 months ago

Request a dynamic script-based field output value from the search engine at runtime: it should be an ordered multi-valued integer field, with values 0 representing 'not found', 1 representing 'found, inexact', and 2 representing 'found, exact'.

The following derived field in OpenSearch should more-or-less achieve this:

    "derived": {
        "_found": {
            "type": "long",
            "script": {
                "source": """
    for (product in params.products) {
        long score = 0 ;
        if (params._source['contents'].contains(product)) {
            score = 1;
            for (ingredient in params._source['ingredients']) {
                if (ingredient.product.singular == product) score = 2;
            }
        }
        emit(score);
   }
   """,
                "params": {"products": ["garlic", "onion", "tomato"]}
            }
        }
    },
    "fields": ["_found"],

It's not performance-optimized, and even with tuning it's possible that it will perform worse than the existing constant_score-based approach (since most of the values required for that are produced ahead-of-time and can be evaluated against index-level comparisons by the search engine).

The results of the above appear as:

curl -s -H 'Content-Type: application/json' -XPOST http://192.168.100.1:9200/recipes/_search --data @query.json | json_pp | grep found -C 5
               "src" : "https://www.cookincanuck.com/easy-braised-pork-chop-recipe-with-tomatoes-olives/",
               "time" : 25,
               "title" : "Braised Pork Chops with Tomatoes & Olives"
            },
            "fields" : {
               "_found" : [
                  1,
                  2,
                  1
               ]
            }
--
               "src" : "http://tastykitchen.com/recipes/main-courses/indian-style-lentils/",
               "time" : 60,
               "title" : "Indian-style Lentils"
            },
            "fields" : {
               "_found" : [
                  2,
                  2,
                  2
               ]
            }
--
               "src" : "http://www.bbcgoodfood.com/recipes/420632/onepan-prawn-and-tomato-curry",
               "time" : 30,
               "title" : "One-pan prawn & tomato curry"
            },
            "fields" : {
               "_found" : [
                  1,
                  2,
                  1
               ]
            }
--
               "src" : "http://tastykitchen.com/recipes/main-courses/mediterranean-zucchini-tomato-and-chicken-skillet/",
               "time" : 40,
               "title" : "Mediterranean Zucchini, Tomato and Chicken Skillet"
            },
            "fields" : {
               "_found" : [
                  2,
                  1,
                  1
               ]
            }
--
               "src" : "https://prepareandnourish.com/easy-chicken-scalloped-potatoes/",
               "time" : 80,
               "title" : "Chicken Potato Casserole (Easy Chicken Potato Bake)"
            },
            "fields" : {
               "_found" : [
                  1,
                  1,
                  2
               ]
            }