openculinary / knowledge-graph

The RecipeRadar knowledge graph stores and provides access to recipe and ingredient relationship information.
GNU Affero General Public License v3.0

The adjective "ground" is listed as an ingredient name #64

Closed by jayaddison 3 years ago

jayaddison commented 3 years ago

Describe the bug: Currently the word "ground" appears as a product (ingredient) name, although it is an adjective rather than an ingredient name.

To Reproduce: Steps to reproduce the behavior:

  1. Request https://www.reciperadar.com/api/autosuggest/ingredients?pre=ground (ingredient autosuggestions)
  2. Find "ground" returned as one of the ingredient names in the response

Expected behavior: Only ingredient results that use "ground" as an adjective (such as "ground nutmeg") should appear in the autosuggest response.
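
The reproduction steps and expected behavior can be sketched as a small check. This is a minimal sketch, not the project's test code: the helper names are hypothetical, and the response is assumed (for illustration only) to have been reduced to a plain list of ingredient name strings, since the actual response format is not shown in this issue.

```python
from urllib.parse import urlencode

# Endpoint taken from the reproduction steps in this issue.
AUTOSUGGEST_URL = "https://www.reciperadar.com/api/autosuggest/ingredients"


def build_autosuggest_url(prefix):
    """Build the autosuggest request URL for a given prefix (step 1)."""
    return f"{AUTOSUGGEST_URL}?{urlencode({'pre': prefix})}"


def bare_adjective_hits(names, adjective):
    """Return suggestions that consist of the adjective alone (step 2).

    Assumption: `names` is a plain list of ingredient name strings; the real
    response shape is not documented in this issue.
    """
    target = adjective.casefold()
    return [name for name in names if name.strip().casefold() == target]


# Hypothetical sample data, not real API output.
sample = ["ground", "ground nutmeg", "ground beef"]
print(build_autosuggest_url("ground"))
print(bare_adjective_hits(sample, "ground"))
```

Under these assumptions, the bug is present whenever `bare_adjective_hits` returns a non-empty list; once fixed, only compound names such as "ground nutmeg" would remain in the suggestions.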

jayaddison commented 3 years ago

Some context here:

Historically, the knowledge-graph generated the hierarchy.json file itself, as documented in the 'product parsing' writeup. An architectural decision was made to migrate and normalize the source of truth into the relational database so that it would be easier to edit and manage (see openculinary/backend#34).

That should eventually make data management easier - it's normalized and manually curated, and the dataset is of a manageable size (thousands of entries), but we're in a non-ideal interim phase at the moment where:

This is all resolvable, but it'll require some careful thought and a decent amount of engineering time.

jayaddison commented 3 years ago

Status:

jayaddison commented 3 years ago

$ python recipes.py --where "exists (select * from recipe_ingredients as ri where ri.recipe_id = recipes.id and ri.markup like '%>ground<%')" --reindex

jayaddison commented 3 years ago

A few recipes failed during reindexing, so it's possible that some trailing references to the ingredient product ground remain in the recipe index.

jayaddison commented 3 years ago

Oops; an operational mistake: recrawling (not just reindexing) is required in this case. Perhaps we should remove that distinction in future; it's been very rare to reindex without also recrawling content.

$ python recipes.py --where "exists (select * from recipe_ingredients as ri where ri.recipe_id = recipes.id and ri.markup like '%>ground<%')" --recrawl

Edit: fixup: remove duplicate command-line prompt

jayaddison commented 3 years ago

One further adjustment to the query: since the word ground often appears at the start of an ingredient line, the letter 'g' is often capitalized. Therefore (and to be honest, as good practice in this case anyway) it makes sense to use a case-insensitive query:

$ python recipes.py --where "exists (select * from recipe_ingredients as ri where ri.recipe_id = recipes.id and ri.markup ilike '%>ground<%')" --recrawl
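
The difference between the case-sensitive and case-insensitive queries can be illustrated without a database. The sketch below emulates the '%>ground<%' pattern in plain Python; the `<mark>` tags and sample markup strings are hypothetical (the actual markup format is not shown in this issue), and the two functions are simplified stand-ins for the SQL `like` and `ilike` operators, not the project's code.

```python
PATTERN = ">ground<"  # inner text of the '%>ground<%' pattern


def matches_like(markup):
    """Case-sensitive substring match, emulating `like '%>ground<%'`."""
    return PATTERN in markup


def matches_ilike(markup):
    """Case-insensitive substring match, emulating `ilike '%>ground<%'`."""
    return PATTERN.casefold() in markup.casefold()


# Hypothetical ingredient markup; the element name is an assumption.
samples = [
    "1 tsp <mark>ground</mark> nutmeg",
    "<mark>Ground</mark> black pepper, to taste",
    "2 cups <mark>flour</mark>",
]

for markup in samples:
    print(matches_like(markup), matches_ilike(markup), markup)
```

Only the case-insensitive check catches the second sample, where the word starts the ingredient line and is capitalized.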

Most affected recipes have now been recrawled and reindexed; I'm tracing the few remainders.

jayaddison commented 3 years ago

This is now complete.