The adjective "ground" is listed as an ingredient name

jayaddison commented 3 years ago

Describe the bug: Currently the word "ground" appears as a product (ingredient) name, although it is an adjective rather than an ingredient name.

To Reproduce: Steps to reproduce the behavior:

1. Request https://www.reciperadar.com/api/autosuggest/ingredients?pre=ground (ingredient autosuggestions).
2. Find "ground" returned as one of the ingredient names in the response.

Expected behavior: Only ingredient results that use "ground" as an adjective (such as "ground nutmeg") should appear in the autosuggest response.
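The expected behavior can be expressed as a small check. This is only a sketch: the function name and the sample suggestion list are hypothetical, and it assumes the autosuggest results have already been reduced to a plain list of ingredient names (the actual response format is not described here).

```python
def bare_adjective_hits(suggestions, adjective):
    """Return the suggestions that consist solely of the adjective itself."""
    target = adjective.strip().lower()
    return [name for name in suggestions if name.strip().lower() == target]


# Hypothetical suggestion names, as they might look before the fix
before_fix = ["ground", "ground nutmeg", "ground cinnamon"]

# Any non-empty result here indicates the bug is still present
print(bare_adjective_hits(before_fix, "ground"))
```

Once the product has been removed, the same check against the live suggestions should return an empty list.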

jayaddison commented 3 years ago

Some context here:

The products.csv file, which is loaded into the backend service database, should be considered the source of truth for product definitions at the moment. Sure enough, "ground" is in there as a product, so we'll want to remove it.
The knowledge-graph service finds and returns product names in ingredient text. To do this, it loads an in-process search index with a set of products - and it reads those products from a local file called hierarchy.json.
The hierarchy data is provided by the backend service - it renders it from the database in the service-specific /products/hierarchy endpoint.
This means that after we update the database to remove the product, we will need to regenerate the hierarchy.json file by requesting the latest hierarchy from the backend, and then redeploy the knowledge-graph service.
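The regeneration step described above could be sketched roughly as follows. Only the /products/hierarchy path and the hierarchy.json filename come from the description; the function names, the backend base URL, and the choice to store the response body unmodified are assumptions made for illustration.

```python
import tempfile
import urllib.request
from pathlib import Path


def hierarchy_url(backend_base_url):
    """Build the URL of the backend's /products/hierarchy endpoint."""
    return backend_base_url.rstrip("/") + "/products/hierarchy"


def fetch_hierarchy(backend_base_url, timeout=30):
    """Download the latest hierarchy from the backend as raw bytes."""
    with urllib.request.urlopen(hierarchy_url(backend_base_url), timeout=timeout) as response:
        return response.read()


def write_hierarchy(body, destination):
    """Write the hierarchy to disk, replacing any previous copy in one step."""
    destination = Path(destination)
    # Write to a temporary file in the same directory, then rename it over
    # the old file so that readers never observe a partially written file.
    with tempfile.NamedTemporaryFile(dir=destination.parent, delete=False) as tmp:
        tmp.write(body)
    Path(tmp.name).replace(destination)
```

Usage would look something like `write_hierarchy(fetch_hierarchy("http://backend-service"), "hierarchy.json")`, where the hostname is a placeholder, followed by a redeployment of the knowledge-graph service so that it loads the new file.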

Historically the knowledge-graph used to generate the hierarchy.json file itself, as documented in the 'product parsing' writeup. An architectural decision was made to migrate and normalize the source-of-truth into the relational database so that it would be easier to edit and manage (see openculinary/backend#34).

That should eventually make data management easier: the data is normalized and manually curated, and the dataset is of a manageable size (thousands of entries). However, we're in a non-ideal interim phase at the moment, where:

We do not have intuitive ways to manage the database and/or CSV contents and keep them in sync
The knowledge-graph contains data generated from the database at a fixed point-in-time, which may therefore become stale
Explaining and actually applying updates to single products takes longer than it should

This is all resolvable, but it'll require some careful thought and a decent amount of engineering time.

jayaddison commented 3 years ago

Status:

[x] products.csv update
[x] hierarchy.json file regeneration
[x] knowledge-graph service redeployment
[x] affected recipe reindexing
[x] verification

jayaddison commented 3 years ago

$ python recipes.py --where "exists (select * from recipe_ingredients as ri where ri.recipe_id = recipes.id and ri.markup like '%>ground<%')" --reindex

jayaddison commented 3 years ago

A few recipes failed during reindexing, so it's possible that some trailing references to the ingredient product ground remain in the recipe index.

jayaddison commented 3 years ago

Oops; an operational mistake: recrawling (not just reindexing) is required in this case. Perhaps we should remove that distinction in future; it's been very rare to reindex without also recrawling content.

$ python recipes.py --where "exists (select * from recipe_ingredients as ri where ri.recipe_id = recipes.id and ri.markup like '%>ground<%')" --recrawl

Edit: fixup: remove duplicate command-line prompt

jayaddison commented 3 years ago

One further adjustment to the query: since the word ground often appears at the start of an ingredient line, the letter 'g' is often capitalized. Therefore (and to be honest, as good practice in this case anyway) it makes sense to use a case-insensitive query:

$ python recipes.py --where "exists (select * from recipe_ingredients as ri where ri.recipe_id = recipes.id and ri.markup ilike '%>ground<%')" --recrawl
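To illustrate why the switch from like to ilike matters, here is a minimal Python emulation of the two operators. It assumes PostgreSQL-style semantics (LIKE is case-sensitive, ILIKE is not) and ignores escape characters; the helper names and the `<mark>` tag in the sample markup are hypothetical.

```python
import re


def like_to_regex(pattern, case_insensitive=False):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # % matches any sequence of characters
        elif ch == "_":
            parts.append(".")   # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.compile("".join(parts), flags)


def like(value, pattern):
    """Case-sensitive match of the whole value, as with LIKE."""
    return like_to_regex(pattern).fullmatch(value) is not None


def ilike(value, pattern):
    """Case-insensitive match of the whole value, as with ILIKE."""
    return like_to_regex(pattern, case_insensitive=True).fullmatch(value) is not None


# Hypothetical markup for an ingredient line that starts with "Ground"
markup = "<mark>Ground</mark> nutmeg"

print(like(markup, "%>ground<%"))   # False: the capital 'G' is missed
print(ilike(markup, "%>ground<%"))  # True
```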

Most affected recipes have now been recrawled and reindexed; I'm tracing the few remainders.

jayaddison commented 3 years ago

This is now complete.

openculinary / knowledge-graph

The adjective "ground" is listed as an ingredient name #64