Closed jayaddison closed 3 years ago
Some context here:
The products.csv file - which is loaded into the backend service database (ref) - should be considered the source of truth for product definitions at the moment. Sure enough, ground is in there as a product, so we'll want to remove it.
The knowledge-graph service finds and returns product names in ingredient text. To do this, it loads an in-process search index with a set of products - and it reads those products from a local file called hierarchy.json.
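To illustrate why a stray ground entry matters, here is a minimal sketch of product-name matching. It is a stand-in only: the real service uses a search index, and the function name find_products and the product sets below are hypothetical, not taken from the knowledge-graph code.

```python
import re


def find_products(ingredient_text, product_names):
    """Return, sorted, the product names that occur as whole words in the text.

    Simplified stand-in for the in-process search index; not the real lookup.
    """
    found = []
    for name in product_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, ingredient_text, flags=re.IGNORECASE):
            found.append(name)
    return sorted(found)


# Hypothetical product sets, before and after removing the adjective.
before = {"ground", "nutmeg", "black pepper"}
after = before - {"ground"}

print(find_products("1 tsp ground nutmeg", before))  # includes "ground"
print(find_products("1 tsp ground nutmeg", after))   # only "nutmeg"
```

With ground in the product set, every ingredient line that uses it as an adjective reports it as a product; once it is removed, only the real product name is returned.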
The hierarchy data is provided by the backend service - it renders it from the database in the service-specific /products/hierarchy endpoint.
This means that after we update the database to remove the product, we will need to re-generate the hierarchy.json file by requesting the latest hierarchy from the backend, and then re-deploy the knowledge-graph service.
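The regeneration step could look roughly like the sketch below. Only the /products/hierarchy path comes from the description above; the function name regenerate_hierarchy, the opener parameter, and the hostname in the usage comment are assumptions for illustration, and the response is written out unmodified because its format isn't described here.

```python
import urllib.request


def regenerate_hierarchy(base_url, output_path, opener=urllib.request.urlopen):
    """Download the rendered hierarchy and write it, unmodified, to a local file.

    `opener` is injectable so the function can be exercised without a network.
    Returns the number of bytes written.
    """
    url = base_url.rstrip("/") + "/products/hierarchy"
    with opener(url) as response:
        payload = response.read()
    with open(output_path, "wb") as f:
        f.write(payload)
    return len(payload)


# Usage (hypothetical hostname; requires network access to the backend):
# regenerate_hierarchy("http://backend-service", "hierarchy.json")
```

The knowledge-graph service would still need to be redeployed afterwards so that it loads the refreshed file.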
Historically the knowledge-graph used to generate the hierarchy.json file itself, as documented in the 'product parsing' writeup. An architectural decision was made to migrate and normalize the source-of-truth into the relational database so that it would be easier to edit and manage (see openculinary/backend#34).
That should eventually make data management easier - it's normalized and manually curated, and the dataset is of a manageable size (thousands of entries), but we're in a non-ideal interim phase at the moment where:
knowledge-graph contains data generated from the database at a fixed point-in-time, which may therefore become stale
This is all resolvable, but it'll require some careful thought and a decent amount of engineering time.
Status:
products.csv update
hierarchy.json file regeneration
knowledge-graph service redeployment
$ python recipes.py --where "exists (select * from recipe_ingredients as ri where ri.recipe_id = recipes.id and ri.markup like '%>ground<%')" --reindex
A few recipes failed during reindexing, so it's possible that some trailing references to the ingredient product ground remain in the recipe index.
Oops; an operational mistake: recrawling (not just reindexing) is required in this case. Perhaps we should remove that distinction in future; it's been very rare to reindex without also recrawling content.
$ python recipes.py --where "exists (select * from recipe_ingredients as ri where ri.recipe_id = recipes.id and ri.markup like '%>ground<%')" --recrawl
Edit: fixup: remove duplicate command-line prompt
One further adjustment to the query: since the word ground often appears at the start of an ingredient line, the letter 'g' is often capitalized. Therefore (and to be honest, as good practice in this case anyway) it makes sense to use a case-insensitive query:
$ python recipes.py --where "exists (select * from recipe_ingredients as ri where ri.recipe_id = recipes.id and ri.markup ilike '%>ground<%')" --recrawl
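The difference between the two operators can be sketched in plain Python, assuming PostgreSQL-style semantics where like is case-sensitive and ilike is not. The helper sql_like and the sample markup strings are hypothetical (the real markup format isn't shown here), and backslash escapes in patterns are not handled.

```python
import re


def sql_like(value, pattern, case_insensitive=False):
    """Approximate LIKE / ILIKE: % matches any run of characters, _ exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.fullmatch("".join(parts), value, flags) is not None


# Hypothetical ingredient markup values.
lower = "<mark>ground</mark> nutmeg"
upper = "<mark>Ground</mark> black pepper"

print(sql_like(lower, "%>ground<%"))                         # like, lowercase 'g'
print(sql_like(upper, "%>ground<%"))                         # like, capital 'G'
print(sql_like(upper, "%>ground<%", case_insensitive=True))  # ilike, capital 'G'
```

Under these assumptions the case-sensitive pattern skips lines that begin with a capitalized Ground, which is why the earlier like query would have missed some affected recipes.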
Most affected recipes have now been recrawled and reindexed; I'm tracing the few remainders.
This is now complete.
Describe the bug
Currently the word ground appears as a product (ingredient) name, although it's not really an ingredient name but instead an adjective.
To Reproduce
Steps to reproduce the behavior:
ground returned as one of the ingredient names in the response
Expected behavior
Only ingredient results that use ground as an adjective (such as "ground nutmeg") should appear in the autosuggest response.