openculinary / knowledge-graph

The RecipeRadar knowledge graph stores and provides access to recipe and ingredient relationship information.
GNU Affero General Public License v3.0

Product names with canonicalizations are not being identified reliably #70

Closed by jayaddison 1 year ago

jayaddison commented 2 years ago

Describe the bug
Among ingredient lines that are not correctly identified by the knowledge graph, coriander and red chillies appear to be among the most frequent. This makes me think that something could be broken in our handling of product canonicalizations (implemented using synonym support in hashedixsearch).

To Reproduce
Ran a query on the backend PostgreSQL database:

select
  unnest(tsvector_to_array(to_tsvector(description))) as term,
  count(*) as freq
from recipe_ingredients
where product_id is null
group by term
order by count(*) desc

...there's a fair amount of noise and stopwords in the results (whole, 1, ...), but also some easy ingredient names that should have been matched to products.
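
To cut down on that noise, one option is to filter the extracted terms before counting them. The following is only a sketch: the numeric filter, the hand-picked exclusion list and the row limit are assumptions added for illustration, and the terms that `to_tsvector` emits depend on the database's default text search configuration.

```sql
-- Sketch: same term-frequency query, with some noise filtered out.
-- The exclusion list and the limit are illustrative assumptions.
select term, count(*) as freq
from (
  select unnest(tsvector_to_array(to_tsvector(description))) as term
  from recipe_ingredients
  where product_id is null
) as terms
where term !~ '^[0-9]+$'       -- drop purely numeric tokens such as quantities
  and term not in ('whole')    -- drop known non-ingredient words
group by term
order by freq desc
limit 50;
```

Extracting the terms in a subquery lets the outer query filter on `term` directly, which is not possible in the `where` clause of the original single-level query.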

Expected behavior
Products with canonicalized names should be identified reliably.

Recommendation
If synonyms are the cause of this, it may be worth writing up a brief design spec about how synonyms should behave in hashedixsearch. As far as I know, this wasn't clearly specified before an initial implementation was provided.

jayaddison commented 1 year ago

Canonicalizations are no longer an issue; this bug was resolved as part of (and was a motivating factor for) openculinary/backend#54.

Based on re-running the repro query from the description (with one slight modification: product_id -> product_name_id), it does appear that the term chilli continues to lack product mappings in a number of cases. That's a separate issue, however, and not related to canonicalization/synonyms.
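
For reference, the re-run query would look roughly like this. It is the query from the description with only the column name changed; it assumes the `recipe_ingredients` table and its `description` column are otherwise unchanged.

```sql
-- Repro query from the description, with product_id replaced by product_name_id.
select
  unnest(tsvector_to_array(to_tsvector(description))) as term,
  count(*) as freq
from recipe_ingredients
where product_name_id is null
group by term
order by count(*) desc;
```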