openculinary / knowledge-graph

The RecipeRadar knowledge graph stores and provides access to recipe and ingredient relationship information.
GNU Affero General Public License v3.0

Markup term consumption is incorrect for ngrams > 1 #32

Closed jayaddison closed 4 years ago

jayaddison commented 4 years ago

Describe the bug

When generating markup for ingredients whose names contain more than one word (i.e. ngrams > 1), the markup engine discards words that appear towards the end of the input.

For three-word ingredient names, two words are dropped. For two-word ingredient names, one word is dropped. Single-word ingredient names are not affected.

Duplicate words from the ingredient name appear in place of the dropped words.

For example:

$ curl -H 'Host: knowledge-graph' -XPOST 192.168.100.1:30080/ingredients/query --data 'descriptions[]=large red bell pepper for burritos'  | jq
{
  "results": {
    "large red bell pepper for burritos": {
      "ancestors": [
        "bell pepper",
        "pepper"
      ],
      "category": null,
      "contents": [
        "red bell pepper"
      ],
      "is_plural": false,
      "markup": "large <mark>red bell pepper</mark> bell pepper",
      "plural": "red bell peppers",
      "product": "red bell pepper",
      "singular": "red bell pepper"
    }
  }
}
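
The defect in the response above can be detected mechanically: removing the mark tags from the markup field should give back exactly the words of the original description. The following is a minimal Python sketch of that check. The function name `markup_preserves_description` is hypothetical and is not part of the knowledge-graph codebase; the two strings are copied from the response above.

```python
import re


def markup_preserves_description(description: str, markup: str) -> bool:
    """Return True if the markup, with <mark> tags removed, contains
    exactly the same sequence of words as the original description.

    Hypothetical helper for illustration only.
    """
    stripped = re.sub(r"</?mark>", "", markup)
    return stripped.split() == description.split()


description = "large red bell pepper for burritos"
buggy_markup = "large <mark>red bell pepper</mark> bell pepper"

# The trailing words "for burritos" were replaced by "bell pepper",
# so the check fails for the response shown in this report.
print(markup_preserves_description(description, buggy_markup))  # False
```

A check like this could serve as a regression test for multi-word ingredient names once the markup generation is corrected.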

To Reproduce

Steps to reproduce the behavior:

  1. Query the knowledge-graph using an ingredient line that contains a multi-word ingredient name
  2. Observe that the end of the markup response field contains incorrect words

Expected behavior

All of the words from the original ingredient description should appear, and the ingredient name should be marked.
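
As an illustration of the expected behavior, here is a minimal Python sketch that marks a multi-word ingredient name and consumes exactly as many input tokens as the name contains. It is not the knowledge-graph implementation: the function `mark_ingredient` is hypothetical, and it assumes whitespace tokenization and an exact match between the ingredient name and the description (no stemming or plural handling).

```python
def mark_ingredient(description: str, product: str) -> str:
    """Wrap the first occurrence of `product` in <mark> tags while keeping
    every other word of `description` in its original position.

    Hypothetical sketch: exact whitespace-token matching only.
    """
    tokens = description.split()
    term = product.split()
    n = len(term)

    output = []
    i = 0
    marked = False
    while i < len(tokens):
        if not marked and n > 0 and tokens[i:i + n] == term:
            output.append("<mark>" + " ".join(term) + "</mark>")
            # Advance past all n matched tokens, not just one, so that
            # the words after the ingredient name are still emitted.
            i += n
            marked = True
        else:
            output.append(tokens[i])
            i += 1
    return " ".join(output)


print(mark_ingredient("large red bell pepper for burritos", "red bell pepper"))
# large <mark>red bell pepper</mark> for burritos
```

The point of the sketch is the index arithmetic: after a match of n tokens the cursor moves forward by n, so the words that follow the ingredient name ("for burritos" in the example) are preserved rather than dropped or replaced.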