problem parsing two-word ingredients that begin with lower-case 'a'

Stuckyville commented 5 years ago

When entering a two-word ingredient where the first word begins with lower-case 'a', the parser strips the leading 'a' and treats it as a quantity. For example, 'apple juice' becomes 'pple juice' with a quantity of '1'. Further details are discussed at https://answers.launchpad.net/gourmet/+question/678095

saxon-s commented 5 years ago

Environment: Gourmet 0.17.4 and master branch on Ubuntu and Windows.

Steps to reproduce:

Click the "New" button to create a new recipe.
Click the "Ingredients" tab.
Enter each of the following ingredients, one at a time, in the "Add ingredient" text field: "apple juice" "Apple juice" "apricot" "an avocado" "a beet" "a dozen eggs" "a pair of Yubari King melons"

Expected Results:

Expect ingredients to be listed as: "apple juice" "Apple juice" "apricot" "1 avocado" "1 beet" "12 eggs" "2 Yubari King melons"

Actual Results:

Instead, ingredients are listed as: "1 pple juice" "Apple juice" "apricot" "1 avocado" "1 beet" "12 eggs" "2 Yubari King melons"

Analysis: If an ingredient string contains more than one word and its first word starts with a lower-case "a", that leading "a" is stripped off and replaced with a quantity of "1". Separately, "a dozen" is replaced with a quantity of "12" and "a pair" with a quantity of "2".

Gourmet is designed to translate number words into their numeric equivalents, for example: "a" --> "1" "an" --> "1" "a couple" --> "2" "a dozen" --> "12" "twenty" --> "20"

Conclusion:

There appears to be a bug in the ingredient parser. The parser should only translate "a" to "1" when "a" is a standalone word, not when it is the first letter of a longer word.
In addition, the ingredient parser does not translate capitalized number words correctly. For example, "A dozen" is not translated to a quantity of "12".
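
The behaviour described in the conclusion could be sketched roughly as follows. This is not Gourmet's code: the `NUMBER_WORDS` subset, `LEADING_NUMBER_WORD`, and `split_quantity` are hypothetical names used only for illustration. The sketch assumes the quantity word must be a whole word at the start of the string and that case should be ignored.

```python
import re

# Illustrative subset taken from the examples in this thread;
# Gourmet's real number-word table is larger (assumption).
NUMBER_WORDS = {"a": 1, "an": 1, "a couple": 2, "a dozen": 12, "twenty": 20}


def _to_pattern(phrase):
    # Allow any run of whitespace between the words of a phrase.
    return r"\s+".join(re.escape(part) for part in phrase.split())


# Longest phrases first, so "a dozen" is tried before "a".
_words = sorted(NUMBER_WORDS, key=len, reverse=True)

# \b after the alternation rejects "a" when it is only the first
# letter of a longer word such as "apple".
LEADING_NUMBER_WORD = re.compile(
    r"^\s*(?P<word>" + "|".join(_to_pattern(w) for w in _words) + r")\b\s+",
    re.IGNORECASE,
)


def split_quantity(text):
    """Return (quantity, rest); quantity is None if no number word leads."""
    m = LEADING_NUMBER_WORD.match(text)
    if not m:
        return None, text.strip()
    # Normalise case and whitespace before the table lookup.
    key = " ".join(m.group("word").lower().split())
    return NUMBER_WORDS[key], text[m.end():].strip()


print(split_quantity("apple juice"))   # (None, 'apple juice')
print(split_quantity("a beet"))        # (1, 'beet')
print(split_quantity("A dozen eggs"))  # (12, 'eggs')
```

Normalising the matched text to lower case before the lookup is what lets "A dozen" resolve to the same entry as "a dozen".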

martinp26 commented 4 years ago

There are multiple problems here:

NUMBER_WORD_REGEXP is missing word boundaries around the individual regex elements, so it finds 'a' in the middle of words. I am not sure whether adding them would be enough on its own.
The number words are also NOT put through translation: the German version still has "one" ... "ten" in the regex. A side effect is that the search terminates early in the minutes translation, where the "ten" in "Minuten" is matched as a number word and leaves "Minu", which then does not parse. As a result, re-editing a recipe loses its time annotations.
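
Both problems can be reproduced in isolation. The snippet below is only a sketch: the `number_words` list is a hypothetical four-word subset, and the two patterns are built here for illustration rather than copied from gourmet/convert.py.

```python
import re

# Hypothetical subset of number words, longest first.
number_words = ["ten", "one", "an", "a"]

# Plain alternation, with no word boundaries around the elements.
unbounded = re.compile("|".join(number_words))

# The same alternation wrapped in \b...\b so only whole words match.
bounded = re.compile(r"\b(?:" + "|".join(number_words) + r")\b")

print(unbounded.findall("apple juice"))  # ['a']
print(unbounded.findall("12 Minuten"))   # ['ten']

print(bounded.findall("apple juice"))    # []
print(bounded.findall("12 Minuten"))     # []
print(bounded.findall("a beet"))         # ['a']
```

Word boundaries remove the false matches here, but they do not address the missing translation of the number words themselves.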

A simple workaround is this in gourmet/convert.py:

@@ -644,7 +648,7 @@ all_number_words.sort( lambda x,y: ((len(y)>len(x) and 1) or (len(x)>len(y) and -1) or 0) )

-NUMBER_WORD_REGEXP = '|'.join(all_number_words).replace(' ','\s+')
+NUMBER_WORD_REGEXP = None
 FRACTION_WORD_REGEXP = '|'.join(filter(lambda n: NUMBER_WORDS[n]<1.0, all_number_words) ).replace(' ','\s+')

I believe the NUMBER_FINDER.finditer(timestring) call in timestring_to_seconds should not blindly look for the next number-like match. It should resume searching only after the non-number words following the previous match have been consumed.

"12 Minuten" is currently parsed as [12 Minu] [ten]

saxon-s commented 4 years ago

@martinp26 Thank you for investigating the issue and the simple workaround.

thinkle / gourmet

problem parsing two-word ingredients that begin with lower-case 'a' #931