nytimes / ingredient-phrase-tagger

Extract structured data from ingredient phrases using conditional random fields
http://open.blogs.nytimes.com/2016/04/27/structured-ingredients-data-tagging/

Issues parsing irrational numbers #13

Open jackmcdade opened 6 years ago

jackmcdade commented 6 years ago

Have you guys run into issues parsing strings that contain fractions with repeating decimal expansions (not strictly irrational numbers, but ones like 1/3)? For example, the following all return null for qty:

1/3 cup flour
2/3 tsp almond extract
14/15 gallon milk

adammck commented 6 years ago

Could you provide a bit more information here? How are you using this library? Some more examples of things which do vs don't work? I haven't touched this for a long time, so don't have much context.

It does look like your third example won't work (we're only matching a single digit on either side of the slash), but I'd be surprised if "1/3 cup flour" wasn't working, since 1/3 appears so frequently in our training data.
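To illustrate the single-digit limitation mentioned above, here is a minimal sketch. The two patterns and the `is_fraction` helper are hypothetical and written only for this example; they are not copied from the library's source, whose actual matching logic may differ.

```python
import re

# Hypothetical patterns for illustration only (not the library's actual regexes).
# One digit on either side of the slash, e.g. "1/3".
SINGLE_DIGIT_FRACTION = re.compile(r"^\d/\d$")
# One or more digits on either side of the slash, e.g. "14/15".
MULTI_DIGIT_FRACTION = re.compile(r"^\d+/\d+$")


def is_fraction(token, pattern):
    """Return True if the whole token matches the given fraction pattern."""
    return pattern.match(token) is not None


for token in ["1/3", "2/3", "14/15"]:
    print(
        token,
        is_fraction(token, SINGLE_DIGIT_FRACTION),
        is_fraction(token, MULTI_DIGIT_FRACTION),
    )
```

Under these assumed patterns, "14/15" fails the single-digit match but "1/3" and "2/3" pass it, which is why a single-digit rule alone would not explain the 1/3 failure.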

jackmcdade commented 6 years ago

We're using it inside a PHP application as an API, but even just using the included nyt-ingredients-snapshot-2015.csv data and basic CLI instructions from the README we get the same behavior. 14/15 is obviously not something you'd ever encounter in a recipe, but just trying to push the edges of what's actually happening under the hood here.

For example, here's the tagged result of 1/3 cup milk given the basic training model.

# 0.951035
1/3     I1    L4    NoCAP    NoPAREN    OTHER/0.998681
cup     I2    L4    NoCAP    NoPAREN    B-UNIT/0.956263
milk    I3    L4    NoCAP    NoPAREN    B-NAME/0.994245

1/3 is being tagged as OTHER, while 1/2 and 1/4 work just fine.
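For anyone wanting to check other fractions the same way, here is a rough sketch that scans tagger output in the format shown above and lists fraction-like tokens labelled OTHER. The `parse_tagged_output` helper is hypothetical, and it assumes that the first whitespace-separated field is the token and the last is `TAG/probability`.

```python
def parse_tagged_output(text):
    """Parse lines like 'token I1 L4 NoCAP NoPAREN TAG/0.99' into dicts.

    Assumes the first field is the token and the last field is 'TAG/prob'.
    Blank lines and '#' lines (sequence-level scores) are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Split on the last slash so tags containing '-' or '/' stay intact.
        tag, _, prob = fields[-1].rpartition("/")
        rows.append({"token": fields[0], "tag": tag, "prob": float(prob)})
    return rows


SAMPLE = """\
# 0.951035
1/3     I1    L4    NoCAP    NoPAREN    OTHER/0.998681
cup     I2    L4    NoCAP    NoPAREN    B-UNIT/0.956263
milk    I3    L4    NoCAP    NoPAREN    B-NAME/0.994245
"""

rows = parse_tagged_output(SAMPLE)
# Fraction-like tokens that the model labelled OTHER instead of a quantity tag.
suspicious = [r["token"] for r in rows if r["tag"] == "OTHER" and "/" in r["token"]]
print(suspicious)  # ['1/3']
```

Running the same check over output for 1/2 and 1/4 would make it easy to compare which fractions fall through.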

jackmcdade commented 6 years ago

I too would have assumed it wouldn't be an issue on this side, and spent a large amount of time ruling out every other possibility, retraining it with many different subsets of our user-submitted data with no luck. I finally decided to start from the ground up here and noticed that your dataset behaves exactly the same.

Definitely surprised. I'm hoping you have even the slightest idea of what's going on. 🙏