Add TASTEset data - GitHub Issues

boxydog commented 7 months ago

https://github.com/taisti/TASTEset/tree/main

MIT license.

In https://github.com/strangetom/ingredient-parser/discussions/6, strangetom says "more diverse data could be good", i.e. more different datasets.

strangetom commented 6 months ago

I've been having a look at integrating the TASTEset data into the training data. The current state of what I've done is in the tasteset branch.

The labels already in this data turned out to be less useful than I expected because of the way the labelling was done, but since there were only ~3700 new sentences it wasn't that hard to fix the errors manually. There seems to be a negligible impact on model performance, most likely due to the small number of sentences added (about a 6% increase in the total training data size).

So why haven't I merged this branch into develop yet?

Some of the sentences make me suspect that they have been edited from the form they originally had in the recipe. For example:

- 150 grams sugar (/, 5 1/3 ounces) _Why is the '/' on its own?_
- 2 large egg whites (extra-, at room temperature) _Why is 'extra-' separated from 'large'?_
- 1 tablespoon orange liqueur (I use Cointreau) or 1 tablespoon orange juice (I use Cointreau) _Why the repetition?_
- 1⁄2 - 1 cup seasoned bread crumbs or 1/2-1 cup panko breadcrumbs _Why does the first fraction use a Unicode FRACTION SLASH but the second one doesn't?_

There are a few weird things going on, but it looks to me as though the sentences have been edited to fit a particular structure, and that process has introduced some errors.
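To get a rough idea of how widespread these oddities are, the four patterns from the examples above could be checked for automatically. This is only a minimal sketch: the function name `suspicious_features` and the regular expressions are illustrative assumptions, not part of ingredient-parser or TASTEset, and they only cover the specific cases listed here.

```python
import re

# U+2044 FRACTION SLASH, as opposed to the ASCII '/' (U+002F).
FRACTION_SLASH = "\u2044"


def suspicious_features(sentence: str) -> list[str]:
    """Return labels for the editing artefacts spotted in the examples.

    Hypothetical helper: the checks are heuristics, not a definitive test.
    """
    flags = []

    # Fraction written with the Unicode fraction slash, e.g. "1⁄2".
    if FRACTION_SLASH in sentence:
        flags.append("unicode fraction slash")

    # A '/' on its own at the start of a parenthetical, e.g. "(/, 5 1/3 ounces)".
    if re.search(r"\(\s*/\s*,", sentence):
        flags.append("lone slash")

    # A word ending in a hyphen at the start of a parenthetical, e.g. "(extra-, ...)".
    if re.search(r"\(\s*\w+-\s*,", sentence):
        flags.append("dangling hyphen")

    # The same parenthetical text appearing more than once.
    parentheticals = re.findall(r"\(([^()]*)\)", sentence)
    if len(parentheticals) != len(set(parentheticals)):
        flags.append("repeated parenthetical")

    return flags


if __name__ == "__main__":
    for s in [
        "150 grams sugar (/, 5 1/3 ounces)",
        "2 large egg whites (extra-, at room temperature)",
    ]:
        print(s, suspicious_features(s))
```

Counting how many of the ~3700 sentences get at least one flag would give a lower bound on how much of the data has been restructured, since edits that leave no visible trace won't be caught.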

The reason I think this might be a problem is that the aim of this library is to parse ingredient sentences, which means it needs to be trained on examples of ingredient sentences, and I'm not convinced all the sentences in this dataset are representative examples. The relatively small number of sentences compared to the rest of the data is probably why the impact on the model performance is negligible.

So, for now, I'm going to leave this in the tasteset branch until I decide to merge it or discard it.

boxydog commented 6 months ago

If you're not very sure of it, just discard it. Sorry I put you onto some useless effort.

Maybe it would be useful to try to identify some likely deficiencies in the current data? Perhaps rare ingredients (i.e. ingredients for which we don't have enough data)?

strangetom commented 6 months ago

No need to apologise, it looked useful and so was worth investigating.

> Perhaps rare ingredients (i.e. ingredients for which we don't have enough data)?

There might be some value in comparing the ingredients in the training data with a database like https://fdc.nal.usda.gov/index.html (excluding their branded foods data) and seeing if that reveals any deficiencies. The difficult part would be matching up the ingredients common to both, because the names won't align exactly.
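One way to approach the name-alignment problem would be approximate string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library; the helper names `normalise` and `find_missing`, the 0.8 cutoff, and the sample names are all assumptions for illustration, and real database descriptions would likely need more normalisation (for example, stripping descriptors such as preparation state) before matching works well.

```python
from difflib import get_close_matches


def normalise(name: str) -> str:
    """Lowercase, drop commas and collapse whitespace (hypothetical helper)."""
    return " ".join(name.lower().replace(",", " ").split())


def find_missing(reference_names, training_names, cutoff=0.8):
    """Return reference names with no close match in the training names.

    `cutoff` is the minimum difflib similarity ratio (0 to 1) for two
    names to count as the same ingredient; 0.8 is an arbitrary choice.
    """
    training = sorted({normalise(n) for n in training_names})
    missing = []
    for ref in reference_names:
        # An empty result means nothing in the training data is similar enough.
        if not get_close_matches(normalise(ref), training, n=1, cutoff=cutoff):
            missing.append(ref)
    return missing


if __name__ == "__main__":
    # Made-up sample data, not taken from either dataset.
    training = ["olive oil", "plain flour", "Onions"]
    reference = ["Onion", "Olive oil", "Jicama"]
    print(find_missing(reference, training))
```

Anything returned by `find_missing` would be a candidate for an ingredient that is rare or absent in the training data, though each candidate would still need checking by hand because the similarity cutoff will produce both false matches and false misses.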

strangetom commented 3 months ago

Closing this as I no longer intend to use the TASTEset data.

I've recently included 15,000 sentences from the All Recipes data found in https://archive.org/details/recipes-en-201706, which has a higher proportion of branded ingredient names than the other datasets.

strangetom / ingredient-parser

Add TASTEset data #15