boxydog closed this 3 months ago
I've been having a look at integrating the TASTEset data into the training data. The current state of what I've done is in the tasteset branch.
The dataset's existing labels turned out to be less useful than I expected because of the way the labelling was done, but since there were only ~3700 new sentences, it wasn't that hard to fix the errors manually. The impact on model performance seems negligible, most likely because so few sentences were added (about a 6% increase in the total training data size).
So why haven't I merged this branch into develop yet?
Some of the sentences have made me suspect that they were edited from the form they originally had in the recipe. For example:
There are a few weird things going on, but it looks to me like the sentences have been edited to fit a particular structure, and that process has resulted in some errors.
The reason I think this might be a problem is that the aim of this library is to parse ingredient sentences, which means it needs to be trained on examples of ingredient sentences, and I'm not convinced all the sentences in this dataset are representative examples. The relatively small number of sentences compared to the rest of the data is probably why the impact on the model performance is negligible.
So, for now, I'm going to leave this in the tasteset branch until I decide to merge it or discard it.
If you're not very sure of it, just discard it. Sorry I put you onto some useless effort.
Maybe it would be useful to try to identify some likely deficiencies in the current data? Perhaps rare ingredients (i.e. ingredients for which we don't have enough data)?
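As a rough illustration of what "rare" could mean here, the sketch below counts how often each labelled ingredient name occurs and lists the ones seen only a handful of times. The function name `rare_ingredient_names`, the `max_count` threshold, and the sample list are all hypothetical; nothing here reflects how the library's training data is actually stored.

```python
from collections import Counter


def rare_ingredient_names(labelled_names, max_count=2):
    """Return (name, count) pairs for names seen at most max_count times.

    Names are lower-cased and stripped before counting, so "Onion" and
    "onion" are treated as the same ingredient. Rarest names come first.
    """
    counts = Counter(name.strip().lower() for name in labelled_names)
    rare = [(name, n) for name, n in counts.items() if n <= max_count]
    return sorted(rare, key=lambda item: (item[1], item[0]))


# Hypothetical sample of ingredient names pulled from labelled sentences.
names = ["onion", "Onion", "garlic", "sumac", "onion", "garlic", "garlic"]
print(rare_ingredient_names(names))
```

The threshold would need tuning against the real data, and plural or misspelled forms would still be counted separately.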
No need to apologise, it looked useful and so was worth investigating.
> Perhaps rare ingredients (i.e. ingredients for which we don't have enough data)?
There might be some value in comparing the ingredients in the training data with a database like https://fdc.nal.usda.gov/index.html (excluding their branded foods data) and seeing if that reveals any deficiencies. The difficult bit would be trying to find the common ingredients because the names won't align exactly.
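One possible way to handle names that don't align exactly is approximate string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library to flag reference names that have no close match among the training names. The function name `unmatched_reference_names`, the `cutoff` value, and both sample lists are assumptions for illustration only; they are not taken from the USDA data or from the library.

```python
import difflib


def unmatched_reference_names(reference_names, training_names, cutoff=0.8):
    """Return reference names with no close match among the training names.

    Both sides are lower-cased and stripped before comparison. A reference
    name counts as covered if any training name reaches the similarity
    cutoff used by difflib.get_close_matches.
    """
    training = sorted({name.strip().lower() for name in training_names})
    missing = []
    for ref in reference_names:
        key = ref.strip().lower()
        if not difflib.get_close_matches(key, training, n=1, cutoff=cutoff):
            missing.append(ref)
    return missing


# Hypothetical names; a real reference list would come from the database.
reference = ["Onions", "Sumac", "Garlic"]
training = ["onion", "garlic", "olive oil"]
print(unmatched_reference_names(reference, training))
```

Character-level similarity like this copes with plurals and small spelling differences, but longer descriptive database entries (for example, a name followed by a preparation or variety) would probably need normalising first.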
Closing this as I don't intend to use the tasteset data anymore.
I've recently included 15,000 sentences from the All Recipes data found in https://archive.org/details/recipes-en-201706, which has a higher proportion of branded ingredient names than the other datasets.
https://github.com/taisti/TASTEset/tree/main
MIT license.
In https://github.com/strangetom/ingredient-parser/discussions/6, strangetom says "more diverse data could be good", i.e. more different datasets.