Closed WagnerMarcel closed 2 years ago
Each dataset analysis should contain:
Use Recipe1M
Only use recipes with no more than 20 ingredients and 30 instructions
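The size limit above could be applied with a small filter like the following. This is only a sketch: it assumes each recipe is a dict whose `ingredients` and `instructions` keys hold lists. The helper names `within_limits` and `filter_recipes` are made up for illustration and are not part of any dataset tooling.

```python
MAX_INGREDIENTS = 20
MAX_INSTRUCTIONS = 30


def within_limits(recipe, max_ingredients=MAX_INGREDIENTS,
                  max_instructions=MAX_INSTRUCTIONS):
    """Return True if the recipe has no more than the allowed number of entries."""
    # Assumption: missing keys are treated as empty lists.
    n_ingredients = len(recipe.get("ingredients", []))
    n_instructions = len(recipe.get("instructions", []))
    return n_ingredients <= max_ingredients and n_instructions <= max_instructions


def filter_recipes(recipes):
    """Keep only the recipes that respect both limits."""
    return [recipe for recipe in recipes if within_limits(recipe)]
```

Both limits are inclusive, so a recipe with exactly 20 ingredients and 30 instructions is kept.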
Output data format:
[
{
"title": "NULL",
"ingredients": [
{
"ingredient": "NULL",
"amount": "NULL",
"unit": "NULL",
"instruction": "NULL" (optional, left out in first approach)
}
],
"instructions": [
{
"instruction":"NULL"
}
]
}
]
Formatted with: https://jsonformatter.org
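As an illustration of the output format, a record could be assembled roughly like this. The function `to_output_record` and the sample values are hypothetical, and the optional per-ingredient `instruction` field is left out, as in the first approach.

```python
import json


def to_output_record(title, ingredients, instructions):
    """Build one recipe entry in the output format.

    ingredients: iterable of (ingredient, amount, unit) tuples
    instructions: iterable of instruction strings
    """
    return {
        "title": title,
        "ingredients": [
            {"ingredient": name, "amount": amount, "unit": unit}
            for name, amount, unit in ingredients
        ],
        "instructions": [{"instruction": text} for text in instructions],
    }


# Made-up sample data, only to show the resulting shape.
record = to_output_record(
    "Pancakes",
    [("flour", "2", "cups"), ("eggs", "2", "NULL")],
    ["Mix everything.", "Fry in a pan."],
)
print(json.dumps([record], indent=2))
```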
Property | Amount |
---|---|
Recipes (empty ones removed) | 124647 |
Ingredients (total) | 1316950 |
Ingredients (unique words[^2]) | 22364 |
Numeric expressions in ingredients | 1369071 |
Instructions (total, split by ".") | 1924048 |
Instructions (unique words) | 40387 |
Numeric expressions in instructions | 913097 |
Separating units and ingredients:
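One possible way to separate amount, unit and ingredient is a small rule-based splitter. The sketch below is an assumption, not the method used for the numbers above: the `UNITS` set is a tiny hypothetical sample, and `split_ingredient` only handles a leading amount followed by an optional unit.

```python
import re

# Hypothetical, deliberately incomplete unit list.
UNITS = {"cup", "cups", "tbsp", "tsp", "g", "kg", "ml", "l", "oz", "lb"}

# Leading amount: mixed number ("1 1/2"), fraction ("1/2"), or integer/decimal.
AMOUNT_RE = re.compile(r"^\s*(\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*")


def split_ingredient(line):
    """Split an ingredient string into ingredient, amount and unit."""
    amount, unit = "NULL", "NULL"
    rest = line
    match = AMOUNT_RE.match(line)
    if match:
        amount = match.group(1)
        rest = line[match.end():]
    words = rest.split()
    # Treat the first remaining word as a unit if it is in the sample list.
    if words and words[0].lower().rstrip(".") in UNITS:
        unit = words[0].lower().rstrip(".")
        words = words[1:]
    ingredient = " ".join(words) or "NULL"
    return {"ingredient": ingredient, "amount": amount, "unit": unit}
```

Missing parts are filled with the string "NULL", matching the placeholder in the output data format.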
[^1]: Hashed strings can be replaced with cleartext by running the scraper again.
[^2]: Special characters (",.*®©™()[]") were stripped and the words converted to lowercase. About 10k of the words still appear only once.
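The unique-word count could be reproduced along these lines. This is a sketch under assumptions: words are split on whitespace, and the listed characters are removed anywhere in a word before lowercasing. The names `normalize` and `word_counts` are illustrative only.

```python
from collections import Counter

# Characters listed in the footnote above.
STRIP_CHARS = ",.*®©™()[]"
_DELETE_TABLE = str.maketrans("", "", STRIP_CHARS)


def normalize(word):
    """Remove the listed characters and convert to lowercase."""
    return word.translate(_DELETE_TABLE).lower()


def word_counts(texts):
    """Count normalized words over an iterable of strings."""
    counts = Counter()
    for text in texts:
        for word in text.split():
            word = normalize(word)
            if word:  # skip tokens that consisted only of stripped characters
                counts[word] += 1
    return counts


counts = word_counts(["2 cups Flour,", "1 cup flour (sifted)"])
unique_words = len(counts)
singletons = sum(1 for n in counts.values() if n == 1)
```

Counting the singletons this way is one way to check how many words appear only once.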
https://github.com/torralba-lab/im2recipe-Pytorch
Property | Amount |
---|---|
Recipes | 1029720 |
Number of ingredients[^1] | 9605394 |
Different unique words in ingredients | 91803 |
Numeric expressions in ingredients[^2] | 10582774 |
Number of instructions[^1] | 10767598 |
Different unique words in instructions | 254082 |
Numeric expressions in instructions[^2] | 4187244 |
[^1]: A string that can consist of several words; see PlantUML.
[^2]: Expressions describing amounts, such as "1 apple" or "1/2 apple". These numeric values are kept in one separate category.
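Numeric expressions such as "1 apple" or "1/2 apple" could be counted with a regular expression. The pattern below is an assumption about what counts as one expression: integers, decimals, fractions, and mixed numbers like "1 1/2" each count once.

```python
import re

# One numeric expression: integer, optionally followed by a fraction part
# (" 1/2"), a denominator ("/2") or a decimal part (".5").
NUMERIC_RE = re.compile(r"\d+(?:\s+\d+/\d+|/\d+|\.\d+)?")


def count_numeric_expressions(text):
    """Return the number of numeric expressions found in the text."""
    return len(NUMERIC_RE.findall(text))


def total_numeric_expressions(texts):
    """Sum the counts over an iterable of strings."""
    return sum(count_numeric_expressions(text) for text in texts)
```

A different definition (for example, counting ranges like "2-3" as one expression) would change the totals.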
Property | Amount |
---|---|
Titles in recipeBox (total) | 124595 |
Titles in recipeBox (unique) | 114281 |
Titles in recipe1M (total) | 1029720 |
Titles in recipe1M (unique) | 809994 |
Titles in openRecipes (total) | 173278 |
Titles in openRecipes (unique) | 151025 |
Duplicated titles recipeBox / recipe1M | 74995 |
Duplicated titles openRecipes / recipe1M | 102108 |
Duplicated titles recipeBox / openRecipes | 33102 |
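The title overlap between two datasets can be sketched as a set intersection. How the counts above were actually computed is not stated, so the normalization here (trimming whitespace and lowercasing) is an assumption, and `unique_titles` and `title_overlap` are hypothetical helper names.

```python
def unique_titles(titles):
    """Return the set of normalized, non-empty titles."""
    return {t.strip().lower() for t in titles if t and t.strip()}


def title_overlap(titles_a, titles_b):
    """Count titles that occur in both collections after normalization."""
    return len(unique_titles(titles_a) & unique_titles(titles_b))


# Made-up sample titles, only to show the usage.
recipe_box = ["Apple Pie", "apple pie ", "Banana Bread"]
recipe_1m = ["Apple pie", "Carrot Cake", "banana bread"]
print(len(unique_titles(recipe_box)), title_overlap(recipe_box, recipe_1m))
```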
Property | Amount |
---|---|
Recipes | 173278 |
Number of ingredients | 1693953 |
Different unique words in ingredients | 27454 |
Numeric expressions in ingredients | 2005536 |
Number of instructions | 0 |
Different unique words in instructions | 0 |
Numeric expressions in instructions | 0 |
Property | Amount |
---|---|
Recipes | 28287 |
Number of ingredients | 284307 |
Different unique words in ingredients | 77136 |
Numeric expressions in ingredients | 236035 |
Number of instructions | 232211 |
Different unique words in instructions | 24442 |
Numeric expressions in instructions | 88687 |
[^1]: Approximated to the JSON data format; the recipes are provided in a custom data format.
GanttStart: 2022-01-01
GanttDue: 2022-01-16

Research possible large recipe datasets that can be used to train the NN. Each dataset should contain at least 10k entries to reduce the risk of overfitting.