mscholl96 / mad-recime

ReciMe is a fancy recipe generator based on AI. It offers an iOS App with remarkable user experience.

Research possible datasets to use #5

Closed WagnerMarcel closed 2 years ago

WagnerMarcel commented 2 years ago

GanttStart: 2022-01-01
GanttDue: 2022-01-16

Research large recipe datasets that can be used to train the NN. Each dataset should contain at least 10k entries to help prevent overfitting.

WagnerMarcel commented 2 years ago

Each dataset analysis should contain:

WagnerMarcel commented 2 years ago

Use Recipe1M

Only use recipes with no more than 20 ingredients and 30 instructions
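
The size limit above could be applied as a simple pre-filter. This is a minimal sketch, assuming each recipe is already a dict with `ingredients` and `instructions` lists; the names `within_limits` and `filter_recipes` are hypothetical and not part of the project.

```python
from typing import Iterable, Iterator

MAX_INGREDIENTS = 20   # limit stated in the comment above
MAX_INSTRUCTIONS = 30  # limit stated in the comment above


def within_limits(recipe: dict) -> bool:
    """Return True if the recipe respects both size limits."""
    return (
        len(recipe.get("ingredients", [])) <= MAX_INGREDIENTS
        and len(recipe.get("instructions", [])) <= MAX_INSTRUCTIONS
    )


def filter_recipes(recipes: Iterable[dict]) -> Iterator[dict]:
    """Lazily yield only the recipes that pass the size filter."""
    return (r for r in recipes if within_limits(r))
```

Returning a generator keeps memory use flat, which may matter for a dataset of about one million recipes.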

Output data format:

[
  {
    "title": "NULL",
    "ingredients": [
      {
        "ingredient": "NULL",
        "amount": "NULL",
        "unit": "NULL",
        "instruction": "NULL" (optional, left out in first approach)
      }
    ],
    "instructions": [
      {
        "instruction": "NULL"
      }
    ]
  }
]

Formatted with: https://jsonformatter.org
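
As an illustration of the output format, a record could be assembled and serialized as follows. This is only a sketch: `make_ingredient` and `make_recipe` are hypothetical helpers, the optional per-ingredient `instruction` field is left out as in the first approach, and the string `"NULL"` is assumed to be the placeholder for missing values.

```python
import json


def make_ingredient(ingredient="NULL", amount="NULL", unit="NULL"):
    """Build one ingredient entry; missing fields default to "NULL"."""
    return {"ingredient": ingredient, "amount": amount, "unit": unit}


def make_recipe(title, ingredients, instructions):
    """Build one recipe record in the output format sketched above.

    ingredients:  iterable of (ingredient, amount, unit) tuples; trailing
                  elements may be omitted and then default to "NULL"
    instructions: iterable of instruction strings
    """
    return {
        "title": title,
        "ingredients": [make_ingredient(*i) for i in ingredients],
        "instructions": [{"instruction": s} for s in instructions],
    }


recipe = make_recipe(
    "Pancakes",
    [("flour", "200", "g"), ("egg",)],
    ["Mix.", "Fry."],
)
# The top-level structure is a list of recipe records.
print(json.dumps([recipe], indent=2))
```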

WagnerMarcel commented 2 years ago

Recipe box

Data Structure of recipeBox[^1]:

recipeBox

| Property | Amount |
| --- | --- |
| Recipes (stripped by empty ones) | 124647 |
| Ingredients (total) | 1316950 |
| Ingredients (unique words[^2]) | 22364 |
| Numeric expressions in ingredients | 1369071 |
| Instructions (total split by ".") | 1924048 |
| Instructions (unique words) | 40387 |
| Numeric expressions in instructions | 913097 |
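
The unique-word counts can be reproduced roughly as follows. This is a sketch under assumptions: the footnote on unique words is read as "remove the listed characters anywhere in the string, then lowercase", words are split on whitespace, and `word_counts` and `singletons` are hypothetical names.

```python
from collections import Counter

# Characters listed in the footnote on unique words.
STRIP_CHARS = ",.*®©™()[]"
_STRIP_TABLE = str.maketrans("", "", STRIP_CHARS)


def word_counts(lines):
    """Count lowercase words after removing the stripped characters."""
    counts = Counter()
    for line in lines:
        counts.update(line.translate(_STRIP_TABLE).lower().split())
    return counts


def singletons(counts):
    """Return the words that occur exactly once."""
    return [word for word, n in counts.items() if n == 1]


counts = word_counts(["2 cups Flour (sifted)", "1 cup flour."])
print(len(counts))         # number of unique words
print(singletons(counts))  # words that appear only once
```

Words that appear only once are worth inspecting separately, since many of them are likely typos or scraping artifacts rather than real vocabulary.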

Words in Ingredients

With numeric expressions

ingredientsWithNumerical

Without numeric expressions

ingredientsWithoutNumerical

Words in Instructions

With numeric expressions

instructionsWithNumerical

Without numeric expressions

instructionsWithoutNumerical

Initial Parsing

Separating units and ingredients: initialParsingUnitIngredients
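
One possible way to separate amount, unit, and ingredient is a leading-number regex followed by a unit lookup. This is a hedged sketch and not the parser used for the figure above: the unit list `KNOWN_UNITS` is a small hypothetical sample, and `parse_ingredient` is a hypothetical name.

```python
import re

# Hypothetical sample of units; a real list would be much longer.
KNOWN_UNITS = {"cup", "cups", "tbsp", "tsp", "g", "kg", "ml", "l", "oz", "lb"}

# Leading amount: mixed number ("1 1/2"), fraction ("1/2"), or decimal/integer.
AMOUNT_RE = re.compile(r"^\s*(\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*(.*)$")


def parse_ingredient(line):
    """Split an ingredient line into ingredient, amount, and unit."""
    amount, rest = "NULL", line.strip()
    match = AMOUNT_RE.match(line)
    if match:
        amount, rest = match.group(1), match.group(2)

    unit = "NULL"
    parts = rest.split(maxsplit=1)
    if parts and parts[0].lower().rstrip(".") in KNOWN_UNITS:
        unit = parts[0].lower().rstrip(".")
        rest = parts[1] if len(parts) > 1 else ""

    return {"ingredient": rest.strip() or "NULL", "amount": amount, "unit": unit}


print(parse_ingredient("1 1/2 cups flour"))
print(parse_ingredient("2 apples"))
```

Fields that cannot be determined fall back to `"NULL"`, matching the placeholder used in the output data format.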

[^1]: Hashed strings can be replaced with cleartext by running the scraper again.
[^2]: The characters ",.*®©™()[]" were stripped and all words were transformed to lowercase. About 10k words still appear only once.

mscholl96 commented 2 years ago

Recipe1M

https://github.com/torralba-lab/im2recipe-Pytorch

Structure of data

image

| Property | Amount |
| --- | --- |
| Recipes | 1029720 |
| Number of ingredients[^1] | 9605394 |
| Different unique words in ingredients | 91803 |
| Numeric expressions in ingredients[^2] | 10582774 |
| Number of instructions[^1] | 10767598 |
| Different unique words in instructions | 254082 |
| Numeric expressions in instructions[^2] | 4187244 |

[^1]: Each entry is a string that can consist of several words; see the PlantUML diagram.
[^2]: Expressions describing amounts, such as "1 apple" or "1/2 apple". These numeric values are kept in one separate category.
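
Counting numeric expressions such as "1" or "1/2" could be done with a small regex. This is a sketch only: the pattern is an assumption about what counts as a numeric expression (integers, decimals, simple fractions), so a mixed number like "1 1/2" is counted as two expressions here. `find_numeric` and `count_numeric` are hypothetical names.

```python
import re

# Integer, optionally followed by a decimal or fraction part: 1, 2.5, 1/2
NUMERIC_RE = re.compile(r"\d+(?:[./]\d+)?")


def find_numeric(text):
    """Return all numeric expressions found in one string."""
    return NUMERIC_RE.findall(text)


def count_numeric(lines):
    """Count numeric expressions across an iterable of strings."""
    return sum(len(find_numeric(line)) for line in lines)


print(find_numeric("1/2 apple"))                        # ['1/2']
print(count_numeric(["1 apple", "1/2 apple", "salt"]))  # 2
```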

Words in ingredients

With numeric expressions

image

Without numeric expressions

image

Words in instructions

With numeric expressions

image

Without numeric expressions

image

WagnerMarcel commented 2 years ago

Analysis of duplications by title

| Property | Amount |
| --- | --- |
| Titles in recipeBox (total) | 124595 |
| Titles in recipeBox (unique) | 114281 |
| Titles in recipe1M (total) | 1029720 |
| Titles in recipe1M (unique) | 809994 |
| Titles in openRecipes (total) | 173278 |
| Titles in openRecipes (unique) | 151025 |
| Duplicated titles recipeBox / recipe1M | 74995 |
| Duplicated titles openRecipes / recipe1M | 102108 |
| Duplicated titles recipeBox / openRecipes | 33102 |

comparison
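
A title-based duplication check like the one summarized above can be expressed as set intersections. This is a sketch under assumptions: the thread does not say how titles were normalized, so trimming whitespace and lowercasing are guesses, and `unique_titles` and `duplicated_titles` are hypothetical names.

```python
def unique_titles(titles):
    """Normalize titles (trim, lowercase) and drop empty ones."""
    return {t.strip().lower() for t in titles if t and t.strip()}


def duplicated_titles(titles_a, titles_b):
    """Return the normalized titles that occur in both datasets."""
    return unique_titles(titles_a) & unique_titles(titles_b)


recipe_box = ["Apple Pie", "apple pie ", "Chili"]
recipe_1m = ["CHILI", "Soup", ""]

print(len(recipe_box), len(unique_titles(recipe_box)))  # total vs. unique
print(duplicated_titles(recipe_box, recipe_1m))
```

Matching on titles alone is only a rough indicator: two different recipes can share a title, and the same recipe can appear under slightly different titles.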

WagnerMarcel commented 2 years ago

OpenRecipes

Structure of data

openRecipes

| Property | Amount |
| --- | --- |
| Recipes | 173278 |
| Number of ingredients | 1693953 |
| Different unique words in ingredients | 27454 |
| Numeric expressions in ingredients | 2005536 |
| Number of instructions | 0 |
| Different unique words in instructions | 0 |
| Numeric expressions in instructions | 0 |

Words in ingredients

With numeric expressions

ingredientsWithNumericalExpressions

Without numeric expressions

ingredientsWithoutNumericalExpressions

WagnerMarcel commented 2 years ago

NowYoureCooking

Structure of data[^1]

nyc

| Property | Amount |
| --- | --- |
| Recipes | 28287 |
| Number of ingredients | 284307 |
| Different unique words in ingredients | 77136 |
| Numeric expressions in ingredients | 236035 |
| Number of instructions | 232211 |
| Different unique words in instructions | 24442 |
| Numeric expressions in instructions | 88687 |

Words in ingredients

With numeric expressions

ingredientsWithNumericalExpressions

Without numeric expressions

ingredientsWithoutNumericalExpressions

Words in instructions

With numeric expressions

instructionsWithNumericalExpressions

Without numeric expressions

instructionsWithoutNumericalExpressions

[^1]: The recipes are provided in a custom data format; the structure shown is an approximation in JSON.