reaper47 / recipya

A clean, simple and powerful recipe manager your whole family will enjoy.
GNU General Public License v3.0
147 stars 10 forks

Extract nutritional data from scraped websites #375

Open mblennegard opened 4 days ago

mblennegard commented 4 days ago

Is your feature request related to a problem? Please describe.
Many websites today already have nutritional data present as part of the recipe. Instead of trying to calculate this using generic ingredients within Recipya, it would be better to extract this information directly from the recipe as is.

Describe the solution you'd like
If nutritional information is part of the recipe, try to extract it. If not available, then use the current way of calculating the nutritional data.

For websites requiring custom scrapers this will of course be on a per-website basis, but since nutritional information is part of the LD+JSON schema, it should be possible to solve this automatically for a large number of websites by adding the nutritional extraction to the LD+JSON part of the scraper. Additionally, this would also solve cases where the automatic nutritional calculation fails because the recipe is in a language other than English.
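A minimal sketch of the extract-then-fall-back idea described above. Every name here (`recipeLD`, `nutritionFromLD`, `resolveNutrition`, the `calculate` callback) is hypothetical and not taken from Recipya's code. The sketch also assumes the LD+JSON blob is a single Recipe object rather than an array or an `@graph` wrapper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// recipeLD is a hypothetical, stripped-down view of an LD+JSON Recipe object.
// Only the nutrition property is decoded.
type recipeLD struct {
	Nutrition map[string]any `json:"nutrition"`
}

// nutritionFromLD returns the nutrition object found in an LD+JSON blob.
// The boolean is false when the blob is invalid or carries no nutrition data.
func nutritionFromLD(blob []byte) (map[string]any, bool) {
	var r recipeLD
	if err := json.Unmarshal(blob, &r); err != nil || len(r.Nutrition) == 0 {
		return nil, false
	}
	return r.Nutrition, true
}

// resolveNutrition prefers the scraped data and calls calculate only
// when nothing could be extracted.
func resolveNutrition(blob []byte, calculate func() map[string]any) map[string]any {
	if n, ok := nutritionFromLD(blob); ok {
		return n
	}
	return calculate()
}

func main() {
	// Stand-in for the existing ingredient-based calculation.
	calculate := func() map[string]any {
		return map[string]any{"calories": "calculated"}
	}

	withData := []byte(`{"@type": "Recipe", "nutrition": {"calories": 1109}}`)
	withoutData := []byte(`{"@type": "Recipe"}`)

	fmt.Println(resolveNutrition(withData, calculate))
	fmt.Println(resolveNutrition(withoutData, calculate))
}
```

Because the calculation is only invoked as a fallback, a recipe written in a language the calculator does not understand is unaffected as long as the site publishes its own nutrition object.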

reaper47 commented 4 days ago

The nutritional information is currently extracted from the LD+JSON when available: If it is not available, then this function will execute in the background:

Which website did you fetch that calculated the nutrition instead of extracting it?

mblennegard commented 4 days ago

@reaper47 I was a bit too quick: I just noticed while browsing the scraper.go file that nutrition is already part of the scraper... but you were even quicker to respond here. 😋

I tried with the following recipe:

It seems to be following the LD+JSON schema regarding the nutrition as well, at least as far as I can tell.

{
    "@context": "",
    "@type": "Recipe",
    "name": "Kladdkaka med hasselnötter och brynt smör",
    "image": "",
    "author": {
        "@type": "Person",
        "name": "Skippa Sockret"
    },
    "description": "Kladdkaka med hasselnötter och brynt smör",
    "totalTime": "25 min ",
    "keywords": "bakmix kladdkaka, fika, dessert",
    "recipeCategory": "Kladdkakor",
    "recipeIngredient": [
        "4.25 dl bakmix kladdkaka ",
        "2 dl valfri mjölk",
        "2 msk olja",
        "50 g smör ",
        "1 dl rostade hasselnötter (eller efter smak)"
    ],
    "recipeInstructions": [
        {
            "@type": "HowToStep",
            "text": "Sätt ugnen på 150 grader. "
        },
        {
            "@type": "HowToStep",
            "text": "Mät upp kladdkakemixen och blanda ihop med mjölk och olja med hjälp av en slickepott. "
        },
        {
            "@type": "HowToStep",
            "text": "Bryn smöret i en kastrull tills du får en nötig karaktär. "
        },
        {
            "@type": "HowToStep",
            "text": "Grovhacka hasselnötterna. "
        },
        {
            "@type": "HowToStep",
            "text": "Tillsätt nu de brynta smöret och hasselnötterna i smeten, blanda runt. "
        },
        {
            "@type": "HowToStep",
            "text": "Smöra eller olja en rund springform och täck med lite kokos eller ströbröd alt. använd ett bakplåtspapper. Häll i smeten och grädda i ugnen cirka 15 minuter. "
        },
        {
            "@type": "HowToStep",
            "text": "Ta ut och låt svalna, låt gärna kladdkakan stå i kylen ett par timmar för godast resultat. Servera sedan med en riktigt god vaniljglass eller en klick grädde. "
        }
    ],
    "nutrition": {
        "@type": "NutritionInformation",
        "servingSize": 8,
        "calories": 1109,
        "fatContent": 90,
        "carbohydrateContent": 77,
        "proteinContent": 42
    }
}

reaper47 commented 4 days ago

Something is off because the nutrition is indeed there. I'll check it out.

mblennegard commented 3 days ago

@reaper47 I debugged this issue. The root cause is that this particular website stores the nutritional information as bare numeric values, whereas the UnmarshalJSON for the NutritionSchema expects string values only. As a result, the mapping of the nutrition fields is essentially skipped.

I tested this (rather crudely) for one of the properties with the change below, assuming that the nutrition function inside Recipya expects string values. This change then populated the property correctly in the final imported recipe.

// JSON numbers decode to float64 when unmarshalled into an interface value.
if val, ok := x["carbohydrateContent"].(float64); ok {
    n.Carbohydrates = strconv.FormatFloat(val, 'f', -1, 64)
}

Perhaps the UnmarshalJSON function for the NutritionSchema could check whether the source data is a string, float or integer and convert the values accordingly, to accommodate different implementations of the LD+JSON schema?

reaper47 commented 3 days ago

Excellent, thank you for looking into it! That is exactly it. We shall add a test and modify the UnmarshalJSON function you linked to cover nutrition fields that use numerical values.

mblennegard commented 2 days ago

@reaper47 I have it handling both string and numeric values on my end now. One interesting wrinkle: when a site supplies only numbers, the unit (grams, milligrams, etc.) is missing as well.

As far as I know, nutritional information is always in metric, even for American recipe sites. Have you seen anything else during your investigations?

If they are indeed always in metric, then we can assign a static unit to each property (e.g. calories in kcal; fat, sugar and protein in grams; sodium in milligrams) and use it whenever the nutritional information consists of bare numbers.

Edit: At least the recipe schema specifies metric units, so I think I can assume metric if setting static units for each property. Do you agree?
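A rough sketch of the static-unit idea, not the code from the pull request. The `defaultUnits` table and the `withUnit` helper are hypothetical names. The units for calories, fat, sugar, protein and sodium follow the examples given above; grams for carbohydrates is an added assumption based on how schema.org describes `carbohydrateContent`.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultUnits maps NutritionInformation properties to the unit assumed
// when a site supplies a bare number. The table is illustrative and
// incomplete; the carbohydrateContent entry is an assumption.
var defaultUnits = map[string]string{
	"calories":            "kcal",
	"fatContent":          "g",
	"sugarContent":        "g",
	"proteinContent":      "g",
	"carbohydrateContent": "g",
	"sodiumContent":       "mg",
}

// withUnit formats a decoded JSON value for the given property.
// Strings are assumed to carry their own unit and pass through unchanged;
// numbers get the default unit for the property, if one is known.
func withUnit(property string, v any) string {
	switch t := v.(type) {
	case string:
		return t
	case float64:
		s := strconv.FormatFloat(t, 'f', -1, 64)
		if unit, ok := defaultUnits[property]; ok {
			return s + " " + unit
		}
		return s
	default:
		return ""
	}
}

func main() {
	fmt.Println(withUnit("fatContent", 90.0))          // bare number: unit appended
	fmt.Println(withUnit("sodiumContent", "120 mg"))   // string: left as is
	fmt.Println(withUnit("calories", 1109.0))          // bare number: unit appended
}
```

Keeping the units in one table means a property with no agreed default simply keeps its bare number instead of receiving a guessed unit.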

reaper47 commented 2 days ago

Yes, nutrition is always in the metric system. I have yet to see a product in a grocery store in the US whose nutrition facts are not metric. We can safely assume the units you mentioned when not specified.

mblennegard commented 1 day ago

Implemented in pull request