hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.68k stars 520 forks

Grouped instructions, APIs, and open schema (Gousto in v5) #1225

Open tboby opened 3 weeks ago

tboby commented 3 weeks ago

Gousto is in some ways an optimal source for recipes. For each recipe (e.g. https://www.gousto.co.uk/cookbook/vegetarian-recipes/3-cheese-veg-packed-pasta-bake) they have a public API which provides the JSON data used to build the recipe page. This includes the useful grouping of tiny steps into stages.
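For illustration, grouped stages like these could be expressed with the schema.org HowToSection and HowToStep types. This is only a minimal sketch: the `stages` array, its field names, and the step texts are hypothetical and do not reflect the actual Gousto API response, and whether recipe-scrapers or Tandoor preserve such grouping is not established here.

```javascript
// Hypothetical input: stage titles and step texts are invented for illustration.
const stages = [
    { title: "Prepare the veg", steps: ["Chop the onion", "Grate the courgette"] },
    { title: "Bake", steps: ["Top with cheese", "Bake until golden"] },
];

// Map each stage to a HowToSection whose itemListElement holds HowToStep items.
function toRecipeInstructions(stages) {
    return stages.map(stage => ({
        "@type": "HowToSection",
        name: stage.title,
        itemListElement: stage.steps.map(text => ({ "@type": "HowToStep", text })),
    }));
}

// Minimal JSON-LD Recipe object carrying the grouped instructions.
const jsonLd = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    name: "Example recipe",
    recipeInstructions: toRecipeInstructions(stages),
};

console.log(JSON.stringify(jsonLd, null, 2));
```

Each step stays a separate object in this form, so no separator character is needed between steps.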

In v5 the ability to make requests in a scraper was removed, and with it the Gousto scraper. As I couldn't see a way forward using a custom scraper in this repo, I decided to quickly put together a script to generate an open schema document that could be fed to recipe-scrapers instead.

const normalizeString = str => str ? str.trim() : '';
const getMinutes = timeStr => {
    const timeParts = timeStr.match(/(\d+)/);
    return timeParts ? parseInt(timeParts[0], 10) : 0;
};
const getYields = yieldStr => yieldStr ? yieldStr.trim() : '';

function generateRecipeHTML(recipeData) {
    const title = normalizeString(recipeData.title);
    const description = normalizeString(recipeData.description);
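    // Sort numerically: the default sort() compares values as strings (so 100 would sort before 45).
    const totalTime = Object.values(recipeData.prep_times).sort((a, b) => a - b)[0];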
    const yields = getYields(Object.keys(recipeData.prep_times).sort()[0]);
    const image = recipeData.media.images.reduce((max, img) => img.width > max.width ? img : max, recipeData.media.images[0]).image;

    const ingredients = recipeData.ingredients
        .filter(ingredient => typeof ingredient === 'object' && ingredient.label)
        .map(ingredient => normalizeString(ingredient.label));

    const instructionsList = recipeData.cooking_instructions
        .filter(instruction => typeof instruction === 'object' && instruction.instruction)
        .map(instruction => normalizeString(instruction.instruction));

    const instructions = instructionsList.map(item => item.replaceAll("<p>","").replaceAll("</p>",""));
    const ratings = recipeData.rating.average;

    return `
        <div itemscope itemtype="http://schema.org/Recipe">
            <h1 itemprop="name">${title}</h1>
            <p itemprop="description">${description}</p>
            <div>
                <span>Prep time: <time itemprop="totalTime" datetime="PT${totalTime}M">${totalTime} minutes</time></span>
                <span>Yields: <span itemprop="recipeYield">${yields}</span></span>
            </div>
            <img itemprop="image" src="${image}" alt="${title}" />
            <h2>Ingredients</h2>
            <ul>
                ${ingredients.map(ingredient => `<li itemprop="recipeIngredient">${ingredient}</li>`).join('')}
            </ul>
            <h2>Instructions</h2>
            <ol itemprop="recipeInstructions">
                ${instructions.map(step => `<li>${step}</li>`).join('')}
            </ol>
            <div>
                <span>Rating: <span itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
                    <span itemprop="ratingValue">${ratings}</span> out of 5
                </span></span>
            </div>
            <p itemprop="keywords">www.gousto.co.uk</p>
            <p itemprop="recipeCuisine">${recipeData.cuisine.title}</p>
        </div>
    `;
}

This works fine, except for one major issue: the open schema scraper concatenates all instruction steps together with "\n" by default.

The downstream app I'm using (Tandoor) provides a handy "split instructions by newline" feature to help parse recipes that arrive as a single block of text. However, because the instructions are joined, it can't tell the difference between newlines in the original steps (which I'd like to keep) and newlines inserted by the open schema scraper.
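To make the ambiguity concrete, here is a minimal sketch with made-up steps (not real Gousto data). It assumes the joining behaves like a plain join on "\n", as described above, and that the downstream split is a plain split on "\n".

```javascript
// Hypothetical steps; the second one contains an intentional newline.
const steps = [
    "Boil the pasta",
    "Make the sauce:\nmelt the butter, then stir in the flour",
    "Bake for 20 minutes",
];

// Joining with "\n" mirrors the behaviour described above.
const joined = steps.join("\n");

// A downstream "split by newline" cannot tell the two kinds of newline apart.
const resplit = joined.split("\n");

console.log(steps.length);   // 3
console.log(resplit.length); // 4
```

The second step comes back as two separate steps, so the original step boundaries are lost once the text has been joined.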

Is openschema intended to be a universal input? Otherwise, is this a downstream issue, and should I be trying to see if Tandoor would accept recipes in JSON (avoiding the use of recipe-scrapers) instead of HTML?

jayaddison commented 3 weeks ago

Is openschema intended to be a universal input? Otherwise, is this a downstream issue, and should I be trying to see if Tandoor would accept recipes in JSON (avoiding the use of recipe-scrapers) instead of HTML?

Hi @tboby - thanks for the discussion.

You're probably aware of some/all of this, but as a recap/background:

We have three independent projects in this case, each with their own slightly different perspectives.

That's important background to answer your question:

Is openschema intended to be a universal input? Otherwise, is this a downstream issue, and should I be trying to see if Tandoor would accept recipes in JSON (avoiding the use of recipe-scrapers) instead of HTML?

I can't really provide a useful answer about the ambitions of schema.org, but it certainly meets the requirements for recipe-scrapers, and that simplifies our code a lot. In some ways it'd be nice if we also had parallel HTML retrieval for the recipe websites we support, because then we could verify that what is claimed in the schema.org metadata matches what appears to users, but that's not something we support today.

I think it's certainly valid to ask Tandoor whether they would consider retrieving from schema.org JSON directly. If I were them, on the one hand I would be quite comfortable offloading the support/accuracy checks about recipes to a separate component (e.g. recipe-scrapers). On the other hand, there is certainly a performance and dependency-management angle to consider (fewer dependencies is good). And, maybe at a lower priority, there could be a vendor lock-in concern: schema.org is good, but the ability to support sites that don't use it, for whatever reason, is perhaps also beneficial.