hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License

Gousto.co.uk line breaks not recognized #756

Closed: robmc-itpro closed this issue 9 months ago

robmc-itpro commented 1 year ago

The URL of the recipe(s) that are not being scraped correctly

...

The results you expect to see

The scraper doesn't seem to recognize the line breaks in the instructions. In Tandoor and Mealie, this means each recipe needs some manual work to fix the line breaks. I'm not sure how fixable this is, as I have no Python or scraping experience.

Step 1, for example, looks like the following on the original web page:

Boil a kettle

Slice the waxy potatoes (skins on) into 1-2cm thick discs

Add the vegetable stock mix to a large pot of boiled water and bring to a boil over a high heat

The results (including any Python error messages) that you are seeing

scraper.instructions_list()
[
    'Boil a kettleSlice the waxy potatoes (skins on) into 1-2cm thick discsAdd the vegetable stock mix to a large pot of boiled water and bring to a boil over a high heat',
    'Add the potato discs to the pot and cook for an initial 4 min',
    'While the potatoes are cooking, chop most of the basil roughly, including the stalks (save some leaves for garnish!)Peel and chop the garlic roughlyTrim the green beans, then chop them in half',
    'Add the tortiglioni to the pot and cook for a further 8 min',
    "Whilst the potatoes & tortiglioni are cooking, add the chopped basil and garlic (don't like raw garlic? Go easy!) to a food processorAdd the cashew nuts, half the flaked almonds (save the rest for garnish!), 70ml [140ml] olive oil, 1/2 tsp [1 tsp] salt and a very generous grind of black pepperBlitz until very smooth – this is your basil pesto",
    'Add the green beans to the potatoes & tortiglioni and cook for a further 5 min or until everything is cookedOnce cooked, drain the potatoes, green beans & tortiglioniReturn everything to the pot',
    'Add the basil pesto to the pot and give everything a good mix up – this is your pasta alla genovese',
    'Serve the pasta alla genovese in bowls and garnish with the reserved basil leaves and flaked almondsDrizzle with some olive oil and season with a grind of black pepperEnjoy!',
]

jayaddison commented 1 year ago

Ah.. ok, I see what's going on here.

Basically: the instruction data from Gousto that we're calling normalize_string on may contain HTML, such as paragraph elements (<p>) in this case.

A quickish fix would probably be to look for paragraph elements specifically within the instructions and add newlines between paragraphs.
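
For illustration, here is a minimal sketch of the "newlines between paragraphs" idea. It uses only Python's standard-library `html.parser` and is not the project's actual code. The names `ParagraphSplitter` and `paragraphs_to_lines` are hypothetical, and the sample HTML is an assumption about what the Gousto instruction data might look like, based on step 1 as quoted in the report.

```python
from html.parser import HTMLParser


class ParagraphSplitter(HTMLParser):
    """Hypothetical helper: collects the text of each <p> element separately."""

    def __init__(self):
        super().__init__()
        self.chunks = []    # finished pieces of text, one per paragraph
        self._current = []  # raw text fragments of the piece being built

    def flush(self):
        # Collapse runs of whitespace and keep the piece only if non-empty.
        text = " ".join("".join(self._current).split())
        if text:
            self.chunks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        self._current.append(data)


def paragraphs_to_lines(instruction_html):
    """Return the instruction text with one line per <p> element."""
    parser = ParagraphSplitter()
    parser.feed(instruction_html)
    parser.close()  # hand any buffered text to handle_data
    parser.flush()  # keep trailing text that was not inside a <p>
    return "\n".join(parser.chunks)


# Assumed shape of the raw data for step 1 (not taken from the real site).
step_1 = (
    "<p>Boil a kettle</p>"
    "<p>Slice the waxy potatoes (skins on) into 1-2cm thick discs</p>"
)
print(paragraphs_to_lines(step_1))
# Boil a kettle
# Slice the waxy potatoes (skins on) into 1-2cm thick discs
```

Instructions that contain no paragraph elements pass through unchanged apart from whitespace collapsing, so single-sentence steps would keep their current output.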