hhursev / recipe-scrapers

Python package for scraping recipe data
MIT License

Sally's Baking Addiction - take the last image in the list rather than the first #1034

Open · krisnoble opened this pull request 3 months ago

krisnoble commented 3 months ago

Sally's Baking Addiction provides a list of images in different sizes. The default behaviour is to take the first image from the list, but in this case that's a 225x225 thumbnail, with the original image at the end of the list.
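
For illustration, the parsed schema.org value looks something like this (a hypothetical example with made-up URLs; the real lists vary per recipe):

    image = [
        "https://sallysbakingaddiction.com/.../photo-225x225.jpg",  # thumbnail, selected by default
        "https://sallysbakingaddiction.com/.../photo-500x500.jpg",
        "https://sallysbakingaddiction.com/.../photo.jpg",  # full-size original, last in the list
    ]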

This change makes the scraper take the last element in the list instead. Every recipe I've checked (from old to new) has the same list structure, but I've preserved the original checks from SchemaOrg just in case. Hopefully that's the right approach, but I'm happy to amend if not.

jayaddison commented 3 months ago

Hi @krisnoble - thanks for the pull request!

The fix itself, selecting the last image instead of the first, sounds good. I'm guessing that you explored ways to avoid duplicating some of the logic here? (it doesn't look straightforward to me, given the relationship between the SchemaOrg and AbstractScraper classes)

krisnoble commented 3 months ago

> I'm guessing that you explored ways to avoid duplicating some of the logic here? (it doesn't look straightforward to me, given the relationship between the SchemaOrg and AbstractScraper classes)

In all honesty, I'm still learning the ropes with this type of stuff, so I'm totally open to suggestions. I did initially have a version that just assumed a list and returned the last item, which worked fine in all my tests, but I didn't want to risk breaking edge cases, so I went with a more defensive approach. Like you say, though, it does duplicate the rest of the logic.

Looking at the website source, it also provides an ImageObject schema, so we could try that first and fall back to the default if it isn't present. That would remove the duplication and preserve the default behaviour in case of an issue? I just tested the following, which passes the tests, although I suspect there may be a more elegant way to do it:

    def image(self):
        # assumes `import json` at the top of the module
        try:
            # Yoast embeds a schema.org graph in a <script class="yoast-schema-graph"> tag
            yoast_data = json.loads(
                self.soup.find("script", class_="yoast-schema-graph").get_text()
            )["@graph"]
            for elem in yoast_data:
                if elem["@type"] == "ImageObject":
                    # https://developer.yoast.com/features/schema/pieces/image/
                    return elem["contentUrl"]
        except (AttributeError, KeyError, TypeError, ValueError):
            pass
        # no ImageObject found, or something went wrong, so fall back to the default
        return self.schema.image()
jayaddison commented 3 months ago

Apologies for the slow response here @krisnoble. I won't be able to reply fully until next week, but I will provide a more complete review then.

jayaddison commented 3 months ago

OK, having taken a bit more time to consider this: I think what I'd recommend is based on this finding from the pull request description:

> This change makes the scraper take the last element in the list instead. Every recipe I've checked (from old to new) has the same list structure, but I've preserved the original checks from SchemaOrg just in case. Hopefully that's the right approach, but I'm happy to amend if not.

I don't think that we want to duplicate all of the SchemaOrg logic in this scraper (or any other). It is the default, and it is used in many cases, but when we know of more-precise conditions on a per-website basis (as we do here), then we should generally implement (in fact, override) only what we need for that website. In this case, I think we should retrieve self.data.get("image"), expect/assert that the retrieved value is a list, and then return the last element from that list.
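
For concreteness, a minimal sketch of that override (assuming the parsed schema.org dictionary is reachable as self.data, as written above; in the scraper class it may live at self.schema.data instead):

    def image(self):
        # Sally's Baking Addiction publishes image URLs as a list, with the
        # 225x225 thumbnail first and the full-size original last.
        images = self.data.get("image")
        assert isinstance(images, list), "expected a list of image URLs"
        return images[-1]

Being strict with the assert means that any unexpected structure fails loudly rather than silently returning a thumbnail.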

That should reduce the amount of code considerably, and also means that the question about whether to retain the surrounding comments from the original code is not relevant (because it's a distinct implementation, and isn't attempting to retain either the same behaviour or code presentation).

The risk is that we overfit the implementation (to borrow machine learning terminology) and it breaks for some recipes that we haven't considered yet. Ideally we'd evaluate the scraper against a large number of sample recipes from an archive and confirm somehow that the error rate is low and that useful images are retrieved, but we don't have the infrastructure to do that currently.