hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.62k stars 508 forks source link

Adds equipment as possible output #982

Closed jknndy closed 5 months ago

jknndy commented 6 months ago

Inspired by #958 this adds support for equipment as a possible output and adds coverage to 17 sites. I'm sure more sites can also provide this information.

I did not add 'equipment' to the generated test cases when making a new scraper as its rare that its available. I plan to update the documentation sometime this week!

jayaddison commented 6 months ago

(also: I think this will be a great feature to provide!)

jknndy commented 6 months ago

This could be nitpicky, but I'd prefer if the equipment lists were not sorted, but remain in the order that they were listed in the recipe (with duplicates removed, though).

for some reason some of the tests were failing without some sort of sorting feature. The tests were returning in a different order each time. forktospoon being an example, what do you think of using a format like this where it ensures the items retain the original order while still checking for duplicates?

Sites that have sorting ``` recipe_scrapers/amazingribs.py recipe_scrapers/bakingsense.py recipe_scrapers/bowlofdelicious.py recipe_scrapers/budgetbytes.py recipe_scrapers/downshiftology.py recipe_scrapers/elavegan.py recipe_scrapers/herseyland.py recipe_scrapers/joyfoodsunshine.py recipe_scrapers/momswithcrockpots.py ```
    def equipment(self):
        seen = set()
        return [
            equip.get_text()
            for equip in self.soup.find_all("div", class_="wprm-recipe-equipment-name")
            if equip.get_text()
            and (equip.get_text() not in seen and not seen.add(equip.get_text()))
        ]
jayaddison commented 6 months ago

what do you think of using a format like this where it ensures the items retain the original order while still checking for duplicates?

I'm not so keen on that approach, for two reasons:

I think this could be a situation where creating a utility/helper method would get us to the neatest result.

jknndy commented 5 months ago

I plan to take a look at this tomorrow, should have a fix before the weekend!

jknndy commented 5 months ago

Done! Been on the road for work a lot of the month so took a while to get to it. Added a utility all the clean up and implemented it in all the sites i listed last time.

Let me know if this is what you had in mind

Also after this is approved and merged I'll look at contributing to #958 to move it over to this approach for equipment.

jayaddison commented 5 months ago

Looks good, thanks! Do you think we should add some test coverage for get_equipment to tests/library/test_utils.py, or OK to rely on the expected recipe JSON outputs?

jknndy commented 5 months ago

I think it's fine to leave test coverage out for this one as the JSON results are pretty straight forward. If you disagree I can definitely add it if you'd like but the test coverage there covers more complex instances than this to me.