Canonical url intergration batch #4 - Githubissues

hhursev / recipe-scrapers

Python package for scraping recipes data

MIT License

1.68k stars 519 forks source link

Canonical url intergration batch #4 #921

Closed jknndy closed 10 months ago

jknndy commented 10 months ago

Didn't realize i had already started this one a few weeks ago!

Dropped support

thenutritiouskitchen- Site no longer exists

Adjusted scrapers

recipe_scrapers/southernliving.py - yields html fallback, pages that contain ranges of yields don't return via schema recipe_scrapers/sweetpeasandsaffron.py - Added get_minutes & get_yields util recipe_scrapers/zeitwochenmarkt.py - Ingredients tag changed

Notes

tests/test_waitrose.py - Returning a text encoding issue

- "Bake for 40–50 minutes or until golden.",
+ "Bake for 40â50 minutes or until golden.",

## current fix
replace("â ", "â\x80\x93 ")

Thoughts on adding canonical_url to the list of required test coverage?

jayaddison commented 10 months ago

Thoughts on adding canonical_url to the list of required test coverage?

It's a good idea. Could the generate.py script attempt to add it automatically there too? (two problems: the HTML might not contain the canonical URL, and the URL provided to generate.py on the commandline might end up being redirected to a different URL.. but initially we could simply write that URL that the user provided)

jayaddison commented 10 months ago

Re: the encoding issue for Waitrose -- it looks like another case where the HTML you received from the site had some slightly unusual encoding. I've downloaded an updated copy with wget2 and your scraper seems to work fine with that; I'll add the updated HTML here soon.

jknndy commented 10 months ago

Could the generate.py script attempt to add it automatically there too?

Hadn't even thought of being able to add it automatically, I'll take a look at that once this is merged, I can also add some info to the developer guide to explain how to handle it with the unittest skip if it won't pass.

jayaddison commented 10 months ago

Could the generate.py script attempt to add it automatically there too?

Hadn't even thought of being able to add it automatically, I'll take a look at that once this is merged, I can also add some info to the developer guide to explain how to handle it with the unittest skip if it won't pass.

Sounds great, yep. The reason I thought about adding it automatically is to try to avoid the process of writing a new scraper becoming more difficult.

We could place # @unittest.skip("canonical_url is not available from this webpage") (commented-out) above the autogenerated test_canonical_url method -- it'd be easy to spot and suggest removing it during code review, or for anyone who has problems with gathering the canonical_url they could uncomment it.

jayaddison commented 10 months ago

Is that all of them @jknndy? :)

jknndy commented 10 months ago

Is that all of them @jknndy? :)

I'd say 95% there are a few that i skipped throughout the process that were taking up to much time to complete or required some work for encoding issues. I'll take a shot at them this week and will try the wget2 for my encoding issues!

Then it's back to ingredient_grouping additions!

jayaddison commented 10 months ago

Brilliant. Nice work!