Website scraper - www.thecookingguy.com

Warren73 commented 8 months ago

www.thecookingguy.com

https://www.thecookingguy.com/cookbook/2023/11/26/short-rib-mac-amp-cheese https://www.thecookingguy.com/cookbook/2023/11/6/buffalo-chicken-penne

rmdluo commented 4 months ago

Hi, a few friends and I are trying to work on our first open source contribution for a CMU software dev class. We would like to work on this issue and get assigned to it, if that's alright. It looks like some of the fields can be scraped with wild mode, but others will need some extra work (yields, ingredients, instructions, and description). Something we are concerned about is that the recipes from thecookingguy.com don't include a rating, cuisine, category, or total time. How should we go about handling these cases? Would returning empty strings for these suffice? Thanks!

jknndy commented 4 months ago

Hi @rmdluo, Thanks for your interest in contributing to the library. I'd be happy to assign this issue to you.

> It looks like some of the fields can be scraped with wild mode, but others will need some extra work (yields, ingredients, instructions, and description).

Having just taken a quick look at the recipe listed in this issue, it appears the only covered fields are ingredients, instructions, title, author & description. Ingredients and instructions will require some custom scraping logic (see here for our related docs), while the remaining fields should be covered by our schema-based retrieval (see here for our implementation and here for general information about the Recipe schema).
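For anyone following along, here is a rough sketch of what schema-based retrieval means in general: reading the schema.org Recipe object that a page embeds as JSON-LD. It is not the library's actual implementation. It uses only the standard library so it runs on its own, and `JsonLdCollector`, `find_recipe`, and the sample page are all made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_recipe(html):
    """Return the first schema.org Recipe object found in the page, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Recipe":
                return item
    return None


# Invented sample page; the real site's markup may look quite different.
SAMPLE_HTML = """
<html><head>
<title>Example Recipe</title>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Example Mac and Cheese",
 "author": {"@type": "Person", "name": "Example Author"},
 "description": "A made-up recipe used only for this sketch."}
</script>
</head><body></body></html>
"""

recipe = find_recipe(SAMPLE_HTML)
print(recipe["name"])
print(recipe.get("totalTime"))  # the sample has no totalTime, so this is None
```

Fields that the embedded schema lacks (here `totalTime`) simply come back as `None`, which is why ingredients and instructions would need site-specific logic if the page's schema omits them.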

> Something we are concerned about is that the recipes from thecookingguy.com don't include a rating, cuisine, category, or total time. How should we go about handling these cases? Would returning empty strings for these suffice?

For Mandatory fields, the scraper should return null by default; for Optional fields, you should omit the line entirely from the test JSON. A list of Mandatory vs. Optional keys can be found here.
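To make the rule above concrete, here is a minimal sketch of building the expected test JSON from whatever the scraper managed to extract. The key lists below are a guess for illustration only, not the project's actual Mandatory/Optional lists, and `build_expected` is a hypothetical helper rather than part of the library.

```python
import json

# Illustrative split only; check the project's docs for the real lists.
MANDATORY_KEYS = ["title", "ingredients", "instructions", "total_time"]
OPTIONAL_KEYS = ["ratings", "cuisine", "category"]


def build_expected(scraped):
    """Keep missing Mandatory keys as None (null); drop missing Optional keys."""
    expected = {}
    for key in MANDATORY_KEYS:
        expected[key] = scraped.get(key)  # absent value becomes None -> null
    for key in OPTIONAL_KEYS:
        if scraped.get(key) is not None:
            expected[key] = scraped[key]  # only written when the site has it
    return expected


# Made-up scraped values for a site with no total time, rating, or cuisine.
scraped = {
    "title": "Example Mac and Cheese",
    "ingredients": ["1 lb pasta", "2 cups cheese"],
    "instructions": "Boil pasta.\nAdd cheese.",
}

expected = build_expected(scraped)
expected_json = json.dumps(expected, indent=2)
print(expected_json)
```

In the printed JSON, `total_time` appears as `null`, while `ratings`, `cuisine`, and `category` are left out entirely.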

If you're also interested in implementing a full Schema site, I'd recommend checking out #1053.

For an example of a site that is covered largely by custom logic, check out the code for BodyBuilding.

Feel free to open up the PR as early in the development process as you'd like for input or ping here if you have any questions or need any clarification on best practices/etc.

Edit 1: Adding to the info about what to do for fields that are not covered: if a field is covered in the .py file but not on the site, you should also remove that section of code from the .py file.

rmdluo commented 4 months ago

@jknndy We're currently making good progress and have implemented yields and ingredients. We're also working on adding the testing data and were wondering whether we should include a "site_name" in it. The site is pretty clearly marked as "Sam The Cooking Guy" in the browser, but it isn't explicitly called this in the HTML. It is referenced in the page title, though ("Chili Cumin Lamb Recipe from Sam The Cooking Guy"), so we were wondering whether we should leave the test data as "site_name": null for now or use "site_name": "Sam The Cooking Guy".

Also, our ingredient_groups implementation is quite long. Would it be best practice to put it in a separate, smaller PR?
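For reference on the site_name question, if the title-derived value turns out to be the preferred option, one possible way to pull it out of the page title is sketched below. This assumes every recipe title follows the "... Recipe from Sam The Cooking Guy" pattern, which has only been observed on the page mentioned above; `site_name_from_title` is a hypothetical helper, not something the library provides.

```python
def site_name_from_title(page_title, separator=" from "):
    """Return the text after the last separator, or None if it is absent."""
    head, sep, tail = page_title.rpartition(separator)
    if not sep:
        return None  # title does not follow the assumed pattern
    return tail.strip() or None


title = "Chili Cumin Lamb Recipe from Sam The Cooking Guy"
print(site_name_from_title(title))
```

Splitting on the last " from " keeps recipe names that themselves contain "from" intact, and returning `None` when the pattern is missing lines up with leaving "site_name": null in the test data.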

hhursev / recipe-scrapers

Website scraper - www.thecookingguy.com #957