hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License

allrecipes ingredients scraper broken in 14.54.0 #1054

Closed vabene1111 closed 2 months ago

vabene1111 commented 3 months ago

Noticed this because one of my tests in Tandoor broke after updating.

Presumably this change https://github.com/hhursev/recipe-scrapers/compare/14.53.0...14.54.0#diff-45be87086299f18eed9eb7c07cc8cc35772d65d95bb0321732f0a9078253cce1 broke the import for this test file (it's a dumbed-down recipe imported a long time ago).

I can look into fixing this at some point, but I have no time right now, so I'm just writing it down. (Also not properly tested, but nothing besides the scrapers was changed, so either some API no longer behaves as before or it's the scraper.)

Example

```html
Test_Remove_
```

jknndy commented 2 months ago

Hi @vabene1111, I'm not able to replicate the error you're encountering. The change in the diff you listed adjusted the code to better handle fractions in the ingredient listing. Below are the tests I ran; let me know if you see a difference from what you're trying, or if you can provide any additional info. I think the issue could be that the test input you listed above doesn't match the output format the actual site provides.

test_url.py

```python
from recipe_scrapers import scrape_me

scraper = scrape_me('https://www.allrecipes.com/recipe/24010/easy-chicken-marsala/')

print("Host:", scraper.host())
print("Title:", scraper.title())
print("Total Time:", scraper.total_time())
print("Image URL:", scraper.image())
print("Ingredients:")
for ingredient in scraper.ingredients():
    print("-", ingredient)
```

output

```
(venv) python test_site.py
Host: allrecipes.com
Title: Easy Chicken Marsala
Total Time: 40
Image URL: https://www.allrecipes.com/thmb/uaZcOYmcaVoYjY8zT88-OXl9nPE=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/7758938-59ed755fd3884a8da27d0686e0d82b41.jpg
Ingredients:
- 3 tablespoons olive oil
- 4 (6 ounce) skinless, boneless chicken breast halves
- 1 cup sliced fresh mushrooms
- ¼ cup chopped green onion
- ⅓ cup Marsala wine
- salt and pepper to taste
- ⅓ cup heavy cream
- ⅛ cup milk
```

vabene1111 commented 2 months ago

OK, thanks for the feedback. I know the test worked for probably years, so something changed, but maybe the format on allrecipes changed as well over the years and my old one is simply no longer supported. I will fix the test and, if this is actually an edge case, provide a fix for recipe-scrapers or close this issue.

thanks for investigating!

jayaddison commented 2 months ago

Yep, there is a bug here somewhere. Scraping the test HTML using v14.53.0 produces the following `to_json` output (with some reformatting for comparison purposes):

```json
{
   "author" : null,
   "canonical_url" : "https://www.allrecipes.com/recipe/24010/easy-chicken-marsala/",
   "category" : "Test_Remove",
   "description" : "Test_Remove",
   "host" : "allrecipes.com",
   "image" : null,
   "ingredient_groups" : [
      {
         "ingredients" : [
            "1 Test_Remove Test_Remove",
            "1 Test_Remove Test_Remove"
         ],
         "purpose" : null
      }
   ],
   "ingredients" : [
      "1 Test_Remove Test_Remove",
      "1 Test_Remove Test_Remove"
   ],
   "instructions" : "Test_Remove",
   "instructions_list" : [
      "Test_Remove"
   ],
   "language" : "en",
   "nutrients" : {},
   "site_name" : null,
   "title" : "Test_Remove"
}
```

...and scraping it equivalently using v14.54.0 produces:

```json
{
   "author" : null,
   "canonical_url" : "https://www.allrecipes.com/recipe/24010/easy-chicken-marsala/",
   "category" : "Test_Remove",
   "description" : "Test_Remove",
   "host" : "allrecipes.com",
   "image" : null,
   "ingredient_groups" : [
      {
         "ingredients" : [],
         "purpose" : null
      }
   ],
   "ingredients" : [],
   "instructions" : "Test_Remove",
   "instructions_list" : [
      "Test_Remove"
   ],
   "language" : "en",
   "nutrients" : {},
   "site_name" : null,
   "title" : "Test_Remove"
}
```

It seems that the ingredients entries are missing for some reason.

jayaddison commented 2 months ago

Ah. Perhaps, as mentioned, it is no longer the format that the site provides, in which case the test data has become stale/outdated. Even so, I'll try to figure out the relevant change.

jayaddison commented 2 months ago

OK, I think that commit ac101d7fad8b6cebe78b9588662761f2661f1101 is the relevant change here: the AllRecipesCurated scraper was updated to use custom HTML retrieval logic for ingredients, replacing the existing schema.org ingredient retrieval.

Allowing schema.org metadata to be used as a fallback would let scraping the test HTML behave as it did before -- but the question is whether that's the correct thing to do. In other words: do relevant pages still provide the ingredients in that format, or should we be using the HTML-based querying? (And how would we figure that out?)

vabene1111 commented 2 months ago

That's a good question. There was likely a reason why it was switched to HTML, probably because, as with many sites, the schema is not properly maintained along with the UI. So HTML parsing does make sense, although even that changes often enough.

I guess there is no general answer to be found; we have to decide on a case-by-case basis. I think we had a similar discussion a while ago about a feature, I can't remember exactly which, I think it was the grouping. I would go the same route here: try the more desirable way (e.g. HTML scraping) and, if that comes up empty, try the schema scraping. Alternatively, one could do both and take whichever returns the longer array, but that could also be wrong, so probably the first option.
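The fallback strategy described here could be sketched like this. This is an illustrative toy, not the actual recipe-scrapers internals: the class and method names are made up, with plain lists standing in for the HTML and schema.org data sources a real scraper would query.

```python
# Sketch of the "try HTML first, fall back to schema.org" strategy.
# IngredientScraper and its data sources are hypothetical stand-ins.

class IngredientScraper:
    def __init__(self, html_results, schema_results):
        # Stand-ins for the two sources a real scraper would query:
        # custom HTML selectors vs. schema.org recipeIngredient metadata.
        self._html_results = html_results
        self._schema_results = schema_results

    def _ingredients_from_html(self):
        return list(self._html_results)

    def _ingredients_from_schema(self):
        return list(self._schema_results)

    def ingredients(self):
        # Prefer the HTML-based parsing.
        items = self._ingredients_from_html()
        if not items:
            # HTML query came up empty: fall back to schema.org data,
            # mirroring the pre-14.54.0 behaviour for pages like the
            # stale test HTML in this issue.
            items = self._ingredients_from_schema()
        return items

# A fresh-format page is served by the HTML parser; a stale page whose
# HTML query finds nothing falls back to schema.org metadata.
fresh = IngredientScraper(["3 tablespoons olive oil"], [])
stale = IngredientScraper([], ["1 Test_Remove Test_Remove"])
print(fresh.ingredients())  # ['3 tablespoons olive oil']
print(stale.ingredients())  # ['1 Test_Remove Test_Remove']
```

The "take the longer array" alternative would replace the `if not items:` check with a length comparison of both results; the explicit empty-check is simpler and avoids preferring a wrong-but-longer result.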

jayaddison commented 2 months ago

@vabene1111 I think it's OK to restore the schema-as-fallback behaviour, but we'd need at least one fresh HTML URL example to use as a test case (so that we know it's valid at time of (re)implementation).

vabene1111 commented 2 months ago

Ah, I see what you mean. I will look through my test cases to see whether any of the recipes I have (I think I used allrecipes several times) behaves like my made-up test.

vabene1111 commented 2 months ago

OK, so while updating to v15 we fixed the test. I also checked whether I could find any recipes on allrecipes that don't work with the new system, and could not, so we removed the old test.

I still feel that whenever HTML parsing is used on a recipe, the schema should be the fallback, but that's a design decision I would leave up to you, @jayaddison. Since there aren't really any test cases besides my old schema, there is no real reason to do this, so just close if you feel like it.

Also, thanks for your help investigating. I really want to get back to helping with this project, but I have to rewrite the entire frontend of Tandoor, so I'll be pretty busy for a while. I'll send people your way to contribute if I feel they are capable of helping :)

jayaddison commented 2 months ago

Thanks @vabene1111.

About the HTML/schema.org fallback behaviour you mention: are you suggesting that it should happen automatically, for all scrapers, whenever HTML navigation doesn't find results? I would not be so keen on that, because it's a kind of hidden behaviour that could be difficult to spot during development/review/debugging/etc. If you're suggesting that individual scrapers (like allrecipes) do that schema.org fallback as part of the scraper implementation itself, I think that's OK.

(not sure if I made the difference between those two options clear; let me know if I should try rephrasing that)

vabene1111 commented 2 months ago

Perfectly clear. I suggest the second option: if an existing parser is changed from schema to HTML parsing, the schema should stay as a fallback in case the HTML fails. Probably not really necessary, but it doesn't hurt or add any serious complexity. But as mentioned, totally not important.

jayaddison commented 2 months ago

OK, thank you @vabene1111. I think we can close this (approximately: problem not reproducible; current allrecipes HTML not found to cause this error), but feel free to re-open if that's mistaken and/or we find pages that are broken.