hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.61k stars 505 forks source link

[Add] Plötzblog Scraper #1062

Open vabene1111 opened 2 months ago

vabene1111 commented 2 months ago

A popular german bread baking recipe website with no schema but very nice recipes

https://www.ploetzblog.de/rezepte/rustikales-weizenmischbrot/id=61b843e6f51707630ab85911 https://www.ploetzblog.de/rezepte/hafer-dinkel-brot/id=618a284828ae7154616ab156 https://www.ploetzblog.de/rezepte/mildes-weizensauerteigbrot/id=619f68b528ae7154616ab768

jayaddison commented 2 months ago

cc @mlduff (re: #1063, mentioning you so that you appear in the possible assignees list for this bug)

jayaddison commented 2 months ago

Ok, that didn't work (makes sense, I suppose). Perhaps issues can only be assigned to people who have interacted with the thread themselves (not by others on their behalf).

mlduff commented 2 months ago

Hopefully now it works @jayaddison

mlduff commented 2 months ago

@jayaddison I have an issue with one of the test cases when implementing this site. The ingredients_groups test assumes that the groups are a split of the ingredients, however it seems to be falling over in this case because the groups use different amounts of the same ingredient (which adds up to the total in the overall ingredients).

Here is a snapshot of my unit test with the expected results:

"ingredients": [
    "558 g Weizenmehl 550",
    "389 g Wasser",
    "90 g Weizenanstellgut TA 200 (weich)",
    "13 g Salz"
  ],
  "ingredient_groups": [
    {
      "ingredients": [
        "90 g Wasser",
        "90 g Weizenmehl 550",
        "90 g Weizenanstellgut TA 200 (weich)"
      ],
      "purpose": "Weizensauerteig"
    },
    {
      "ingredients": [
        "13 g Salz",
        "298 g Wasser",
        "467 g Weizenmehl 550",
        "gesamter Weizensauerteig"
      ],
      "purpose": "Hauptteig"
    }
  ],

As far as I can tell, the options I have are:

  1. Get rid of the ingredient_groups implementation
  2. Make the overall ingredients return the separated version of the ingredients (by concatenating the ingredient_groups)
  3. Updating the test to account for this difference
  4. Exempt this scraper from that particular test

I read https://github.com/hhursev/recipe-scrapers/blob/main/docs/in-depth-guide-ingredient-groups.md, and in particular this excerpt:

The ingredients found in ingredients() and ingredient_groups() should be the same because we're presenting the same set of ingredients, just in a different way. There can sometimes be minor differences in the ingredients in the schema and the ingredients in the HTML which needs to be handled.

I think that in my case if we kept my current output as is, it would still be satisfying this requirement as it is still the same ingredients, just separated out. Therefore, I think the best solution is option 3, however I would like to get your input on if that is the best option/desired output, and also how I could tackle this.

Scraped page: https://www.ploetzblog.de/rezepte/mildes-weizensauerteigbrot/id=619f68b528ae7154616ab768

jayaddison commented 2 months ago

Interesting - thanks @mlduff! In a way I'm surprised that we haven't encountered this problem before already.

To check: is is this part of the test logic that is failing: https://github.com/hhursev/recipe-scrapers/blob/3273fab843ef289c53d948c7cb03a3c14abe1311/tests/__init__.py#L113-L117 ?

mlduff commented 2 months ago

That is correct @jayaddison

jayaddison commented 2 months ago

Ok, thanks @mlduff. I agree that option three (handling this case within test coverage) would be ideal. However, I can't initially figure out how we'd handle this in out tests. Are you suggesting trying to parse and compare the quantities in the ingredient lines (grouped vs total), or to override the test for this individual case - or pehaps something else?

mlduff commented 2 months ago

@jayaddison I'm not entirely sure - ideally we would be able to parse the quantities (or even if you could just separate the quantities out and check that the same ingredients are mentioned), however I'm not sure of a robust way of doing this without perhaps using an external library as a dev-dependency (e.g. ingredient-parser).

If this would be inappropriate/too hard, being able to mark in some way (is it possible to annotate or something like that?) that this particular test shouldn't be run on this particular file that would be good.

jayaddison commented 2 months ago

Mmm, ok. I'm continuing to think about it. If we had one Python class per scraper unit test, then perhaps what we would naturally do here would be to override the logic that checks for this ingredient-group-ingredient-list equivalence.

My sense is that something similar could be appropriate here. There's definitely possible value in adding a dev-dependency and potentially contributing improvements back there as a result of what we find; but there are also benefits in keeping developer dependencies lightweight.

mlduff commented 2 months ago

@jayaddison I could move this test to the legacy tests so I can override if you like? Otherwise could we look at doing a non-legacy similar approach?

I'm also happy to investigate trying to parse the ingredients with the dev-dependencies if you would like me to.

jayaddison commented 2 months ago

@mlduff please feel free to explore any possibilities; all of those options seem like they'd provide ways to learn more about the problem and possible ways to resolve it - including the related advantages and disadvantages of each.

At the moment I don't have any further recommendations, but I'll continue thinking about it. It's not always the case, but sometimes taking a week or two to consider a problem can help to develop different ways of thinking about it that ultimately help to figure it out properly -- or at least help you to decide what the best option is from those available.

mlduff commented 2 months ago

@jayaddison I have just submitted the PR for this - I went for option 4, and devised a way to exempt this scraper from that particular test. If I get time I'll further investigate ingredients-parser, however this seems to be like there still could be edge cases it falls over in (its a model that spits out confidences and the like), so I thought it probably isn't suitable for a unit test.

jayaddison commented 2 months ago

@jayaddison I'm not entirely sure - ideally we would be able to parse the quantities (or even if you could just separate the quantities out and check that the same ingredients are mentioned), however I'm not sure of a robust way of doing this without perhaps using an external library as a dev-dependency (e.g. ingredient-parser).

If this would be inappropriate/too hard, being able to mark in some way (is it possible to annotate or something like that?) that this particular test shouldn't be run on this particular file that would be good.

This does seem like a good approach, if possible. It would mean that we don't have to add an override option / subclass the test / provide alternative implementations per-scraper.

It seems there are at least two libraries available:

As you say, I do think that these should be dev-dependencies only for this use-case; let's prove the functionality and figure out a library that we like for test/internal purposes, and then perhaps in future we could provide parsed quantities on ingredients.