Open vabene1111 opened 7 months ago
cc @mlduff (re: #1063, mentioning you so that you appear in the possible assignees list for this bug)
Ok, that didn't work (makes sense, I suppose). Perhaps issues can only be assigned to people who have interacted with the thread themselves (not by others on their behalf).
Hopefully now it works @jayaddison
@jayaddison I have an issue with one of the test cases when implementing this site. The ingredients_groups
test assumes that the groups are a split of the ingredients, however it seems to be falling over in this case because the groups use different amounts of the same ingredient (which adds up to the total in the overall ingredients).
Here is a snapshot of my unit test with the expected results:
"ingredients": [
"558 g Weizenmehl 550",
"389 g Wasser",
"90 g Weizenanstellgut TA 200 (weich)",
"13 g Salz"
],
"ingredient_groups": [
{
"ingredients": [
"90 g Wasser",
"90 g Weizenmehl 550",
"90 g Weizenanstellgut TA 200 (weich)"
],
"purpose": "Weizensauerteig"
},
{
"ingredients": [
"13 g Salz",
"298 g Wasser",
"467 g Weizenmehl 550",
"gesamter Weizensauerteig"
],
"purpose": "Hauptteig"
}
],
As far as I can tell, the options I have are:
ingredient_groups
implementationingredients
return the separated version of the ingredients (by concatenating the ingredient_groups
)I read https://github.com/hhursev/recipe-scrapers/blob/main/docs/in-depth-guide-ingredient-groups.md, and in particular this excerpt:
The ingredients found in ingredients() and ingredient_groups() should be the same because we're presenting the same set of ingredients, just in a different way. There can sometimes be minor differences in the ingredients in the schema and the ingredients in the HTML which needs to be handled.
I think that in my case if we kept my current output as is, it would still be satisfying this requirement as it is still the same ingredients, just separated out. Therefore, I think the best solution is option 3, however I would like to get your input on if that is the best option/desired output, and also how I could tackle this.
Scraped page: https://www.ploetzblog.de/rezepte/mildes-weizensauerteigbrot/id=619f68b528ae7154616ab768
Interesting - thanks @mlduff! In a way I'm surprised that we haven't encountered this problem before already.
To check: is is this part of the test logic that is failing: https://github.com/hhursev/recipe-scrapers/blob/3273fab843ef289c53d948c7cb03a3c14abe1311/tests/__init__.py#L113-L117 ?
That is correct @jayaddison
Ok, thanks @mlduff. I agree that option three (handling this case within test coverage) would be ideal. However, I can't initially figure out how we'd handle this in out tests. Are you suggesting trying to parse and compare the quantities in the ingredient lines (grouped vs total), or to override the test for this individual case - or pehaps something else?
@jayaddison I'm not entirely sure - ideally we would be able to parse the quantities (or even if you could just separate the quantities out and check that the same ingredients are mentioned), however I'm not sure of a robust way of doing this without perhaps using an external library as a dev-dependency (e.g. ingredient-parser).
If this would be inappropriate/too hard, being able to mark in some way (is it possible to annotate or something like that?) that this particular test shouldn't be run on this particular file that would be good.
Mmm, ok. I'm continuing to think about it. If we had one Python class per scraper unit test, then perhaps what we would naturally do here would be to override the logic that checks for this ingredient-group-ingredient-list equivalence.
My sense is that something similar could be appropriate here. There's definitely possible value in adding a dev-dependency and potentially contributing improvements back there as a result of what we find; but there are also benefits in keeping developer dependencies lightweight.
@jayaddison I could move this test to the legacy tests so I can override if you like? Otherwise could we look at doing a non-legacy similar approach?
I'm also happy to investigate trying to parse the ingredients with the dev-dependencies if you would like me to.
@mlduff please feel free to explore any possibilities; all of those options seem like they'd provide ways to learn more about the problem and possible ways to resolve it - including the related advantages and disadvantages of each.
At the moment I don't have any further recommendations, but I'll continue thinking about it. It's not always the case, but sometimes taking a week or two to consider a problem can help to develop different ways of thinking about it that ultimately help to figure it out properly -- or at least help you to decide what the best option is from those available.
@jayaddison I have just submitted the PR for this - I went for option 4, and devised a way to exempt this scraper from that particular test. If I get time I'll further investigate ingredients-parser, however this seems to be like there still could be edge cases it falls over in (its a model that spits out confidences and the like), so I thought it probably isn't suitable for a unit test.
@jayaddison I'm not entirely sure - ideally we would be able to parse the quantities (or even if you could just separate the quantities out and check that the same ingredients are mentioned), however I'm not sure of a robust way of doing this without perhaps using an external library as a dev-dependency (e.g. ingredient-parser).
If this would be inappropriate/too hard, being able to mark in some way (is it possible to annotate or something like that?) that this particular test shouldn't be run on this particular file that would be good.
This does seem like a good approach, if possible. It would mean that we don't have to add an override option / subclass the test / provide alternative implementations per-scraper.
It seems there are at least two libraries available:
ingredient-parser
(thanks for mentioning that, I hadn't seen it before)ingredient-slicer
(mentioned in #733 recently)As you say, I do think that these should be dev-dependencies only for this use-case; let's prove the functionality and figure out a library that we like for test/internal purposes, and then perhaps in future we could provide parsed quantities on ingredients.
A popular german bread baking recipe website with no schema but very nice recipes
https://www.ploetzblog.de/rezepte/rustikales-weizenmischbrot/id=61b843e6f51707630ab85911 https://www.ploetzblog.de/rezepte/hafer-dinkel-brot/id=618a284828ae7154616ab156 https://www.ploetzblog.de/rezepte/mildes-weizensauerteigbrot/id=619f68b528ae7154616ab768