hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.61k stars 505 forks source link

add americastestkitchen.com #1060

Closed smilerz closed 2 months ago

smilerz commented 2 months ago

ATK is behind a paywall - so this only makes sense to add as part of v15

smilerz commented 2 months ago

I have three questions/requests for feedback on the latest commit. Is there a way to use HTMLTagStripperPlugin here? headnote = f"Note: {normalize_string(re.sub(r'<.*?>', '', headnote))}\n"

Is there a way to avoid duplicating the ingredient formatting code between ingredients() and ingredient_groups()?

And finally - in ingredients, sometimes they preface the post_text with ", ...." and sometimes they don't. Is there an elegant way to preface the post text with a space if there isn't some sort of delimiter?

smilerz commented 2 months ago

One more question: ATK also has a format for instruction_groups(), but they aren't correlated to the ingredient_groups(). As an example their recipe for lasagna has 3 groups of ingredients named 'meat sauce', 'bechamel, 'noodles and cheese'.

There are 5 instructions groups with no names assigned, but can be eyeballed to include the above 3 assembly steps and 2 intermediate steps on the actual baking.

I'm not sure if it's worth updating the instruction_groups() or not.

jayaddison commented 2 months ago

Is there a way to use HTMLTagStripperPlugin here?

I'm not too familiar with that feature to be honest - but I'll investigate that soon and then will let you know (unless someone else gets to that before me).

Is there a way to avoid duplicating the ingredient formatting code between ingredients() and ingredient_groups()?

Yes, for this there is a helper utility - I'd recommend referring to the _grouping_utils.py documentation for that.

And finally - in ingredients, sometimes they preface the post_text with ", ...." and sometimes they don't. Is there an elegant way to preface the post text with a space if there isn't some sort of delimiter?

Hrm, that's annoying. Maybe we could trim the left of the postText until we find an alphanumeric character; there's no common functionality in the library to do that, though, as far as I'm aware.

smilerz commented 2 months ago

I looked at the _grouping_utils.py it is using soup and HTML elements, but the grouping for ATK is already structured json data so is parsed differently. Or maybe I don't fully understand how so use the util?

jayaddison commented 2 months ago

One more question: ATK also has a format for instruction_groups(), but they aren't correlated to the ingredient_groups(). As an example their recipe for lasagna has 3 groups of ingredients named 'meat sauce', 'bechamel, 'noodles and cheese'.

There are 5 instructions groups with no names assigned, but can be eyeballed to include the above 3 assembly steps and 2 intermediate steps on the actual baking.

I'm not sure if it's worth updating the instruction_groups() or not.

Interesting!

We don't support groups of instructions at the moment - only groups of ingredients. However, both do make sense in recipes, and it's good to have found an example of that -- and to note the possibility for them to be linked in some cases.

I'm thinking about what to do here, too.

smilerz commented 2 months ago

Oh, thanks for the clarification - I misread what instructions_list() was all about (and misrememberd the name)

jayaddison commented 2 months ago

That's OK, they are similar names! :)

jayaddison commented 2 months ago

I looked at the _grouping_utils.py it is using soup and HTML elements, but the grouping for ATK is already structured json data so is parsed differently. Or maybe I don't fully understand how so use the util?

Of course; I forgot that we're using JSON input here. No, I think you've got it - the grouping utils are intended for use with HTML, and they use CSS-based selector queries.

I'll try experimenting with the scraper code here to explore other possibilities.

smilerz commented 2 months ago

Thanks - I appreciate the collaboration.

jayaddison commented 2 months ago

You're welcome - thanks for the pull request. I'll provide some more review comments within the next day or two.

smilerz commented 2 months ago

I saw the tests failed - my mistake, I forgot to update expected values after stripping the leading zeros from commas.

jayaddison commented 2 months ago

@smilerz this has been included and released in v14.56.0 (and also the pre-release v15.0.0-rc3) of recipe-scrapers on PyPi.

smilerz commented 2 months ago

@smilerz this has been included and released in v14.56.0 (and also the pre-release v15.0.0-rc3) of recipe-scrapers on PyPi.

It doesn't look like it's in 14.56.0 - which is what I expected since it's behind a paywall.

jayaddison commented 2 months ago

Ah, oops - yep, my mistake (that notification was copy-pasted; I forgot about the different circumstance here :/ ). Thanks @smilerz.