hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.62k stars 508 forks

Scraper request: bergamot.app #986

Open josefhelie opened 6 months ago

josefhelie commented 6 months ago

I'm currently using the free app Bergamot (which is closed source) to store my recipes, but I'd like to move to Mealie. I've encountered an error message that says, 'recipe_scrapers was unable to scrape this URL.' Is it possible to get a scraper, please? 😇 Thanks for your help. A link to a shared recipe: https://dashboard.bergamot.app/shared/T8IJLjbtHdh2pj

jayaddison commented 6 months ago

Hi @josefhelie - thanks for the question / feature request.

In theory, yes this is possible - the webpage is public and represents a recipe. However, there are some potentially important items of information absent on the page: in particular, its origin (from another website? self-authored?) and the instructions.

Do you know whether those details can be included when sharing a recipe like this from the app? It's difficult to develop and test without a few complete samples.

josefhelie commented 6 months ago

I'm sorry, I shared a recipe that doesn't include all the requested fields. Here is a better example: https://dashboard.bergamot.app/shared/mIB4jYQtZU1A97 Is it better?

jayaddison commented 5 months ago

Yep, that initially looks good to me @josefhelie - it's difficult to say for certain without coding it up, but it seems to have most/all of the information we'd need. Thanks!

josefhelie commented 5 months ago

Thanks a lot @jayaddison :)

josefhelie commented 3 months ago

May I ask whether there is any update on this request, @jayaddison? Thanks :)

jayaddison commented 3 months ago

Hi @josefhelie - apologies for my delayed reply. No further updates on this at the moment I'm afraid. Do you have any interest in learning some Python coding?

mlduff commented 3 months ago

@jayaddison I took a look, and it seems fairly easy to call the API endpoint, which can be derived from the URL of the recipe. For https://dashboard.bergamot.app/shared/mIB4jYQtZU1A97 the associated API endpoint is https://api.bergamot.app/recipes/shared?r=mIB4jYQtZU1A97.
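A minimal sketch of that derivation, assuming every share link has the form `/shared/<id>` on `dashboard.bergamot.app`. Only the single URL pair above has been checked, and the function name and validation rules are illustrative, not part of recipe-scrapers:

```python
from urllib.parse import urlencode, urlparse

# Endpoint observed for the example recipe above; assumed to be the same for all shares.
API_BASE = "https://api.bergamot.app/recipes/shared"


def shared_url_to_api_url(shared_url: str) -> str:
    """Map a dashboard share link to the corresponding API endpoint."""
    parsed = urlparse(shared_url)
    parts = [part for part in parsed.path.split("/") if part]
    # Assumption: share links always look like /shared/<recipe id>.
    if (
        parsed.netloc != "dashboard.bergamot.app"
        or len(parts) != 2
        or parts[0] != "shared"
    ):
        raise ValueError(f"not a Bergamot share link: {shared_url!r}")
    return f"{API_BASE}?{urlencode({'r': parts[1]})}"


print(shared_url_to_api_url("https://dashboard.bergamot.app/shared/mIB4jYQtZU1A97"))
```

Rejecting anything that is not a share link keeps the scraper from sending unexpected identifiers to the API.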

I'm not sure how the library normally supports the case of recipes being loaded via an API call after the original page load - I can see a few examples (goustojson.py, monsieurcuisine.py) that seem to do this - I would be happy to tackle this if you are happy for me to do so?

jayaddison commented 3 months ago

Thanks @mlduff!

> I'm not sure how the library normally supports the case of recipes being loaded via an API call after the original page load - I can see a few examples (goustojson.py, monsieurcuisine.py) that seem to do this - I would be happy to tackle this if you are happy for me to do so?

About the handling of APIs: yep, well discovered - we do have a few scrapers that retrieve data using APIs at the moment. A potential design/architecture problem with that is that it tightly couples the scraper to an HTTP client - currently requests, which is close to a de facto standard client for Python, but even so, it may not be ideal to depend entirely on it.

Meanwhile we have a v15 development branch that can optionally use requests, but that otherwise requires callers to retrieve the HTML and pass it to the scraper themselves. Marginally less convenient, but allowing callers to use whatever HTTP client(s) they prefer (anything from built-in urlopen, low-level urllib3, requests, httpx, etc).
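To illustrate the decoupling idea in general terms (this is not the v15 API; `Fetcher`, `parse_recipe_payload`, `scrape`, and the `"title"` field are all hypothetical names chosen for the sketch), the parsing step can accept already-retrieved content, while retrieval is left to a callable supplied by the caller:

```python
import json
from typing import Callable
from urllib.request import urlopen

# A fetcher is any callable that takes a URL and returns the response body as text.
Fetcher = Callable[[str], str]


def parse_recipe_payload(payload: str) -> dict:
    """Pure parsing step: no network access, so any HTTP client can feed it."""
    data = json.loads(payload)
    return {"title": data.get("title")}  # "title" is an assumed field name


def scrape(url: str, fetch: Fetcher) -> dict:
    """Convenience wrapper: the caller decides how the content is retrieved."""
    return parse_recipe_payload(fetch(url))


def urllib_fetch(url: str) -> str:
    """One possible fetcher, built on the standard library only."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

A caller could pass `urllib_fetch`, a small wrapper around requests or httpx, or (in tests) a function that returns canned text.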

A long explanation, but the short answer is: yep, please go ahead, but be aware that this would currently only be supported in the v14 / mainline branch.

jayaddison commented 3 months ago

@mlduff also a design / implementation question for your consideration: those recipes sometimes contain a link to the original source of the recipe. Should we return that as the canonical URL for recipes when possible?

mlduff commented 3 months ago

> Meanwhile we have a v15 development branch that can optionally use requests, but that otherwise requires callers to retrieve the HTML and pass it to the scraper themselves. Marginally less convenient, but allowing callers to use whatever HTTP client(s) they prefer (anything from built-in urlopen, low-level urllib3, requests, httpx, etc).

@jayaddison is your preference for me to develop this in the v15 branch? If I implement it in v14 (which seems easier), will it need rewriting at some point? And will the other scrapers, like the examples I found, need similar rewriting?

mlduff commented 3 months ago

> @mlduff also a design / implementation question for your consideration: those recipes sometimes contain a link to the original source of the recipe. Should we return that as the canonical URL for recipes when possible?

Good point, will try to do that.
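One way this could look, as a sketch only: the `"sourceUrl"` key is a guess at how the API response might name the original link, and `canonical_url` here is a standalone helper rather than the library's scraper method.

```python
from typing import Mapping


def canonical_url(recipe: Mapping[str, object], shared_url: str) -> str:
    """Prefer the original source link when present, else the Bergamot share link."""
    source = recipe.get("sourceUrl")  # hypothetical key; check the real API response
    if isinstance(source, str) and source.startswith(("http://", "https://")):
        return source
    # Self-authored recipes, or recipes without a usable link, fall back to the share URL.
    return shared_url
```

Falling back to the share link means that recipes without a recorded source still return a usable URL.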

jayaddison commented 3 months ago

> > Meanwhile we have a v15 development branch that can optionally use requests, but that otherwise requires callers to retrieve the HTML and pass it to the scraper themselves. Marginally less convenient, but allowing callers to use whatever HTTP client(s) they prefer (anything from built-in urlopen, low-level urllib3, requests, httpx, etc).
>
> @jayaddison is your preference for me to develop this in the v15 branch? If I implement it in v14 (which seems easier), will it need rewriting at some point? And will the other scrapers, like the examples I found, need similar rewriting?

I'd recommend implementing it for v14, yep.

josefhelie commented 3 months ago

> Hi @josefhelie - apologies for my delayed reply. No further updates on this at the moment I'm afraid. Do you have any interest in learning some Python coding?

Thanks @jayaddison, but I don't have enough free time to do that, even though I would like to! 😢 Thanks @mlduff too :)

mlduff commented 3 months ago

@jayaddison I noticed that the tests for the two scrapers I mentioned above are located under the legacy section - do I add my tests under there as well?

mlduff commented 3 months ago

@josefhelie are you able to provide a couple more recipe URLs please so I can test?

jayaddison commented 3 months ago

> @jayaddison I noticed that the tests for the two scrapers I mentioned above are located under the legacy section - do I add my tests under there as well?

@mlduff yep, that's the correct place for those; thanks for checking :+1: You should be able to configure the expected_requests property in the tests to return example results for both the initial HTML HTTP GET response, and also the subsequent (probably also HTTP GET) API request.

jayaddison commented 3 months ago

@josefhelie have you found any pages shared on Bergamot where the original author is credited? I've seen a few pages that show the domain name of the source URL. I'm wondering whether there are any that list names/usernames.

josefhelie commented 3 months ago

@jayaddison I'm not sure I have. Would it help if you gave me a recipe that I could import into Bergamot, so that I could then send you the link to the imported recipe?

mlduff commented 2 months ago

@josefhelie Here is one that has an author: https://www.bestrecipes.com.au/recipes/peanut-butter-cookies-recipe/fowk6kuy

josefhelie commented 2 months ago

I imported it into Bergamot; here it is: https://dashboard.bergamot.app/shared/REbGkQaNoVJ5kM

jayaddison commented 2 months ago

Thanks @josefhelie - so roughly speaking, it seems like some source recipes may include author info, and the Bergamot page includes a link back to the original, but our scraper can't directly retrieve the author details at the moment (they're not in the Bergamot page, so it seems like we'd have to ask Bergamot to add those, or to retrieve them ourselves from the original URL).

I'm not completely sure what to do here; I personally place quite a lot of importance on retaining the author name/info (even though it's challenging sometimes) because my assumption is that a lot of recipe authors themselves would want that to be included when people view their recipes.

I haven't contacted Bergamot to ask whether they'd consider attempting to include that info themselves, so that's one option I'm considering. Is there a support/feedback option in the app itself?