hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.61k stars 505 forks source link

Add bergamot #1064

Open mlduff opened 2 months ago

mlduff commented 2 months ago

This scraper adds support for dashboard.bergamot.app. Would of liked a few more test URLs, but used the following in testing.

Had to implement a legacy test - not sure if I used the right method to specify the origin URL (which is different to the actual request URL). I ended out having a yield for the first URL (that the user would pass in), with the associated testhtml not actually being used.

Resolves https://github.com/hhursev/recipe-scrapers/issues/986

jayaddison commented 2 months ago

Had to implement a legacy test - not sure if I used the right method to specify the origin URL (which is different to the actual request URL). I ended out having a yield for the first URL (that the user would pass in), with the associated testhtml not actually being used.

That seems OK to me @mlduff - a web browser would load the testhtml HTML page and then load any related JavaScript, and subsequently should make the JSON request you found previously. In addition, people using this library would likely expect to pass HTML from a webpage in -- meaning that they've made a HTTP GET request to request the relevant dashboard page, and then the scraper will make the subsequent HTTP GET. So the test describes the expected usage and behaviour, which is what we want.

jayaddison commented 2 months ago

This generally looks good to me, thank you @mlduff - a few small suggestions/questions and then I'll re-review.

mlduff commented 2 months ago

I had a look through the JS on the site and I have added cook_time as I believe it should be implemented based on:

{className:"recipe__time_title"},n("Prep")),a.a.createElement("div",{className:"recipe__time_content"},Ce(r.time.prepTime))),!!r.time.cookTime&&a.a.createElement("div",{className:"recipe__time column"},a.a.createElement("div",{className:"recipe__time_title"},n("Cook")),a.a.createElement("div",{className:"recipe__time_content"},Ce(r.time.cookTime))),!!r.time.totalTime&&a.a.createElement("div",{className:"recipe__time column"},a.a.createElement("div",

Couldn't find author using the same method @jayaddison

jayaddison commented 2 months ago

Ok, thanks @mlduff. I think retaining the authorship info is fairly important, so I'd like to pause until we get some more ideas / suggestions.

mlduff commented 2 months ago

Ok, thanks @mlduff. I think retaining the authorship info is fairly important, so I'd like to pause until we get some more ideas / suggestions.

@jayaddison would it be completely off the cards to do an extra request to the origin website? That's the only way I can see getting any extra information as it seems bergamot does not support author at all.

If you are satisfied that bergamot doesn't support author whatsoever, would you merge the PR? Or does this mean we have to decline it?

jayaddison commented 2 months ago

@mlduff hmm, I'm not sure that I'd be entirely happy with making an additional request either; it would seem kinda unexpected for some code that appears to scrape from HTML at a URL to make requests to other domains -- and fairly arbitrarily, based on whatever the response of Bergamot included. I don't expect Bergamot would intentionally link to non-related sites for any reason, but that kind of thing can happen and can cause security/privacy vulnerabilities in web browsers; so I'd prefer to maintain simplicity and avoid that.

mlduff commented 2 months ago

@jayaddison yeah that makes sense. Is there anything I can do to help get this one in, or do we just have to wait for bergamot to reply and see if they will add author? If they say no, can we proceed with just the source domain?

mlduff commented 2 months ago

@jayaddison Sorry, just bumping on the above - I have implemented the source domain for the author field - is it okay to proceed with this PR?

jayaddison commented 2 months ago

Thanks @mlduff - no, I don't want to proceed with this given that it isn't providing attribution for an author name that we know exists on an origin recipe. Source domain could arguably provide some of that, but I'd prefer to wait until we can retrieve the author credit from the page we're scraping.

mlduff commented 2 months ago

Should I closethis PR for now @jayaddison? Or leave it up?

jayaddison commented 1 month ago

Should I closethis PR for now @jayaddison? Or leave it up?

I've been on the other side of this situation a few times now -- providing a contribution to a project that it is unclear will ever be accepted, sometimes for a reason that they've communicated, or sometimes simply because maintainers are unavailable -- and my usual preference is to keep the pull request open, because it's:

However: it can be frustrating sometimes if your PR doesn't get merged for a long time. Usually I ping the maintainers once every so often (say once per 3 months or so) if I think the PR is still valid -- but sometimes I just close them to reduce my mental overhead if it's been a long time with no progress.