BettyBossi is not working

Danit2 commented 8 months ago

Pre-filing checks

[x] I have searched for open issues that report the same problem
[x] I have checked that the bug affects the latest version of the library

The URL of the recipe(s) that are not being scraped correctly

Kokos-Schokolade-Würfel

The results you expect to see

I use "Mealie" on Home Assistant, and importing recipes from the Betty Bossi website does not work there. According to the Mealie repository, this is a problem in recipe-scrapers.

The results (including any Python error messages) that you are seeing

I get an error message from Mealie.

jayaddison commented 8 months ago

Hi @Danit2 - thank you for the bug report, we should be able to investigate this soon.

There are two details that would be helpful to narrow this down, if available:

Does Mealie indicate the version of recipe-scrapers that is in use? (I would guess it will look something like v14.50.1 or similar)
Are there any details in the error message? (such as Failed to retrieve recipe title or similar)

Thanks!

Danit2 commented 8 months ago

Hi @jayaddison

Thanks for your answer.

My version of Mealie uses recipe-scrapers version 14.55.0.

I don't see anything useful in the logs, sorry.

INFO: 17-Mar-24 14:57:48    HTTP Request: GET https://www.bettybossi.ch/de/Rezept/ShowRezept/BB_BBZI201015_0003A-40-de?title=Steinpilz-Risotto "HTTP/1.1 200 OK"
INFO: 17-Mar-24 14:57:48    HTTP Request: GET https://www.bettybossi.ch/de/Rezept/ShowRezept/BB_BBZI201015_0003A-40-de?title=Steinpilz-Risotto "HTTP/1.1 200 OK"
127.0.0.1:38902 - "POST /api/recipes/create-url HTTP/1.1" 400
[17/Mar/2024:14:57:48 +0100] 400 164.14.140.15, 172.30.33.17(172.30.32.1) POST /api/recipes/create-url HTTP/1.1 (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36)

This is the only error message I get.

I hope you can help.

Thanks.

jayaddison commented 8 months ago

Extremely helpful, thank you @Danit2 - I hope to investigate this within the next day or so.

jayaddison commented 8 months ago

Ok, this is an interesting bug. I think what is happening here is that:

Mealie requests the recipe page from BettyBossi.
BettyBossi responds to some/all requests with a tiny HTML page containing a JavaScript snippet that redirects to the recipe (this can be an effective bandwidth/bot-reduction technique).
Mealie receives the minimal redirect page as HTML, but the HTTP client it uses (httpx) - like many/most Python HTTP clients - does not evaluate the JavaScript code, so the tiny HTML (with no recipe content) is returned.
recipe-scrapers receives the tiny HTML page and doesn't find the recipe information in it.

My guess is that if a user-agent followed the redirect to get to the recipe URL, and downloaded the HTML from that second page, then recipe-scrapers would be able to extract the recipe metadata.
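To help double-check this theory, here is a minimal sketch of how a client could detect a JavaScript redirect in a small HTML page and work out the target URL without running any JavaScript. The regular expression, the `find_js_redirect` helper and the sample page are all hypothetical; I have not confirmed what BettyBossi's redirect snippet actually looks like, so the patterns may need adjusting.

```python
import re
from urllib.parse import urljoin

# Assumption: the redirect uses one of the common `location` idioms.
# The real BettyBossi page may use a different form.
_JS_REDIRECT = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
    r"""|location\.(?:replace|assign)\(\s*["']([^"']+)["']\s*\)"""
)


def find_js_redirect(html, base_url):
    """Return the absolute redirect target found in `html`, or None."""
    match = _JS_REDIRECT.search(html)
    if match is None:
        return None
    # Exactly one of the two capture groups is filled, depending on the idiom.
    target = match.group(1) or match.group(2)
    # Resolve relative targets against the URL that was originally requested.
    return urljoin(base_url, target)


# Hypothetical example of a minimal redirect page (not real site output).
SAMPLE_PAGE = (
    "<html><head><script>"
    'window.location.href = "/example-recipe";'
    "</script></head></html>"
)

if __name__ == "__main__":
    print(find_js_redirect(SAMPLE_PAGE, "https://www.example.org/start"))
```

If a target is found, the caller could request that second URL and pass the resulting HTML to recipe-scrapers; if nothing is found, the page is either the real recipe or uses a redirect style this sketch does not cover.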

I'll have to spend a bit of time to think about this. It could be good to double-check this theory, too, if anyone out there has time to help.

Zwirbel1 commented 8 months ago

I would be willing to help solve the problem with Betty Bossi, though I am not a developer.

jayaddison commented 8 months ago

@Zwirbel1 if you have time, it would be useful to check whether any open source recipe management / import utilities are able to handle BettyBossi, to get an idea of whether the same problem has been solved elsewhere (and perhaps how).

Zwirbel1 commented 7 months ago

I tested this last week with Tandoor, which was able to import a recipe from Betty Bossi in the online demo version: https://docs.tandoor.dev/. Here's the recipe I imported into the demo version of Tandoor: https://app.tandoor.dev/view/recipe/53071.

SwissOS commented 6 months ago

It seems that bettybossi.ch uses anti-scraping techniques. As already mentioned in another issue (https://github.com/hhursev/recipe-scrapers/issues/531), you need to reload the page twice in order to get the correct HTML.
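A rough sketch of what "load it more than once" could look like in code, in case someone wants to try it. Everything here is an assumption for illustration: `fetch_until_complete` and `has_recipe_markup` are hypothetical helpers, and checking for a JSON-LD script tag is only a guess at how to tell the real recipe page from a stub. In practice `fetch` could wrap a shared `httpx.Client` so that any cookies set by the first response are sent with the next request.

```python
def has_recipe_markup(html):
    """Heuristic (assumption): the full page embeds JSON-LD metadata."""
    return "application/ld+json" in html


def fetch_until_complete(fetch, url, looks_complete=has_recipe_markup, attempts=2):
    """Call `fetch(url)` up to `attempts` times until the HTML looks complete.

    `fetch` is any callable that takes a URL and returns the HTML as a string.
    Returns the first HTML that passes `looks_complete`, or None if none does.
    """
    for _ in range(attempts):
        html = fetch(url)
        if looks_complete(html):
            return html
    return None


if __name__ == "__main__":
    # Fake fetcher standing in for a real HTTP client: the first call returns
    # a stub page, the second returns a page containing recipe metadata.
    responses = iter([
        "<html><script>/* stub */</script></html>",
        '<html><script type="application/ld+json">{}</script></html>',
    ])
    html = fetch_until_complete(lambda url: next(responses), "https://www.example.org/recipe")
    print("complete page received" if html else "still only the stub page")
```

Keeping the HTTP client behind a plain callable makes the retry logic easy to test without network access.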

Zwirbel1 commented 6 months ago

@SwissOS: I tried reloading the page several times, but I get the same URL and HTML each time, which does not allow me to import the recipe. Is there anything else I can change to get the correct HTML / URL?

hhursev / recipe-scrapers

BettyBossi is not working #1028