orangecoding / fredy

:heart: Fredy - [F]ind [R]eal [E]states [D]amn Eas[y] - Fredy will constantly search for new listings on sites like Immoscout or Immowelt and send new results to you, so that you can focus on more important things in life ;)
http://www.orange-coding.net
MIT License
231 stars 58 forks

Immoscout scraping started failing (almost?) always #63

Closed: denisalevi closed this issue 9 months ago

denisalevi commented 1 year ago

It appears that scraping Immoscout stopped working reliably. This may not actually be a bug, but rather a change at Immoscout? Either way, all my Immoscout providers keep failing, both with datacenter and residential proxies. I don't think I've seen a successful Immoscout scrape in days (I was on my own patched fork before, but now I updated to current master and it's the same: all retries seem to be failing).

Can someone check or reproduce this? Or might this be some problem on my side? Maybe @kami4ka can shed some light from the ScrapingAnt side? :)

kami4ka commented 1 year ago

Unfortunately, it always fails. We're aware of this situation and are currently trying to fight it. While working on this issue we've already changed the technology behind the service and improved the success rate for various other websites, but not for Immoscout yet. We're still working on it and will notify all users who tried to make a request to Immoscout via email.

orangecoding commented 1 year ago

It's honestly a fight against windmills.

I know there are hundreds of Fredy users out there coz I keep getting emails from ppl asking me to fix the Immoscout scraper...

denisalevi commented 1 year ago

Thanks for the information @kami4ka

And yeah, I can imagine @orangecoding. Fredy is a real game changer, especially in Berlin, where every second counts (it saved my ass a couple of months ago). And I can imagine that Immoscout is constantly changing. But I have to say, considering that, Fredy has been running quite smoothly for the last few months, thanks for that! I have it set up for a few friends, who share the ScrapingAnt fee (currently stopped until it is working again). I just saw your sponsoring option. I'll make sure to include you in the shared costs once we are up again :)

If there is anything I can contribute, please let me know. It's just not my expertise at all unfortunately.

kami4ka commented 1 year ago

The latest update is that we've found a way to bypass the detection. We're going to test and prepare everything for the cluster deployment (some things are still unclear on that part) and reach out via email to anyone who made Immoscout calls over the last two months.

orangecoding commented 1 year ago

Awesome.

Can you share with us how many users we are talking about? @kami4ka

phil-bergmann commented 1 year ago

Hey! First of all thanks for the amazing project :) I was experimenting with ways to evade the Immoscout restrictions and tested these approaches:

Maybe the info helps, but having to render the browser is a bit of a bummer for easy deployment. And this undetected_chromedriver library is only available in Python and does some fancy stuff I do not completely understand.

orangecoding commented 1 year ago

Hi phil,

Thanks. As I said earlier, this is a cat-and-mouse game.

We might be able to overcome this by using unprotected API endpoints. However, this too might be something that only works for a limited amount of time.

phil-bergmann commented 1 year ago

Hey @orangecoding,

agreed, it is a very nasty cat-and-mouse game, with the other side probably having a lot more developers than we have here working on this project. But if we somehow manage to use a Chrome-based browser through a package like undetected_chromedriver, with the screen actually being rendered, it will be very difficult to detect that without locking "legitimate" users out of Immoscout. The only problem is that I still haven't found a way to run that in Docker. Unprotected API endpoints will get fixed for sure at some point, and I guess Immoscout is probably even monitoring repos like this one here ;)

Lukewa commented 1 year ago

Hi @phil-bergmann, can you share your approach with undetected_chromedriver? Would be nice to give it a try. I've also seen your approach with ScrapingBee, but would like to avoid the paid account.

orangecoding commented 1 year ago

Immoscout is working for me every now and then. @kami4ka Do you have an update for us?

mygrexit commented 1 year ago

Stumbled upon this project today and was asking myself the same thing. I really hope this gets fixed. @kami4ka I would subscribe right away!

orangecoding commented 1 year ago

For some reason, @kami4ka is currently unavailable. I hope he's doing ok as he's from Ukraine... In the meantime, I see that nearly all my tests were successful after a couple of retries.

Can you guys confirm?
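
The "successful after a couple of retries" behaviour described above can be approximated with a plain retry loop. The following is only a minimal sketch and not how Fredy itself is implemented: `fetch` is a hypothetical callable that returns the page content on success and `None` when the request was blocked, and the attempt count and delays are arbitrary assumptions.

```python
import random
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[], Optional[str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call the (hypothetical) fetch() until it succeeds or attempts run out.

    Returns the first non-None result, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff plus a little jitter, so that repeated
            # requests are not fired at a fixed, easily recognisable rhythm.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    return None
```

Passing `sleep` as a parameter is just a convenience for testing the loop without actually waiting. Retries only help while some requests still get through; they do nothing once every request is blocked.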

kami4ka commented 1 year ago

Sorry for the delay.

We're a bit stuck moving our PoC for this detection to the production environment, so it's getting delayed. We're doing our best, as it would allow us to cover more protections like this one, so it's our top priority.

I'll keep you updated once we've figured it out completely.

liebecode commented 1 year ago

hey @kami4ka, just wondering if there is an update available for this? I notice immoscout is not able to be used; it never finds any listings. Thank you!

ilindaniel commented 1 year ago

ScrapingBee (not ScrapingAnt) and Zyte API are able to scrape Immoscout.

A request on ScrapingBee with a "stealth proxy" costs approx. $0.04 while Zyte API costs $0.008

orangecoding commented 1 year ago

Yeah, I am also considering providing different solutions. Not sure, however, whether to replace ScrapingAnt or just add ScrapingBee.

orangecoding commented 1 year ago

@ilindaniel By the way, I was trying to use ScrapingBee to scrape Immoscout (used it on their website) but hit the bot detection every time. Are you totally sure ScrapingBee found a way around it? I honestly don't want to implement various services just to see that they don't work either.

ilindaniel commented 1 year ago

Have you checked the "stealth proxy" checkbox?

Nevertheless, I'd suggest having a look at Zyte, since they are 5x cheaper than ScrapingBee.
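
To put the per-request prices mentioned above into perspective, here is a back-of-the-envelope sketch. The prices are the approximate figures quoted in this thread; the 5-minute polling interval, the single search, and the absence of retries are assumptions made purely for illustration.

```python
# Approximate per-request prices in USD, as quoted in this thread.
SCRAPINGBEE_STEALTH = 0.04
ZYTE_API = 0.008


def monthly_cost(price_per_request: float, interval_minutes: int = 5,
                 searches: int = 1, days: int = 30) -> float:
    """Estimate the cost of polling `searches` result pages every
    `interval_minutes` for `days` days, with one request per poll."""
    polls_per_day = (24 * 60) // interval_minutes
    return polls_per_day * days * searches * price_per_request


if __name__ == "__main__":
    for name, price in [("ScrapingBee (stealth proxy)", SCRAPINGBEE_STEALTH),
                        ("Zyte API", ZYTE_API)]:
        print(f"{name}: ${monthly_cost(price):.2f} per month")
    print(f"price ratio: {SCRAPINGBEE_STEALTH / ZYTE_API:.1f}x")
```

Under these assumptions a single search comes to about $345.60 per month with ScrapingBee's stealth proxy versus about $69.12 with Zyte API, and every retry multiplies both figures.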

kami4ka commented 1 year ago

Hey guys. I'd suggest trying out ScrapeOps: https://scrapeops.io/proxy-aggregator/ They aggregate web scraping providers, which could be the best approach for cases like this.

Providers may use similar tech, but it still differs in the details (for example, in how a browser is executed in the cluster), so this lets you avoid being tied to one particular provider and use all of them instead.

You can find out more on their landing page.

orangecoding commented 1 year ago

@kami4ka I tried them (as well as a bunch of others), however I always hit the wall: {"status":"Failed to get successful response from website. Please retry the request."}
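
A failure payload like the one quoted above can be told apart from a real page before any parsing is attempted. This is a minimal sketch based only on that single quoted response; the function name is hypothetical, and the assumption that all such failures are JSON objects whose `status` starts with "Failed" is not confirmed by any provider documentation.

```python
import json


def is_failed_response(body: str) -> bool:
    """Return True if `body` looks like the JSON failure payload quoted
    in this thread rather than an actual HTML page."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON at all, e.g. a regular HTML document.
        return False
    if not isinstance(payload, dict):
        return False
    status = payload.get("status")
    # Assumption: failures carry a "status" string beginning with "Failed".
    return isinstance(status, str) and status.startswith("Failed")
```

A caller could use this check to decide whether a retry is worthwhile instead of handing the error text to an HTML parser.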

To be quite honest with you, I am sick and tired of this cat-and-mouse game and am currently thinking about removing the support for Immoscout entirely.

kami4ka commented 1 year ago

@orangecoding Yeah, I totally understand you. We always suggest finding an alternative data source when the cost of extracting data from a specific source becomes a problem, including the cost of building the detection avoidance. Unfortunately, it looks like that is the case with Immoscout too.

HerzogVonWiesel commented 11 months ago

As of now, Immoscout still doesn't work, right? Or am I missing something in my setup? Cheers and thank you!

orangecoding commented 11 months ago

No, and it doesn't seem like @kami4ka has much confidence in fixing this.

I was recently playing around with AI to overcome the captcha, but there is actually a legal issue.

See, scraping is ok-ish as long as you do not harm the website and are not trying to defeat things that have been put in place to block scraping, like captchas.

And tbh, I don't want to mess with them.. ;)

ilindaniel commented 11 months ago

Zyte is still able to scrape ImmoScout:

[screenshot]

However I'm quite lazy and use ImmoScout's email notification service at the moment. Might not be as instant as scraping it, but that's the quick fix for now.