orangecoding / fredy

:heart: Fredy - [F]ind [R]eal [E]states [D]amn Eas[y] - Fredy will constantly search for new listings on sites like Immoscout or Immowelt and send new results to you, so that you can focus on more important things in life ;)
http://www.orange-coding.net
MIT License

Immoscout support #20

Closed carstenhag closed 3 years ago

carstenhag commented 3 years ago

Supporting Immoscout would be great :). Tried out a bit and so far it looks good!

carstenhag commented 3 years ago

I realize that in the readme you say it's not supported yet, but that can be changed eventually... perhaps... maybe, right? :D

orangecoding commented 3 years ago

@carstenhag I'd love to support it. In fact, I supported Immoscout for a long time, but for a couple of months now they've put a lot of effort into blocking crawlers and bots like Fredy.

They're using a pretty effective way of determining whether or not you're a bot. If the algorithm decides you're a bot, you need to solve a captcha.

Here's what I found is happening:

1) Immoscout uses reCAPTCHA to compute a score that indicates whether you're a bot
2) If this score is too high, an additional check is applied (a localStorage value is set)
3) If this check fails and the score is too high, any further request is blocked until you solve a captcha

This, however, is not solvable at the moment. Sure, I could trick the localStorage check, but tricking reCAPTCHA is a whole different level.
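To make the order of the three checks easier to follow, here is a rough model of the flow as described in this comment. It is only a sketch: the threshold value, the function name `classifyRequest`, and the field names are hypothetical, and it follows the wording used here ("score too high" means bot) rather than anything confirmed about Immoscout's or Google's actual implementation.

```javascript
// Illustrative model of the three-step flow described in this thread.
// HYPOTHETICAL: the threshold and all names are made up for illustration;
// only the order of the checks is taken from the comment.
const HYPOTHETICAL_SCORE_THRESHOLD = 0.5;

function classifyRequest({ botScore, localStorageCheckPassed }) {
  // Step 1: a score that is not "too high" lets the request through.
  if (botScore <= HYPOTHETICAL_SCORE_THRESHOLD) {
    return 'allowed';
  }
  // Step 2: the score is too high, so the additional localStorage check applies.
  if (localStorageCheckPassed) {
    return 'allowed';
  }
  // Step 3: score too high and the check failed, so a captcha must be solved.
  return 'captcha-required';
}
```

In this model a request-only crawler could at best fake step 2; the score in step 1 is outside its control, which is the part described as not solvable.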

I'm working on this with several open-source devs, but if you have an idea, you're more than welcome to contribute ;)

carstenhag commented 3 years ago

I see, thanks for the extensive answer :). I know some solutions to reCaptcha from other tools (specifically JDownloader2):

orangecoding commented 3 years ago

> provide a browser extension or app which is linked to the server. This app/browser extension opens the reCaptcha thing and you can perform the challenge there.

This would make no real sense to me in an app like Fredy. Its purpose is to run every x minutes and crawl on its own rather than requiring human interaction.

> integrate captcha solving websites

Yes, I'm looking into something like this. The problem, however, is that in order to solve reCAPTCHA I'd need to implement the crawler core differently. Currently, my crawler is extremely lightweight: it's purely request-based and doesn't even use a headless browser. With a captcha solver like the one you mentioned, I'd need to use something like Puppeteer, which would introduce different problems (for instance, when you just want to run it on a Linux server).

orangecoding commented 3 years ago

Currently experimenting with cached versions of Immoscout. Maybe this could be a solution.

http://webcache.googleusercontent.com/search?q=cache:immobilienscout24.de
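A minimal sketch of how such a cache link could be built from a search URL. The helper name `buildCacheUrl` is hypothetical, and whether the cache endpoint actually serves a given search page (or how fresh it is) is an untested assumption; the code only reproduces the shape of the link above.

```javascript
// Prefix taken from the link posted above.
const GOOGLE_CACHE_PREFIX = 'http://webcache.googleusercontent.com/search?q=cache:';

// HYPOTHETICAL helper: turns a target URL into a cache lookup URL.
function buildCacheUrl(targetUrl) {
  // Drop the scheme so the result matches the shape of the posted link.
  const withoutScheme = targetUrl.replace(/^https?:\/\//, '');
  // Encode the rest so that '?' and '&' in a search URL stay inside the q parameter.
  return GOOGLE_CACHE_PREFIX + encodeURIComponent(withoutScheme);
}
```

For example, `buildCacheUrl('https://immobilienscout24.de')` yields the link shown above.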

carstenhag commented 3 years ago

Funnily it also works via archive.today, I just recorded https://archive.ph/Gw2qz for example

orangecoding commented 3 years ago

> Funnily it also works via archive.today, I just recorded https://archive.ph/Gw2qz for example

Yes, but unfortunately those snapshots are unreliable and most likely pretty old. The one you posted, for instance, is from yesterday. However, this is an interesting thing to look at. Two questions arise at this point:

1) How can they scrape the whole thing without running into captcha hell?
2) Is there a way to make it work for our purpose (by having a more up-to-date version)?

saschnet commented 3 years ago

The purpose of archive.today is to create a snapshot of a website at a specific point in time. Therefore, the snapshot that carstenhag created will stay as it was yesterday.

I would suggest pointing that service at a server of our own to find out what their request looks like on the server side. Unfortunately, I was not able to find any source code for the service, so we cannot simply copy their strategy.

Theoretically, it would also be possible to use archive.today as a proxy, but I do not think that they would like such a use of their service.
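The suggestion above, letting the archiving service request a server we control and inspecting what arrives, could look roughly like this in Node. This is a sketch using only the built-in `http` module; `summarizeRequest` and `createInspectorServer` are hypothetical names, and the port in the usage note is arbitrary.

```javascript
const http = require('http');

// Reduce an incoming request to the parts that show how the client presents itself.
function summarizeRequest(req) {
  return {
    method: req.method,
    url: req.url,
    remoteAddress: req.socket ? req.socket.remoteAddress : undefined,
    headers: { ...req.headers },
  };
}

// Build (but do not start) a server that logs every request and answers with a tiny page.
function createInspectorServer(log = console.log) {
  return http.createServer((req, res) => {
    log(JSON.stringify(summarizeRequest(req), null, 2));
    res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
    res.end('<!doctype html><title>inspector</title><p>ok</p>');
  });
}
```

Running `createInspectorServer().listen(8080)` on a publicly reachable host and then submitting that host's URL to the service would print the headers (user agent, accepted languages and so on) it sends. Headers alone will not reveal whether a real browser is executing JS, so this only answers part of the question.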

carstenhag commented 3 years ago

Yeah, agreed, we can't use archive.today of course, as they would be pretty annoyed by us. I just found it interesting that it works for them. I guess they have a browser running, because they also run JS.

orangecoding commented 3 years ago

@carstenhag most likely some headless approach like https://pptr.dev/

saschnet commented 3 years ago

If you find a headless browser that works for circumventing the captcha, we could provide it optionally as an additional Docker container. However, my attempts using the most recent version of Chrome + Selenium have failed so far, even when I turned off everything that indicates headless mode.

orangecoding commented 3 years ago

@saschnet As I mentioned, they're using reCAPTCHA by Google. I know a few of the guys who built it and I know a little bit about the internals, so I know reCAPTCHA works by testing various things. In the end, they build a score. If this score is too high, you're considered a bot. The score calculation changes every once in a while to make it harder for people to fight against, so I'm a bit hopeless tbh.

orangecoding commented 3 years ago

Interestingly enough, however, when I add the search URL here and wait for it to take a screenshot, it works 0o https://web-capture.net/

orangecoding commented 3 years ago

Got progress... seems like I can bring back the support sooner or later. It needs lots of polishing and checks on whether this approach will keep working in the future, but so far so good.

orangecoding commented 3 years ago

Ok, I now have a reproducible (but very experimental) way to support Immoscout again. Will push the changes soonish.

orangecoding commented 3 years ago

@carstenhag @saschnet I've created a PR to bring back Immoscout support and I would very much appreciate it if you could take a look at it. https://github.com/orangecoding/fredy/pull/21

carstenhag commented 3 years ago

I briefly looked over it, but as I'm not that experienced with JS I didn't comment. Thanks for adding the support for Immoscout! :)