orangecoding / fredy

:heart: Fredy - [F]ind [R]eal [E]states [D]amn Eas[y] - Fredy will constantly search for new listings on sites like Immoscout or Immowelt and send new results to you, so that you can focus on more important things in life ;)
http://www.orange-coding.net
MIT License

Residential and datacenter strategy for Immobilienscout24 scraping #56

Closed kami4ka closed 2 years ago

kami4ka commented 2 years ago

Is your feature request related to a problem? Please describe.

Datacenter scraping for Immobilienscout24 also succeeds, but it may require more retries and is a bit slower, while residential scraping is faster and more expensive.

Describe the solution you'd like

Allow two strategies for Immobilienscout24 scraping:

1) Datacenter-only: retry N times with datacenter proxies (note: also retry when the status code is 404; this is known behavior for this specific proxy pool).
2) Residential-included: try with datacenter proxies first (preferably with retries), then switch to residential.

So it would be possible to decide whether to use residential proxies or not, but the retry would always apply.

Additional context

The datacenter-only approach would always return a successful result, but it might take some time, while the residential-included approach would be faster and more expensive. It's a result of ScrapingAnt's custom proxy pool feature applied for Fredy.
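To make the two proposed strategies concrete, here is a minimal sketch of how the retry logic could look. It is not Fredy's actual implementation: `scrapeWithStrategy`, `fetchPage`, `maxRetries` and `useResidential` are hypothetical names, and `fetchPage` is assumed to be an injected async function that resolves to `{ status, body }`.

```javascript
// Hypothetical sketch; none of these names are taken from Fredy or ScrapingAnt.
const PROXY = { DATACENTER: 'datacenter', RESIDENTIAL: 'residential' };

/**
 * Tries datacenter proxies first and, only if useResidential is set,
 * falls back to residential proxies. The retry applies to every proxy type.
 * fetchPage(url, proxyType) is assumed to resolve to { status, body }.
 */
async function scrapeWithStrategy(fetchPage, url, { maxRetries = 3, useResidential = false } = {}) {
  const proxyTypes = useResidential
    ? [PROXY.DATACENTER, PROXY.RESIDENTIAL]
    : [PROXY.DATACENTER];
  let attempts = 0;

  for (const proxyType of proxyTypes) {
    for (let i = 0; i < maxRetries; i++) {
      attempts++;
      let res;
      try {
        res = await fetchPage(url, proxyType);
      } catch (err) {
        continue; // network error: treat like a failed attempt and retry
      }
      if (res.status === 200) {
        return { ...res, proxyType, attempts };
      }
      // Any non-200 status, including 404, falls through to the next retry.
    }
  }
  throw new Error(`Scraping ${url} failed after ${attempts} attempts`);
}
```

With `useResidential: false` this behaves as the datacenter-only strategy; with `useResidential: true` the residential proxies are only reached once all datacenter attempts have failed.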

denisalevi commented 2 years ago

I think this would make a lot of sense. Using residential proxies costs 250 credits per call, while datacenter proxies cost only 10 with ScrapingAnt. That gives you just above 1 call per day in the free plan and even with the 100k credit plan you end up with less than 14 calls per day. That makes using residential proxies unfeasible IMHO. I ended up downgrading to version 5.5.0 and adding some retries, which works fine.
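The arithmetic behind those numbers can be checked with a few lines. This is only a sketch: the per-call costs are the ones quoted above, the free plan is taken to be the 10,000 credits mentioned in this thread, and it is an assumption that credits are granted per 30-day month.

```javascript
// Credit cost per call, as quoted in this thread.
const CREDITS_PER_CALL = { datacenter: 10, residential: 250 };

// Assumption: the credit budget is per month, with 30 days per month.
function callsPerDay(monthlyCredits, proxyType, daysPerMonth = 30) {
  return monthlyCredits / CREDITS_PER_CALL[proxyType] / daysPerMonth;
}

console.log(callsPerDay(10000, 'residential').toFixed(1));  // free plan: 1.3
console.log(callsPerDay(100000, 'residential').toFixed(1)); // 100k plan: 13.3
console.log(callsPerDay(100000, 'datacenter').toFixed(1));  // 100k plan: 333.3
```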

kami4ka commented 2 years ago

@denisalevi could you please create a PR for this repo?

denisalevi commented 2 years ago

Hi @kami4ka, I'm not a JS developer and it's just some hacky lines added to an older version of the repo. I'll try to clean it up or make it somehow available soon!

But from a quick look at the current repo version, it looks like it is all there in requestDriver.js. From a quick look I didn't get the logic entirely though. Could it be that something is missing there? An option to set MAX_RETRIES_SCRAPING_ANT and not try residential proxies should be possible? @orangecoding Any thoughts? :)

orangecoding commented 2 years ago

I'm going to give it a shot in a couple of days, currently I have some private responsibilities I have to deal with. Once this is sorted, I'm coming back to this :)

orangecoding commented 2 years ago

@kami4ka If I understand you correctly, you suggest making the use of residential proxies an option and letting the user decide whether they want a faster and more stable solution or a cheaper one?

I do like this approach tbh

kami4ka commented 2 years ago

@orangecoding Yup. Exactly. So the retry mechanism would remain the same; only the proxy type is changeable.

denisalevi commented 2 years ago

I would maybe add an option to set the number of retries when using residential proxies? I think with 3 retries (current setting?) it fails quite often. I have 8 retries and I think it always succeeds with those. I can check the logs again next week; it's running on a Pi that I don't have access to right now.

kami4ka commented 2 years ago

Yeah. I guess it can also be a proxy-type independent option.

orangecoding commented 2 years ago

@kami4ka what was the comment about retries again? I remember you once told me that if the return value is != 200, meaning no success, the customer is not charged. Is this still true? If so, I don't know if it makes sense to make the number of retries configurable, rather than just setting it to, let's say, 10. Of course only if this doesn't cost 250 credits per retry ;)

orangecoding commented 2 years ago

I have just added an option to configure the proxies (still to be finished) and once again noticed... I am no designer at all.. 👯

[image]

kami4ka commented 2 years ago

> @kami4ka what was the comment about retries again? I remember you once told me that if the return value is != 200, meaning no success, the customer is not charged. Is this still true? If so, I don't know if it makes sense to make the number of retries configurable, rather than just setting it to, let's say, 10. Of course only if this doesn't cost 250 credits per retry ;)

Yup. That's true. Every non-200 response from ScrapingAnt is not billable.
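Under that billing rule, failed attempts add latency but no cost, which is why a fixed, fairly high retry count is cheap. A tiny sketch (the helper name `billedCredits` is hypothetical) shows the accounting:

```javascript
// Only responses with status 200 are billed; every other status is free.
function billedCredits(statuses, creditsPerCall) {
  return statuses.filter((status) => status === 200).length * creditsPerCall;
}

// Three failed attempts followed by one success on a residential proxy:
console.log(billedCredits([404, 500, 404, 200], 250)); // 250

// Ten failed attempts cost nothing:
console.log(billedCredits(new Array(10).fill(404), 250)); // 0
```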

kami4ka commented 2 years ago

> I have just added an option to configure the proxies (still to be finished) and once again noticed... I am no designer at all.. 👯

10.000 free API credits :-)

orangecoding commented 2 years ago

Ahh right. Thanks.

orangecoding commented 2 years ago

@denisalevi @kami4ka I have added all necessary changes. Would you mind taking a look and doing a quick review? https://github.com/orangecoding/fredy/pull/59/files

It's pretty straightforward:

denisalevi commented 2 years ago

Great, thanks a lot! I'll try it out tonight :)