Closed albe-jj closed 2 years ago
Interesting. One thing you could try is opening the inspector in Chrome, looking at the network requests and exporting the relevant one to cURL or similar. From there, you can try to replicate the request exactly (apart from the IP address). This will tell you whether there is some blacklisting going on.
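If you would rather replay the exported request from Python, something along these lines might work as a starting point. It is only a sketch using the standard library: the header values are illustrative placeholders (paste your own from the exported cURL), and `build_request`, `was_redirected` and `replay` are hypothetical helper names, not part of lambda-scraper.

```python
import urllib.request

# Illustrative subset of headers; replace with the ones exported from the
# browser. These values are placeholders, not the real request.
EXPORTED_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "accept": "text/html,application/xhtml+xml",
    "accept-language": "en-GB,en;q=0.9",
}


def build_request(url, headers):
    """Build a GET request carrying the exported headers."""
    return urllib.request.Request(url, headers=dict(headers), method="GET")


def was_redirected(requested_url, final_url):
    """True if we ended up on a different page than the one requested."""
    return requested_url.rstrip("/") != final_url.rstrip("/")


def replay(url, headers, timeout=10):
    """Send the request and report the status, final URL and redirect flag.

    urllib follows redirects automatically, so geturl() returns the page we
    finally landed on. HTTP error statuses (e.g. 403) raise HTTPError.
    """
    request = build_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        final_url = response.geturl()
        return {
            "status": response.status,
            "final_url": final_url,
            "redirected": was_redirected(url, final_url),
        }
```

Running `replay(...)` with the same headers from your laptop and from a Lambda function should show whether only the source IP address changes the outcome.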
Thanks for the reply. As you suggested, I exported the exact request my browser sends and sent it both from a Lambda function and from my laptop. I get redirected to the "we think you are a bot" page every time I send it from Lambda, while the request goes through when I send it from my home IP address.
I also tried connecting through a VPN from a different country to see whether it was just a location issue, but the request also goes through from an IP address located outside the Netherlands.
At this point I think pararius.com treats any request coming from an Amazon IP address as suspicious.
Do you have any other ideas for testing this? Any suggestions for an alternative way to rotate IPs? Residential proxy services seem quite expensive for the small project I want to run.
It's quite possible they have blacklisted Amazon IPs. Another quick test would be to spin up an EC2 instance and execute the cURL request on the command line, and do the same thing from your home computer, just to rule out anything wrong in the Python code. Otherwise I don't think there is a very cheap solution. Can you send me an example URL that you are trying to scrape? I will try to have a look.
Hi Robert, thank you for the reply. Good idea, I will try sending it from EC2 later today. The request is below.
curl 'https://www.pararius.com/apartments/rotterdam/page-1' \
  -H 'authority: www.pararius.com' \
  -H 'cache-control: max-age=0' \
  -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="98", "Google Chrome";v="98"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'dnt: 1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-fetch-dest: document' \
  -H 'referer: https://www.pararius.com/english' \
  -H 'accept-language: en-GB,en;q=0.9,en-US;q=0.8,it;q=0.7' \
  -H 'cookie: fl_cp_pass_a=eyJLZXkiOiIyTk0zQkJUN0tUNkpDQ0VKVVVJR1JZQkhGWlBKQk8yVyIsIlBhc3MiOiJHR1QzRlNCUU1CQTI3TElBVzQ3Wlo2S09DUUxTMkw0TCIsIlBhdGgiOiIvcHV6emxlL3ZlcmlmeSJ9; OptanonConsent=isGpcEnabled=0&datestamp=Tue+Feb+15+2022+10%3A30%3A05+GMT%2B0100+(Central+European+Standard+Time)&version=6.27.0&isIABGlobal=false&hosts=&consentId=1b8e7b19-f8bf-4716-94fa-4253f2d1fd14&interactionCount=1&landingPath=NotLandingPage&groups=C0001%3A1%2CC0002%3A0%2CC0003%3A0%2CC0004%3A0%2CSTACK42%3A0&AwaitingReconsent=false' \
  --compressed
Yes, it does look like they are blocking the Amazon Lambda IPs. Have you tried using different regions?
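To compare regions systematically, a loop like the following could record which ones end up on the bot page. This is a rough sketch under assumptions: the `endpoints` mapping holds one hypothetical proxy URL per region that you would have deployed yourself, and `fetch_final_url` is a function you supply that sends the request through a given proxy and returns the URL you finally land on. Neither name comes from lambda-scraper.

```python
def check_regions(endpoints, target_url, fetch_final_url):
    """Return {region: True} for every region whose request was diverted.

    endpoints:       mapping of region name -> proxy endpoint (hypothetical)
    target_url:      the page we are trying to fetch
    fetch_final_url: callable(endpoint, target_url) -> final URL reached
    """
    blocked = {}
    for region, endpoint in endpoints.items():
        final_url = fetch_final_url(endpoint, target_url)
        # Landing anywhere other than the requested page counts as blocked.
        blocked[region] = final_url != target_url
    return blocked
```

If every region comes back as blocked, that would point to a blanket rule on Amazon address space rather than on one particular region.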
Hi, thanks for the nice project. I was able to create my proxy Lambda functions and send requests through them.
However, when I send a GET request to a website (e.g. pararius.com), I am systematically redirected to the "we think you are a bot" page, starting with the very first request. I am rotating the user agent and sending reasonable headers.
From my own computer, no captcha is triggered. This makes me think that an AWS IP address immediately raises a red flag on the server side, which sends me to the captcha page (from the IP address one can easily look up the hostname and provider, which show that the request comes from Amazon servers).
Have you experienced this as well? Do you have any workaround in mind?
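For illustration, this is roughly how a server could recognise an Amazon address. AWS publishes its address ranges as JSON at https://ip-ranges.amazonaws.com/ip-ranges.json, and a membership check needs only the standard library. The sketch below uses a tiny inline sample in the same shape as that file; the prefixes are placeholder documentation ranges, not real AWS allocations, and `find_matching_prefixes` is a hypothetical helper name. Whether pararius.com actually does this is a guess.

```python
import ipaddress

# Sample in the shape of AWS's ip-ranges.json. The prefixes are reserved
# documentation ranges used as placeholders, not real AWS allocations.
SAMPLE_RANGES = {
    "prefixes": [
        {"ip_prefix": "192.0.2.0/24", "region": "eu-west-1", "service": "EC2"},
        {"ip_prefix": "198.51.100.0/24", "region": "us-east-1", "service": "AMAZON"},
    ]
}


def find_matching_prefixes(ip, ranges):
    """Return every published IPv4 prefix entry that contains the address."""
    address = ipaddress.ip_address(ip)
    matches = []
    for entry in ranges.get("prefixes", []):
        if address in ipaddress.ip_network(entry["ip_prefix"]):
            matches.append(entry)
    return matches
```

With the real file loaded in place of `SAMPLE_RANGES`, passing in the outbound IP address of a Lambda function would show which region and service it is listed under.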