Open joemarct opened 5 years ago
@alexdashkov You suggested Crawlera, can you provide feedback on this issue?
I use Crawlera a lot and haven't seen the blocks described in this issue. However, I used it through Scrapy (with a retry policy and user-agent rotation).
@shawnlauzon can you check crawlera stats? Do you have a lot of bad requests there?
Of course, we can set up selenium, but I'm not sure that it's something that is required right now.
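For readers unfamiliar with the "retry policy and user-agent rotation" mentioned above, here is a minimal sketch of the idea using only the standard library. It is not Scrapy's middleware and not code from this project; `USER_AGENTS`, `http_get`, and `fetch_with_retries` are hypothetical names chosen for illustration, and the injected `fetch` callable is an assumption that makes it easy to swap in a proxied client.

```python
import itertools
import time
import urllib.error
import urllib.request

# Hypothetical pool of browser strings; any realistic values would do.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) "
    "Gecko/20100101 Firefox/61.0",
]


def http_get(url, headers, timeout=30):
    """Default fetcher: plain urllib GET returning (status, body)."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses still carry a status and body worth inspecting.
        return error.code, error.read()


def fetch_with_retries(url, fetch=http_get, max_attempts=3,
                       backoff=1.0, sleep=time.sleep):
    """Retry a GET, switching User-Agent on every attempt.

    Returns (status, body) of the first 2xx response, or of the last
    attempt if none succeeded. Assumes max_attempts >= 1.
    """
    agents = itertools.cycle(USER_AGENTS)
    status, body = None, b""
    for attempt in range(max_attempts):
        status, body = fetch(url, {"User-Agent": next(agents)})
        if 200 <= status < 300:
            break
        if attempt < max_attempts - 1:
            # Exponential backoff: backoff, 2*backoff, 4*backoff, ...
            sleep(backoff * 2 ** attempt)
    return status, body
```

Connection-level failures (`urllib.error.URLError`) are deliberately left to propagate here; a production version would probably want to retry those as well.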
I cannot check it right now. You should have access to the system, could you check it?
So right now I don't see any requests in Crawlera.
As of today, I do see requests
They come from my local tests.
Oh, you don't see any requests from the server, I understand now.
I just added CRAWLERA_TOKEN to the config.
So @alexdashkov you're saying that rather than running Selenium, we could use Scrapy (provided by Scrapinghub) to do the scrapes. Is that correct?
No, I didn’t mean that this project should use scrapy. I meant that I didn’t see the problems described in this issue.
The summary seems to be that we don't need to implement this unless we see a problem with the existing fallback solution, which relies purely on Crawlera. Moving to Backlog.
Crawlera only provides proxies, and that only helps us if scraping fails because bitcointalk blocked our IP. If bitcointalk uses Cloudflare to detect and block scraper bots (which I have personally seen a few times), Crawlera proxies cannot help us.
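To make the distinction between the two failure modes concrete, here is a rough heuristic sketch for telling a Cloudflare challenge page apart from an ordinary block. The status codes and text markers are assumptions based on how Cloudflare's challenge page has commonly looked; they are not taken from this project and may change at any time. `CHALLENGE_MARKERS` and `classify_block` are hypothetical names.

```python
# Assumed markers of a Cloudflare browser-check page (not guaranteed stable).
CHALLENGE_MARKERS = (
    "checking your browser",
    "cf-browser-verification",
    "just a moment",
)


def classify_block(status, headers, body):
    """Return 'ok', 'cloudflare', or 'blocked' for an HTTP response.

    status: integer HTTP status code
    headers: mapping of header names to values
    body: raw response bytes
    """
    if 200 <= status < 300:
        return "ok"
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    server = lowered.get("server", "").lower()
    text = body.decode("utf-8", errors="replace").lower()
    looks_like_challenge = any(marker in text for marker in CHALLENGE_MARKERS)
    if status in (403, 503) and "cloudflare" in server and looks_like_challenge:
        return "cloudflare"
    return "blocked"
```

A result of `'blocked'` suggests that switching proxy IPs may be enough, while `'cloudflare'` suggests the request needs to come from something that behaves like a real browser.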
Crawlera provides new proxy IPs on demand, but it's still up to us whether we use these proxies to send direct HTTP requests or put them behind a browser.
In the spirit of designing a more resilient scraping system, I propose that we use these proxies behind a browser using selenium. That way, our fallback circumvents both IP blocking and Cloudflare's scraper bot blocking.
Here's a guide on how to use Crawlera with Selenium: https://support.scrapinghub.com/support/solutions/articles/22000203564-using-crawlera-with-selenium-and-polipo
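A minimal sketch of what this proposal could look like: a Selenium-driven Chrome whose traffic goes through a local Polipo instance, which in turn forwards to Crawlera with the API key. Everything here is an assumption for illustration rather than this project's code: the helper names are hypothetical, port 8123 is Polipo's usual default, the Crawlera endpoint `proxy.crawlera.com:8010` and the Polipo options should be checked against the guide, and reading the key from a `CRAWLERA_TOKEN` environment variable is a guess at how the config exposes it.

```python
import os


def polipo_config(api_key, listen_port=8123):
    """Render a minimal Polipo config that forwards to Crawlera.

    Crawlera takes the API key as the proxy username with an empty
    password, hence the trailing colon.
    """
    return (
        f"proxyPort = {listen_port}\n"
        'parentProxy = "proxy.crawlera.com:8010"\n'
        f'parentAuthCredentials = "{api_key}:"\n'
    )


def chrome_arguments(proxy_host="127.0.0.1", proxy_port=8123, headless=True):
    """Build Chrome command-line flags that route traffic via the local proxy."""
    arguments = [f"--proxy-server=http://{proxy_host}:{proxy_port}"]
    if headless:
        arguments.append("--headless")
    return arguments


def fetch_page_source(url, arguments):
    """Load a page in Chrome through the proxy and return its HTML."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


if __name__ == "__main__":
    # Assumed variable name; adjust to however the config exposes the key.
    print(polipo_config(os.environ.get("CRAWLERA_TOKEN", "")))
    print(chrome_arguments())
```

The local forwarding proxy is there because Chrome has no simple command-line flag for proxy credentials; Polipo holds the key so the browser only needs an unauthenticated `--proxy-server` address.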