volentixlabsinc / venue-server

The backend services for Venue, a community engagement platform for the Volentix community
https://venue.volentix.io
MIT License

Setup browser-based scraping as fallback #210

Open joemarct opened 5 years ago

joemarct commented 5 years ago

Crawlera only provides proxies, and those help only when scraping fails because bitcointalk has blocked our IP. If bitcointalk uses Cloudflare to detect and block scraper bots (which I have personally seen a few times), Crawlera proxies cannot help us.

Crawlera provides new proxy IPs on demand; it's still up to us whether to use these proxies to send direct HTTP requests or to put them behind a browser.

In the spirit of designing a more resilient scraping system, I propose that we use these proxies behind a browser using selenium. That way, our fallback circumvents both IP blocking and Cloudflare's scraper bot blocking.

Here's a guide on how to use Crawlera with Selenium: https://support.scrapinghub.com/support/solutions/articles/22000203564-using-crawlera-with-selenium-and-polipo
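To make the proposed fallback order concrete, here is a minimal sketch, not the project's actual code. It assumes each scraping strategy (direct request, request through a Crawlera proxy, Selenium behind a proxy) can be wrapped as a callable that returns a status code and a body. The names `fetch_with_fallback`, `looks_blocked`, and `BLOCK_MARKERS` are hypothetical, and the block-detection heuristic (status codes and challenge-page phrases) is an assumption that would need checking against what bitcointalk really returns.

```python
from typing import Callable, Iterable, Tuple

# A fetcher takes a URL and returns (status_code, body).
Fetcher = Callable[[str], Tuple[int, str]]

# Assumption: phrases that tend to appear on bot-challenge pages.
# These are illustrative, not an exhaustive or verified list.
BLOCK_MARKERS = ("Attention Required!", "Just a moment...")


def looks_blocked(status: int, body: str) -> bool:
    """Heuristic: treat 403/429/503 or a challenge-page marker as a block."""
    if status in (403, 429, 503):
        return True
    return any(marker in body for marker in BLOCK_MARKERS)


def fetch_with_fallback(
    url: str, fetchers: Iterable[Tuple[str, Fetcher]]
) -> Tuple[str, str]:
    """Try each named fetcher in order.

    Returns (name, body) for the first fetcher whose response does not
    look blocked; raises RuntimeError if every fetcher fails.
    """
    errors = []
    for name, fetch in fetchers:
        try:
            status, body = fetch(url)
        except Exception as exc:  # a crash counts as a failed attempt
            errors.append(f"{name}: {exc!r}")
            continue
        if looks_blocked(status, body):
            errors.append(f"{name}: blocked (HTTP {status})")
            continue
        return name, body
    raise RuntimeError("all fetchers failed: " + "; ".join(errors))
```

The fetchers would be passed in order of cost, for example `[("direct", ...), ("crawlera", ...), ("selenium", ...)]`, so the browser is only started when the cheaper strategies are blocked.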

shawnlauzon commented 5 years ago

@alexdashkov You suggested Crawlera, can you provide feedback on this issue?

alexdashkov commented 5 years ago

I use Crawlera a lot and haven't seen the blocks described in this issue. However, I was using it through Scrapy (with a retry policy and user-agent rotation).

@shawnlauzon can you check crawlera stats? Do you have a lot of bad requests there?

Of course, we can set up selenium, but I'm not sure that it's something that is required right now.
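For readers unfamiliar with the retry policy and user-agent rotation mentioned above, the idea can be sketched with the standard library alone. This is a stand-in for illustration, not Scrapy's middleware and not the configuration used in this project. The function name `fetch_with_retries`, the example user-agent strings, and the exponential backoff schedule are all assumptions.

```python
import itertools
import time
import urllib.request
from typing import Callable, Sequence

# Assumption: example user-agent strings, chosen only for illustration.
USER_AGENTS = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
)


def default_send(request: urllib.request.Request) -> bytes:
    """Send the request over the network and return the response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def fetch_with_retries(
    url: str,
    user_agents: Sequence[str] = USER_AGENTS,
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
    send: Callable[[urllib.request.Request], bytes] = default_send,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Fetch url, switching user agent and backing off after each failure."""
    agents = itertools.cycle(user_agents)
    last_error = None
    for attempt in range(max_attempts):
        request = urllib.request.Request(
            url, headers={"User-Agent": next(agents)}
        )
        try:
            return send(request)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            last_error = exc
            if attempt < max_attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay between attempts.
                sleep(backoff_seconds * 2 ** attempt)
    raise RuntimeError(
        f"giving up on {url} after {max_attempts} attempts"
    ) from last_error
```

The `send` and `sleep` parameters exist so the retry logic can be exercised without network access or real delays.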

shawnlauzon commented 5 years ago

I cannot check it right now. You should have access to the system, could you check it?

alexdashkov commented 5 years ago

Right now I don't see any requests in Crawlera.

shawnlauzon commented 5 years ago

As of today, I do see requests:

[image]

alexdashkov commented 5 years ago

Those come from my local tests.

shawnlauzon commented 5 years ago

Oh, you don't see any requests from the server, I understand now.

I just added CRAWLERA_TOKEN to the config.

So @alexdashkov you're saying that rather than running Selenium, we could use Scrapy (provided by Scrapinghub) to do the scrapes. Is that correct?

alexdashkov commented 5 years ago

No, I didn’t mean that this project should use scrapy. I meant that I didn’t see the problems described in this issue.

shawnlauzon commented 5 years ago

The summary seems to be that we don't need to implement this unless we see a problem with the existing fallback of using Crawlera alone. Moving to Backlog.