mozilla / mozfest-program

INACTIVE - Where we're reviewing and scheduling the MozFest sessions.

How to run a stealth scraper farm with Docker and Tor #115

Closed mmmavis closed 8 years ago

mmmavis commented 8 years ago

[ Google Spreadsheet Row Number ] 85
[ Facilitator ] Pierre Romera


How do you scrape 40+ sites in real time for a good cause, without putting any load on them and without being detected? I'll show how to use Python, Docker, EC2 and Tor to do just that, and discuss the legal and ethical implications.
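The proposal does not spell out how requests are routed through Tor, so the following is only a minimal sketch under stated assumptions: a Tor client is already listening on its default SOCKS port (127.0.0.1:9050), and the third-party `requests` library is installed with SOCKS support (`requests[socks]`). The names `TOR_SOCKS_PROXY`, `tor_proxies`, `polite_delay` and `fetch` are hypothetical, not taken from the actual scraper farm.

```python
import random
import time

# Assumption: a local Tor client exposes its default SOCKS port.
# "socks5h" makes the proxy (Tor) resolve hostnames, not the local machine.
TOR_SOCKS_PROXY = "socks5h://127.0.0.1:9050"


def tor_proxies(proxy_url=TOR_SOCKS_PROXY):
    """Build the proxies mapping that requests expects."""
    return {"http": proxy_url, "https": proxy_url}


def polite_delay(base_seconds=30.0, jitter=0.5, rng=random):
    """Return a randomized wait time, to keep the load on a site low."""
    return base_seconds * (1 + rng.uniform(-jitter, jitter))


def fetch(url, timeout=60):
    """Fetch one page through Tor after a randomized pause."""
    import requests  # third-party; SOCKS support needs requests[socks]

    time.sleep(polite_delay())
    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `fetch(url)` only works while a Tor client is running locally. The base delay of 30 seconds is an arbitrary illustrative value and should be tuned per site.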


Plan for a 90-minute workshop (can be adjusted up or down):
(1) 10 minutes: Quick description of our scraper farm and its architecture.
(2) 10 minutes: The ethical implications of scraping and the legal context.
(3) Rest of the session: Starting from a boilerplate I'll have prepared in advance, we'll build a small scraper farm to scrape the prices of medicines in different countries. I'll show how to build one scraper for France and describe the data structure we'll use; then groups formed by language skills will scrape websites from countries whose language they speak. I'll prepare a list of English-language websites (from the UK, Ireland, India, etc.) for people who do not have a team.
(4) Once we have a couple of scrapers ready, I'll put them online on EC2.
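The data structure and the French scraper mentioned in step (3) are not described in the proposal. As a rough illustration only, a shared record type and a regex-based price parser might look like the sketch below. The `PriceRecord` fields, the `parse_fr_price` helper and the sample values are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PriceRecord:
    """One scraped price; hypothetical fields shared by all country scrapers."""
    country: str     # ISO 3166-1 alpha-2 code, e.g. "FR"
    drug_name: str
    price: Decimal
    currency: str    # ISO 4217 code, e.g. "EUR"
    source_url: str


# Matches French-style prices such as "12,50 €" or "1 234,50 €"
# (comma as decimal separator, optional space between thousands).
FR_PRICE = re.compile(r"(\d{1,3}(?:[ \u00a0]\d{3})*|\d+),(\d{2})\s*€")


def parse_fr_price(text):
    """Return the first French-formatted price in text, or None."""
    match = FR_PRICE.search(text)
    if match is None:
        return None
    whole = re.sub(r"[ \u00a0]", "", match.group(1))
    return Decimal(f"{whole}.{match.group(2)}")


# Illustrative values only; the URL is a placeholder.
record = PriceRecord(
    country="FR",
    drug_name="Paracétamol 500 mg",
    price=parse_fr_price("Prix : 2,18 €"),
    currency="EUR",
    source_url="https://example.org/medicaments/paracetamol",
)
```

Using `Decimal` rather than `float` avoids rounding surprises when prices are compared or converted later. Each country's scraper would need its own parser but could emit the same record type.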


The hardest part will be forming groups that combine the following skills: the language of the website to scrape, data analysis (e.g. to code an algorithm that checks that the scraper hasn't picked up a test page), Python, regex, and basic UNIX CLI commands. Such skills are usually easy to come by at MozFest.
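The proposal does not say how a test page would be recognized. One possible sanity check, sketched here with entirely hypothetical heuristics (the placeholder words and the price ceiling are assumptions), is to flag records whose name or price looks implausible:

```python
import re
from decimal import Decimal

# Whole-word match, so that real names such as "Testosterone" are not flagged.
PLACEHOLDER_WORDS = re.compile(
    r"\b(test|lorem ipsum|dummy|placeholder)\b", re.IGNORECASE
)


def looks_like_test_page(drug_name, price, max_price=Decimal("100000")):
    """Heuristically flag a scraped record that probably came from a test page."""
    name = drug_name.strip()
    if not name:
        return True  # an empty name is never a real product
    if PLACEHOLDER_WORDS.search(name):
        return True  # placeholder wording in the product name
    if price is None or price <= 0 or price > max_price:
        return True  # missing, zero, negative or implausibly high price
    return False
```

Flagged records would still need human review, since a heuristic like this can miss test pages and can misfire on unusual but genuine products.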

If the audience is small, we'll do just one scraper together.

If the audience has a mix of skills that prevents the creation of manageable groups, I'll introduce those who are interested to code-free scrapers, such as Kimono.


If things go well, we'll have a series of scrapers running that collect the prices of drugs in several countries. Together with the participants who wish to do so, we'll polish the scrapers and let them run for a few days.

Once we have a sizable database, we'll check with NGOs in that field (Madrid's Civio recently released a similar project) whether they are interested in using the data and/or taking the scrapers forward.

Melechuga commented 8 years ago

cc @mixedpuppy @shaghdoosti @MozStacy Struggling with the pathway label: Stop Watching Me? Backdoors + Cryptowars?

mixedpuppy commented 8 years ago

@Melechuga I don't understand how this fits in Digital Citizenship.

Melechuga commented 8 years ago

@mixedpuppy eh, I suppose I got caught up in the use of Tor and "ethical implications of scraping and the legal context". Feel free to take it out if you'd like.

Saallen commented 8 years ago

cc @erikao here also

marcwalsh-zz commented 8 years ago

We appreciated this submission, but unfortunately this session does not fit within the narrative of our Space for 2015. We hope you will submit another session for 2016.