trappedinspacetime / wikiteam

Automatically exported from code.google.com/p/wikiteam

Setting up a crawler to list all MediaWiki wikis in the web #59

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
To grow our wikiteam collection of wikis, I need to expand our list of wikis. 
To archive our first 4,500 wikis, we used Andrew Pavlo's list. Now I want to 
adapt his crawling framework (see the source linked at 
http://www.cs.brown.edu/~pavlo/mediawiki/ , and its README) to get a more 
up-to-date and complete list. I created my settings.py, used pip to install 
Django 1.2, installed MySQL-python from my distribution's repositories, replaced 
httplib2 with httplib... and finally got stuck on MySQL errors. Unless someone 
else runs it for me, I need something simpler: most of the features in the 
original graffiti framework are excessive, and in particular there's no reason 
I should need a database. I'd like to modify/strip it down to a self-contained 
version that just produces a list of domains running MediaWiki.

Original issue reported on code.google.com by nemow...@gmail.com on 4 May 2013 at 10:28
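For the stripped-down, database-free version described above, the detection step could be a simple probe of the standard MediaWiki api.php endpoint. A minimal sketch in Python (the endpoint and `generator` field are standard MediaWiki API behaviour; the function names and the list of common api.php paths are my own assumptions, not code from the graffiti framework):

```python
import json
import urllib.request

# Typical locations of api.php across common MediaWiki installs (an assumption,
# not an exhaustive list).
COMMON_API_PATHS = ("/api.php", "/w/api.php", "/wiki/api.php")

def looks_like_mediawiki(siteinfo_json: str) -> bool:
    """Return True if a siteinfo API response identifies a MediaWiki install.

    The siteinfo query returns query.general.generator, e.g. "MediaWiki 1.21.1".
    """
    try:
        data = json.loads(siteinfo_json)
    except ValueError:
        return False
    general = data.get("query", {}).get("general", {})
    return "MediaWiki" in general.get("generator", "")

def probe_domain(domain: str, timeout: float = 10.0) -> bool:
    """Try the common api.php locations on a domain (requires network access)."""
    for path in COMMON_API_PATHS:
        url = f"http://{domain}{path}?action=query&meta=siteinfo&format=json"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if looks_like_mediawiki(resp.read().decode("utf-8", "replace")):
                    return True
        except OSError:
            continue
    return False
```

A crawler could then feed candidate domains through `probe_domain` and write the positives to a flat file, with no database involved.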

GoogleCodeExporter commented 8 years ago
I checked out that code and didn't see Django required anywhere, so I think I 
was looking at the wrong thing. Would you mind forking it into a VCS somewhere 
online, or pointing me directly to the root so I can put it up on GitHub?

Original comment by seth.woo...@gmail.com on 12 Aug 2013 at 3:31

GoogleCodeExporter commented 8 years ago
Thanks for looking. It's not my code or repo; I just followed 
http://graffiti.cs.brown.edu/svn/graffiti/README

Original comment by nemow...@gmail.com on 12 Aug 2013 at 5:03

GoogleCodeExporter commented 8 years ago

Original comment by nemow...@gmail.com on 25 Oct 2013 at 12:45

GoogleCodeExporter commented 8 years ago
I made something with Ruby Mechanize: https://gist.github.com/nemobis/7718061
I learnt a lot making it (euphemism for "banged my head against it"), but I'm 
not sure it will be useful, because the search results actually returned by 
Google (or Yahoo) are far fewer than "promised".
For instance, searching for "Magnus Manske, Brion Vibber, Lee Daniel Crocker" 
should be a rather reliable way to find exactly one page per site 
(Special:Version): it gives an estimate of 25k results, but actually returns 
only around a hundred.

Original comment by nemow...@gmail.com on 1 Dec 2013 at 12:01
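The trick above relies on every Special:Version page crediting the same three principal developers. A small sketch of how the candidate URLs returned by a search engine could be verified with that fingerprint (Python rather than the Ruby of the gist; the function names are hypothetical, and matching on the three names is exactly the heuristic described in the comment):

```python
# The three developer names every MediaWiki Special:Version page credits,
# as used in the search query above.
DEVELOPER_NAMES = ("Magnus Manske", "Brion Vibber", "Lee Daniel Crocker")

def is_special_version_page(html: str) -> bool:
    """Heuristic: a genuine Special:Version page contains all three names."""
    return all(name in html for name in DEVELOPER_NAMES)

def filter_candidates(pages):
    """Given (url, html) pairs fetched from search results, yield the URLs
    whose HTML passes the Special:Version fingerprint check."""
    for url, html in pages:
        if is_special_version_page(html):
            yield url
```

This only filters pages already found; it does not help with the real problem reported here, namely that the search engines return far fewer results than their estimates claim.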

GoogleCodeExporter commented 8 years ago

Original comment by nemow...@gmail.com on 31 Jan 2014 at 3:27