paulhoule / telepath

System for mining Wikipedia Usage data to read our collective mind
MIT License

:SubjectiveEye 3D #3

Open paulhoule opened 10 years ago

paulhoule commented 10 years ago

For this we divide the monthly counts by the normalization factor for each project*month, and then we normalize each project so that it adds up to, let's say, 5.
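
A minimal in-memory sketch of that two-step normalization, assuming dictionaries keyed by (project, month, page) and a factors table keyed by (project, month); the real computation is the projectNormalized3D Hadoop job, and these names are purely illustrative:

```python
from collections import defaultdict

# Illustrative sketch, not the actual Hadoop job: monthly_counts maps
# (project, month, page) -> raw hit count, and factors maps
# (project, month) -> the project*month normalization factor.
def normalize_counts(monthly_counts, factors, target_mass=5.0):
    # Step 1: divide each count by its project*month factor.
    adjusted = {
        key: count / factors[(key[0], key[1])]
        for key, count in monthly_counts.items()
    }

    # Step 2: rescale so each project's scores sum to target_mass
    # (this could just as well be done per project*month).
    project_totals = defaultdict(float)
    for (project, month, page), value in adjusted.items():
        project_totals[project] += value

    return {
        key: target_mass * value / project_totals[key[0]]
        for key, value in adjusted.items()
    }
```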

The issue is that these probabilities don't add up to 1, for several reasons. For one thing, people think about all kinds of things that aren't in Wikipedia, so a large amount of importance is simply not represented.

Another issue is that people are often thinking about more than one thing at a time: you might be thinking about the relationship between A and B, or thinking about 'Rhode Island' and therefore also about the 'United States'. If you're thinking about "Electric Funeral" you're also thinking about "Black Sabbath" and "Ozzy Osbourne", etc.

I get 5 from the idea that human cognitive capacity is about seven items, plus or minus two:

http://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus_or_Minus_Two

The moral is that if you want to use this data as a prior distribution, you'll need to tune the normalization for your data set.
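
Continuing the sketch above, that tuning is just the choice of target mass when the scores are folded in as a prior; the 5 is a default, not something baked into the data:

```python
# Hypothetical usage of the sketch above: pick the prior mass that suits
# your own data set rather than the default of 5.
prior = normalize_counts(monthly_counts, factors, target_mass=1.0)  # proper probabilities
prior = normalize_counts(monthly_counts, factors, target_mass=5.0)  # "magical number" budget
```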

paulhoule commented 10 years ago

Ok, here is the command used for the first successful tiny run:

haruhi run job -clusterId tinyAwsCluster -jarId telepath projectNormalized3D -input s3n://wikimedia-pagecounts/2008/2008-01/pagecounts-20080101-000000.gz -factors s3n://wikimedia-summary/monthly-sites.txt.gz -output s3n://wikimedia-summary/test/firstNormalized -R 1

Let's do a bigger run:

haruhi run job -clusterId smallAwsCluster -jarId telepath projectNormalized3D -input s3n://wikimedia-pagecounts/2008/2008-01/ -factors s3n://wikimedia-summary/monthly-sites.txt.gz -output s3n://wikimedia-summary/test/secondNormalized -R 4
paulhoule commented 10 years ago

We can run against the already-summarized data for speedup and cleanup:

haruhi run job -clusterId smallAwsCluster -jarId telepath projectNormalized3D -input s3n://wikimedia-summary/monthlyAll/2008-01 -factors s3n://wikimedia-summary/monthly-sites.txt.gz -output s3n://wikimedia-summary/test/summaryNormalized -R 4

The trick now is to get it to run over all months:

haruhi run job -clusterId smallAwsCluster -jarId telepath projectNormalized3D -dir s3n://wikimedia-summary/monthlyAll/ -input 2008-01,2008-02,2008-03,2008-04,2008-05,2008-06,2008-07,2008-08,2008-09,2008-10,2008-11,2008-12,2009-01,2009-02,2009-03,2009-04,2009-05,2009-06,2009-07,2009-08,2009-09,2009-10,2009-11,2009-12,2010-01,2010-02,2010-03,2010-04,2010-05,2010-06,2010-07,2010-08,2010-09,2010-10,2010-11,2010-12,2011-01,2011-02,2011-03,2011-04,2011-05,2011-06,2011-07,2011-08,2011-09,2011-10,2011-11,2011-12,2012-01,2012-02,2012-03,2012-04,2012-05,2012-06,2012-07,2012-08,2012-09,2012-10,2012-11,2012-12,2013-01,2013-02,2013-03,2013-04,2013-05,2013-06,2013-07,2013-08,2013-09,2013-10,2013-11,2013-12 -factors s3n://wikimedia-summary/monthly-sites.txt.gz -output s3n://wikimedia-summary/test/summaryNormalized -R 4
paulhoule commented 10 years ago

Ok, I found out that the 2013-03 file is missing. I should go back and get it, but I'm impatient to see the whole thing, so I do:

haruhi run job -clusterId largeAwsCluster -jarId telepath projectNormalized3D -dir s3n://wikimedia-summary/monthlyAll/ -input 2008-01,2008-02,2008-03,2008-04,2008-05,2008-06,2008-07,2008-08,2008-09,2008-10,2008-11,2008-12,2009-01,2009-02,2009-03,2009-04,2009-05,2009-06,2009-07,2009-08,2009-09,2009-10,2009-11,2009-12,2010-01,2010-02,2010-03,2010-04,2010-05,2010-06,2010-07,2010-08,2010-09,2010-10,2010-11,2010-12,2011-01,2011-02,2011-03,2011-04,2011-05,2011-06,2011-07,2011-08,2011-09,2011-10,2011-11,2011-12,2012-01,2012-02,2012-03,2012-04,2012-05,2012-06,2012-07,2012-08,2012-09,2012-10,2012-11,2012-12,2013-01,2013-02,2013-04,2013-05,2013-06,2013-07,2013-08,2013-09,2013-10,2013-11,2013-12 -factors s3n://wikimedia-summary/monthly-sites.txt.gz -output s3n://wikimedia-summary/3dNormalizedRaw -R 23
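
As a footnote, the month argument in these commands can be generated rather than typed out; a small illustrative snippet (not part of telepath) that reproduces the final list, skipping the missing 2013-03:

```python
# Build the comma-separated list of months 2008-01 .. 2013-12,
# dropping any month whose summary file is missing.
months = [f"{year}-{month:02d}" for year in range(2008, 2014) for month in range(1, 13)]
missing = {"2013-03"}
print(",".join(m for m in months if m not in missing))
```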