Basic Steps:
(1) get parsed datasets from Sanjay
(2) sampleclean> load demo
(3) dedupAttr
(4) selectrawsc count(1) from paper_aff_sample group by affiliation
(null,(95200.0,6.224245907723261))
( Microsoft,(8500.0,0.5919010185835524))
( NBER Contributor,(6900.0,0.48102612861572625))
( Microsoft Research,(3600.0,0.2515531482236896))
( Beihang University (Beijing University of Aeronautics and Astronautics),(3200.0,0.22366561260700343))
( Stanford University,(2900.0,0.20273965544210695))
( Tsinghua University China,(2700.0,0.18878410997389897))
( Seoul National University,(2400.0,0.16784343073417143))
( University of California Los Angeles,(2300.0,0.16086124137761817))
( Tohoku University,(2100.0,0.14689391824954529))
(5) selectnsc count(1) from paper_aff_sample group by affiliation
( University of California Los Angeles,(500.0,3.379943013302008))
( Department of Computer Science and Software Engineering|University of Melbourne,(500.0,3.379943013302008))
( Institute for Theoretical and Experimental Physics Moscow Russia,(500.0,3.379943013302008))
( Department of Physics and Astronomy|University of Glasgow,(400.0,2.7235838437497124))
( Department of Computing Science University of Alberta Edmonton Alberta T6G 2E8 Canada,(400.0,2.7235838437497124))
( Department of Computer Science|University of Minnesota,(400.0,2.7235838437497124))
( University of Queensland,(400.0,2.7235838437497124))
( Massachusetts Institute of Technology Cambridge MA 02139 USA,(400.0,2.7235838437497124))
( Division of Hepatology Department of Medicine University of Miami School of Medicine,(400.0,2.7235838437497124))
( University of California San Diego,(400.0,2.7235838437497124))
// It's too selective to filter by conference, e.g. SIGMOD
(6) sampleclean> selectrawsc count(1) from paper_sample join paper_aff_sample on paper_sample.id = paper_aff_sample.paperid where paper_sample.conference = 370 group by paper_aff_sample.affiliation
( Microsoft Research,(100.0,0.006036174345567835))
( Computer Sciences Department|University of Wisconsin-Madison,(100.0,0.00587302517501822))
....
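
The result rows printed in steps (4) to (6) all share one shape: (group key,(number,number)). Below is a minimal sketch, assuming Python, for turning pasted rows into tuples so they can be sorted or compared across queries. The helper name parse_row is hypothetical and not part of SampleClean. These notes do not say what the two numbers mean, so the sketch keeps them as plain floats without naming them.

```python
import re

# One result row looks like: (<group key>,(<first number>,<second number>))
# The key may itself contain commas or parentheses, so the pattern anchors
# on the trailing ",(<num>,<num>))" and lets the key take everything before it.
ROW = re.compile(
    r"^\((?P<key>.*),\((?P<first>[-+0-9.eE]+),(?P<second>[-+0-9.eE]+)\)\)$"
)


def parse_row(line):
    """Return (key, first, second) for a result row, or None if it does not match.

    Hypothetical helper for illustration only. A literal 'null' key is
    mapped to None; leading/trailing spaces around the key are dropped.
    """
    match = ROW.match(line.strip())
    if match is None:
        return None
    key = match.group("key").strip()
    if key == "null":
        key = None
    return key, float(match.group("first")), float(match.group("second"))


# Rows copied from the output of step (4).
rows = [
    "(null,(95200.0,6.224245907723261))",
    "( Microsoft,(8500.0,0.5919010185835524))",
    "( Beihang University (Beijing University of Aeronautics and Astronautics),(3200.0,0.22366561260700343))",
]

parsed = [parse_row(row) for row in rows]
for key, first, second in parsed:
    print(key, first, second)
```

Lines that are not result rows, such as the "...." elision or the commands themselves, come back as None and can be filtered out before further processing.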