sjyk / sampleclean-async

http://sampleclean.org
Apache License 2.0
92 stars 27 forks

Optimizations for the demo: you can now load the data, clean the data, and run a query #13

Closed sjyk closed 10 years ago

sjyk commented 10 years ago

Basic steps:

(1) Get the parsed datasets from Sanjay.

(2) sampleclean> load demo

(3) dedupAttr

(4) selectrawsc count(1) from paper_aff_sample group by affiliation

    (null,(95200.0,6.224245907723261))
    ( Microsoft,(8500.0,0.5919010185835524))
    ( NBER Contributor,(6900.0,0.48102612861572625))
    ( Microsoft Research,(3600.0,0.2515531482236896))
    ( Beihang University (Beijing University of Aeronautics and Astronautics),(3200.0,0.22366561260700343))
    ( Stanford University,(2900.0,0.20273965544210695))
    ( Tsinghua University China,(2700.0,0.18878410997389897))
    ( Seoul National University,(2400.0,0.16784343073417143))
    ( University of California Los Angeles,(2300.0,0.16086124137761817))
    ( Tohoku University,(2100.0,0.14689391824954529))

(5) selectnsc count(1) from paper_aff_sample group by affiliation

    ( University of California Los Angeles,(500.0,3.379943013302008))
    ( Department of Computer Science and Software Engineering|University of Melbourne,(500.0,3.379943013302008))
    ( Institute for Theoretical and Experimental Physics Moscow Russia,(500.0,3.379943013302008))
    ( Department of Physics and Astronomy|University of Glasgow,(400.0,2.7235838437497124))
    ( Department of Computing Science University of Alberta Edmonton Alberta T6G 2E8 Canada,(400.0,2.7235838437497124))
    ( Department of Computer Science|University of Minnesota,(400.0,2.7235838437497124))
    ( University of Queensland,(400.0,2.7235838437497124))
    ( Massachusetts Institute of Technology Cambridge MA 02139 USA,(400.0,2.7235838437497124))
    ( Division of Hepatology Department of Medicine University of Miami School of Medicine,(400.0,2.7235838437497124))
    ( University of California San Diego,(400.0,2.7235838437497124))

// It's too selective to do things by conference, e.g. sigmod

(6) sampleclean> selectrawsc count(1) from paper_sample join paper_aff_sample on paper_sample.id = paper_aff_sample.paperid where paper_sample.conference = 370 group by paper_aff_sample.affiliation

    ( Microsoft Research,(100.0,0.006036174345567835))
    ( Computer Sciences Department|University of Wisconsin-Madison,(100.0,0.00587302517501822))
    ....
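For anyone who wants to compare the rows printed by the queries above outside the shell, here is a minimal Python sketch that parses one printed row into its parts. It is not part of this PR or of SampleClean: `parse_row` and `ROW` are hypothetical names, and the row shape `(group,(number,number))` is inferred only from the output pasted above. The comment does not say what the two numbers mean; they look like an estimate followed by an error bound, but that reading is an assumption, so the sketch just calls them `first` and `second`.

```python
import re

# One printed row, as pasted in this comment: (group,(number,number))
# The greedy group match lets affiliation names contain parentheses or "|".
ROW = re.compile(
    r"^\((?P<group>.*),\((?P<first>[-+0-9.eE]+),(?P<second>[-+0-9.eE]+)\)\)$"
)


def parse_row(line):
    """Split one printed row into (group, first, second).

    The meaning of the two numbers is an assumption left to the caller;
    this helper only recovers them as floats.
    """
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError("not a result row: %r" % line)
    group = match.group("group").strip()  # drop the leading space in names
    return group, float(match.group("first")), float(match.group("second"))


if __name__ == "__main__":
    rows = [
        "(null,(95200.0,6.224245907723261))",
        "( Microsoft,(8500.0,0.5919010185835524))",
        "( Beihang University (Beijing University of Aeronautics and "
        "Astronautics),(3200.0,0.22366561260700343))",
    ]
    for row in rows:
        print(parse_row(row))
```

Anchoring the pattern at the end of the line, rather than splitting on the first comma, keeps affiliation names intact when they contain parentheses. The literal `....` line that closes the last listing is not a row and raises `ValueError`.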