Paper part 4: Cluster analysis based on Hadoop

wangzhenhui1991 commented 7 years ago

Large scale cluster analysis with Hadoop and Mahout.pdf

DataSet

Finally, the data to be processed consists of two data sets. One compiled from the music database Last.FM. The data set was created in 2007 and consists of approximately 20 000 unique artists tagged with 100 000 unique tags (the total tag count is roughly 7.1 million). [8] The second data set is from Tumblr and consists of a snapshot of the activity across the site for a continuous period of time. The data set is large in both dimensionality, approximately 40 million unique tags, and cardinality, approxima

wangzhenhui1991 commented 7 years ago

Performance analysis of Hadoop vs. MATLAB

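On this point: MATLAB's matrix computations use Intel's own Math Kernel Library (MKL), which is optimized at the assembly level. C is fast at plain loops, but for matrix operations this library is generally much faster than other BLAS/LAPACK libraries.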

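Hadoop's use case is parallel computation over big data; it aims to remove the disk read/write bottleneck.
MATLAB is better suited to validating the performance of algorithms locally; the problem it addresses is CPU efficiency.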

wangzhenhui1991 commented 7 years ago

A detailed analysis of Mahout's K-Means implementation: "Learning Mahout: K-Means Clustering"
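To make the map/reduce shape of K-Means concrete, here is a minimal single-process sketch in Python. It is not Mahout's code: the function names (`map_point`, `combine`, `kmeans_iteration`) are made up for illustration, and it assumes points are plain tuples of floats and that distance is Euclidean. The map step assigns each point to its nearest centroid, partial (sum, count) pairs are merged the way a combiner would, and the reduce step averages them into new centroids.

```python
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def combine(a, b):
    """Combiner: merge two partial (vector sum, count) pairs."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return tuple(x + y for x, y in zip(sum_a, sum_b)), n_a + n_b


def kmeans_iteration(points, centroids):
    """One K-Means iteration written as map -> combine -> reduce."""
    partial = {}
    for p in points:
        key, value = map_point(p, centroids)
        partial[key] = combine(partial[key], value) if key in partial else value

    # Reduce step: new centroid = vector sum / count.
    # A cluster that received no points keeps its old centroid.
    new_centroids = list(centroids)
    for key, (vec_sum, n) in partial.items():
        new_centroids[key] = tuple(x / n for x in vec_sum)
    return new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(0.0, 1.0), (9.0, 9.0)]
updated = kmeans_iteration(points, centroids)
```

In a real Hadoop job the loop over `points` would be spread across mappers and the driver would rerun the job until the centroids stop moving; the sketch only shows what a single iteration computes.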

wangzhenhui1991 commented 7 years ago

FCM (fuzzy c-means) on MapReduce: http://ai2-s2-pdfs.s3.amazonaws.com/936d/1fc30c82db64ea06a80a2c17b635299b7a48.pdf
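For comparison with K-Means, a rough Python sketch of one fuzzy c-means iteration in the same map/reduce shape. It uses the standard textbook FCM update rules, not necessarily the scheme in the linked paper; the names `memberships` and `fcm_iteration` and the default fuzzifier `m=2.0` are assumptions for illustration. The difference from K-Means is that every point contributes to every cluster, weighted by its membership degree raised to the power `m`.

```python
from math import dist  # Euclidean distance, Python 3.8+


def memberships(point, centers, m=2.0):
    """Map step: membership degree of one point in every cluster (sums to 1)."""
    d = [dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if 0.0 in d:
        hit = d.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in d) for d_i in d]


def fcm_iteration(points, centers, m=2.0):
    """One FCM iteration: accumulate weighted sums, then divide (reduce step)."""
    dims = len(centers[0])
    num = [[0.0] * dims for _ in centers]  # sum of u^m * x per cluster
    den = [0.0 for _ in centers]           # sum of u^m per cluster
    for p in points:
        for i, u in enumerate(memberships(p, centers, m)):
            w = u ** m
            den[i] += w
            for k in range(dims):
                num[i][k] += w * p[k]
    # A cluster with zero total weight keeps its old center.
    return [
        tuple(x / den[i] for x in num[i]) if den[i] > 0 else centers[i]
        for i in range(len(centers))
    ]


u = memberships((0.0, 0.0), [(1.0, 0.0), (-1.0, 0.0)])
new_centers = fcm_iteration([(0.0, 0.0), (2.0, 0.0)], [(0.0, 0.0), (2.0, 0.0)])
```

Because the per-cluster numerator and denominator are plain sums, they can be accumulated per mapper and merged in a reducer, which is what makes the algorithm a natural fit for MapReduce.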
