Open infiBoy opened 7 years ago
Specific parts:
Among the good practices that have emerged over time to improve project quality, the most widely used methodologies are SEMMA and CRISP-DM, the latter being the most used since the 2010s.
For the algorithms, there are two families of data-analysis methods, the predictive and the descriptive, whose sub-methods partly overlap:
Descriptive methods: they make it possible to simplify, organize and understand the information in a large data set.
Techniques derived from statistics can be used. Factorial analyses such as principal component analysis, independent component analysis, multidimensional scaling and multiple correspondence analysis are the most common.
We can also employ newer methods from artificial intelligence, such as machine learning.
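As a minimal sketch of one of the descriptive techniques above, principal component analysis can be computed from scratch with numpy (the toy data set is made up for illustration; the issue does not prescribe a tool):

```python
# Minimal from-scratch PCA sketch (numpy only; the toy data is made up).
import numpy as np

# 6 observations, 3 correlated features
X = np.array([
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 0.2],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.4],
    [2.3, 2.7, 1.0],
])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: cov is symmetric
order = np.argsort(eigvals)[::-1]       # sort components by variance, descending
components = eigvecs[:, order[:2]]      # keep the top 2 components
X_reduced = Xc @ components             # project the data onto them

print(X_reduced.shape)  # (6, 2)
```

In practice a library routine (e.g. scikit-learn's PCA) would replace this, but the eigendecomposition of the covariance matrix is the idea being simplified away.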
Predictive methods: their primary purpose is to predict or explain one or more observable and effectively measured phenomena.
They rely on one or more variables designated as the targets of the analysis. In predictive data exploration there are two kinds of operations: classification, which deals with qualitative variables, and regression (prediction), which deals with continuous variables. These methods separate individuals into several classes, in a supervised or unsupervised way. A good model is a fast model with a low error rate. Several indicators are used to evaluate the quality of a model, among which the ROC and lift curves, the Gini index and the mean squared error compare the prediction with reality.
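A small sketch of how some of those quality indicators are computed on toy predictions (the targets and scores are made up; ties in scores are ignored for simplicity):

```python
# Quality indicators for a toy binary classifier (made-up data).
import numpy as np

y_true  = np.array([0, 0, 1, 1, 0, 1])               # observed reality
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # predicted scores

# Error rate of the classifier thresholded at 0.5
y_pred = (y_score >= 0.5).astype(int)
error_rate = np.mean(y_pred != y_true)

# Mean squared error of the scores against reality
mse = np.mean((y_score - y_true) ** 2)

# AUC via the rank-sum formulation (fraction of positive/negative pairs
# ranked correctly), then Gini index = 2 * AUC - 1
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
auc = np.mean(pos[:, None] > neg[None, :])
gini = 2 * auc - 1

print(error_rate, mse, auc, gini)
```

The Gini index here is the common "2·AUC − 1" form tied to the ROC curve; the lift curve would be computed similarly by sorting individuals by score.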
Computer tools: in 2009, SPSS, RapidMiner, SAS, Excel, R, KXEN, Weka, Matlab, KNIME, Microsoft SQL Server, Oracle DM, STATISTICA and CORICO were the most widely used tools. By 2010, R was the most widely used. Today these tools are also used in the cloud, e.g. Oracle Data Mining on Amazon IaaS.
Reading tasks: major source: http://dmml.asu.edu/smm/SMM.pdf
Specific parts:
If unfamiliar with graph or data-science basics, first learn the essentials (chapter 2 + chapter 5); otherwise read as needed.
1. Network (I, 3)
2. Community analysis (II, 6)
3. Influence measuring (III, 8)
4. Behavioral analytics (III, 10)
5. Information diffusion (II, 7)
Make a summary for each module, with specific emphasis on the algorithms that seem the most practical.
Hands-on practice (when you finish each task, make a pull request):
Create a facebook account + write/use the script to log in .... Implement (or find) all the following operations: Follow, Post, Like, Share, MakeFriendRequest, and a query on facebookGraph.
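One way the listed operations could share a common shape across platforms is an abstract interface; every name below is hypothetical (this is not an existing library's API), and note that automating a real Facebook account may violate its terms of service:

```python
# Hypothetical interface for the listed bot operations.
# Class and method names are illustrative, not an existing API.
from abc import ABC, abstractmethod

class SocialBot(ABC):
    """Operations the task asks to implement (or find) on a platform."""

    @abstractmethod
    def follow(self, account_id: str) -> None: ...

    @abstractmethod
    def post(self, text: str) -> str: ...  # returns the new post's id

    @abstractmethod
    def like(self, post_id: str) -> None: ...

    @abstractmethod
    def share(self, post_id: str) -> None: ...

    @abstractmethod
    def make_friend_request(self, account_id: str) -> None: ...

    @abstractmethod
    def query_graph(self, query: str): ...  # e.g. a facebookGraph query
```

A per-platform subclass (facebook, twitter, ...) would then fill in the actual requests, which keeps the later cross-platform logging task to one code path.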
Create a monitoring and logging component that logs everything the bot is doing (the time it posted, the post itself, statistics about the post) - it needs to be cross-social-platform!
On Twitter:
- Detect a network with a certain property (using the algorithms mentioned above)
- Draw the network and its connections (using a graph DB / the networkX library)
- ... In that network, detect the most influential accounts.
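For the last step, one common influence measure on a follower graph is PageRank, which networkX provides; the edge list below is made up for illustration, and real data would come from the Twitter network detected above:

```python
# Sketch: rank accounts in a small directed follower graph by PageRank.
# The edge list is made up; account names are illustrative.
import networkx as nx

edges = [  # (follower, followed)
    ("alice", "carol"), ("bob", "carol"), ("dave", "carol"),
    ("carol", "erin"), ("dave", "erin"), ("erin", "carol"),
]
G = nx.DiGraph(edges)

# PageRank as a simple influence measure on the "follows" graph
scores = nx.pagerank(G, alpha=0.85)
most_influential = max(scores, key=scores.get)
print(most_influential)
```

Degree centrality or the influence-measuring algorithms from module 3 of the reading list could be swapped in the same way; drawing the graph is then one `nx.draw(G)` call away.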