socialsensor / community-evolution-analysis

A framework for the analysis of social interaction networks (e.g. induced by Twitter mentions) in time.

We provide a Twitter interaction network collector and a set of Matlab and Python scripts for analysing social interaction networks, with the goal of uncovering evolving communities. The collector builds a network between Twitter users from the mentions in the set of monitored tweets (using the Streaming API). The interaction network is adaptively partitioned into snapshot graphs based on the frequency of interactions. Each graph snapshot is then partitioned into communities using the Louvain method [2]. Dynamic communities are extracted by matching the communities of the current graph snapshot to the communities of the previous snapshot. Finally, the dynamic communities are ranked and presented to the user according to three factors: stability, persistence and community centrality. The user can browse these communities, in which the users are also ranked by their own snapshot-specific centrality, and can browse wordclouds that summarise the most frequently used terms in each dynamic community, as well as its most frequent URLs. Centrality is measured with the PageRank algorithm [3].

[1] K. Konstandinidis, S. Papadopoulos, Y. Kompatsiaris. "Community Structure, Interaction and Evolution Analysis of Online Social Networks around Real-World Social Phenomena". In Proceedings of PCI 2013, Thessaloniki, Greece (to be presented in September).
[2] V. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre. "Fast unfolding of communities in large networks". In Journal of Statistical Mechanics: Theory and Experiment (10), P10008, 2008.
[3] S. Brin and L. Page. "The anatomy of a large-scale hypertextual web search engine". Comput. Netw. ISDN Syst., 30(1-7):107-117, Apr. 1998.
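
The step that matches communities between consecutive snapshots can be illustrated with a small sketch. The description above does not say which similarity measure the framework uses, so this example assumes Jaccard similarity between member sets and a made-up threshold of 0.3; the function names are hypothetical and are not taken from the repository.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of user names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_communities(previous, current, threshold=0.3):
    """Map each current community index to its best-matching previous index.

    `previous` and `current` are lists of user collections, one per community.
    A current community with no match at or above `threshold` maps to None.
    Both the measure and the threshold are illustrative assumptions.
    """
    matches = {}
    for cur_idx, cur in enumerate(current):
        best_idx, best_score = None, 0.0
        for prev_idx, prev in enumerate(previous):
            score = jaccard(prev, cur)
            if score >= threshold and score > best_score:
                best_idx, best_score = prev_idx, score
        matches[cur_idx] = best_idx
    return matches


previous_snapshot = [{"alice", "bob", "carol"}, {"dave", "erin"}]
current_snapshot = [{"alice", "bob", "frank"}, {"grace", "heidi"}]
matches = match_communities(previous_snapshot, current_snapshot)
print(matches)
```

In this toy input the first current community shares two of four distinct users with the first previous community and is treated as its continuation, while the second current community has no predecessor and would start a new dynamic community.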

Distribution Information

This distribution contains the following:

Evolution analysis using Python

Any new data (json files) to be analysed should be placed in the ../data/json/ folder. For the Python scripts to work, the data must follow Twitter's JSON format: the "entities", "user", "created_at", "id_str" and "text" keys and their paths must be identical to Twitter's.
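
As a rough illustration of what that input requirement means, the sketch below checks one tweet for the five keys listed above and pulls out the mention information. It assumes the standard Twitter layout in which the author is at `user.screen_name` and the mentioned accounts are at `entities.user_mentions[].screen_name`; the function name and the sample tweet are made up, and the repository's own scripts may read different sub-fields.

```python
import json

# Keys that the text above says must match Twitter's layout.
REQUIRED_KEYS = ("entities", "user", "created_at", "id_str", "text")


def extract_mention_record(tweet):
    """Return the author, mentioned users, time, id and text of one tweet dict."""
    missing = [key for key in REQUIRED_KEYS if key not in tweet]
    if missing:
        raise KeyError("tweet is missing required keys: " + ", ".join(missing))
    mentions = tweet["entities"].get("user_mentions", [])
    return {
        "author": tweet["user"]["screen_name"],
        "mentioned": [m["screen_name"] for m in mentions],
        "created_at": tweet["created_at"],
        "id_str": tweet["id_str"],
        "text": tweet["text"],
    }


# A made-up tweet carrying only the fields used here.
raw = json.dumps({
    "id_str": "1",
    "created_at": "Wed Aug 27 13:08:45 +0000 2008",
    "text": "@bob @carol hello",
    "user": {"screen_name": "alice"},
    "entities": {"user_mentions": [{"screen_name": "bob"},
                                   {"screen_name": "carol"}]},
})
record = extract_mention_record(json.loads(raw))
print(record["author"], record["mentioned"])
```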

Code

The Python code consists of 8 files containing user-friendly scripts for performing Community Evolution Analysis on json and txt files acquired from the Twitter social network. The framework was implemented in Python 3.3. The required third-party libraries are dateutil (which requires pyparsing), numpy, matplotlib, PIL and networkx (http://www.lfd.uci.edu/~gohlke/pythonlibs/).

The python folder contains 8 files:

Python Results

The framework provides the user with 5 pieces of resulting data in the ../data/results/ folder:
a) the useractivity.eps file, which presents the user mentioning activity according to the selected sampling interval;
b) the usersPairs(num).txt files, which can be used with the Gephi visualization software to view the graphs;
c) the rankedcommunities variable (from main.py), which contains all the communities that evolved (and their users), ranked according to the persistence, stability and community centrality triplet;
d) the community size heatmap (communitySizeHeatmap.eps), which visualizes the sizes of the 100 most important communities (ranked from top to bottom), together with a set of wordclouds (in the ../results/wordclouds folder) collaged in a manner similar to the heatmap (separate wordclouds for every ranked dynamic community are also created in the same folder);
e) the rankedCommunities.json file, which contains all the ranked communities along with the specific timeslots of evolution for each community, its persistence, stability and community centrality values, and all the users in each community accompanied by their own centrality measure.

As such, the framework can be used to discover the most important communities and the most important users inside those communities, to follow the most referenced URLs, and to get an idea of the topics discussed via the wordclouds.

Evolution analysis using Matlab

Any new data to be analysed should be placed in the ../data/ folder. If the data come from a different source, then for the Python files to work they should either be in a Twitter-like JSON form (the "entities", "user" and "created_at" keys and paths should be identical to Twitter's) or in a txt file of the form:

user1 \TAB user2,user3... \TAB "created_at_timestamp" \TAB text \newline  

Step1: json Parsing (Python)

The Python parsing code consists of 8 files containing user-friendly scripts for parsing the required data from json files. There are 4 files to be used with jsons from any other Twitter API dependent source.
More specifically, they are used to create txt files which contain the mention entries between Twitter users, as well as the time at which these mentions were made and the context in which they were included.

The json_mention_multifile* files produce one txt file per json file. Each txt file contains all the information required from the tweets in a readable form:

user1 \TAB user2,user3... \TAB "created_at_timestamp" \TAB text \newline
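
A minimal sketch of writing and reading one line in this layout is given below. It assumes that the quotes around the timestamp are literal characters in the file and that tabs and newlines inside the tweet text are collapsed to spaces so that each tweet stays on one line; both points are assumptions, and the function names are hypothetical rather than taken from the json_mention_multifile* scripts.

```python
def format_mention_line(author, mentioned, created_at, text):
    """Build one tab-separated line: author, mentioned users, time, text."""
    # Collapse whitespace so embedded tabs or newlines cannot break the layout.
    clean_text = " ".join(text.split())
    fields = [author, ",".join(mentioned), '"' + created_at + '"', clean_text]
    return "\t".join(fields) + "\n"


def parse_mention_line(line):
    """Split a line written by format_mention_line back into its parts."""
    author, mentioned, created_at, text = line.rstrip("\n").split("\t", 3)
    return author, mentioned.split(","), created_at.strip('"'), text


line = format_mention_line(
    "alice",
    ["bob", "carol"],
    "Wed Aug 27 13:08:45 +0000 2008",
    "@bob @carol\thello\nworld",
)
parsed = parse_mention_line(line)
print(repr(line))
```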

The json_mention_matlab_singleFile* files produce a single file which contains only the data required to perform the community analysis efficiently. It holds the information in a "Matlab friendly" form, with one line per (author, mentioned user) pair:

user1 \TAB user2 \TAB unix_timestamp \TAB \newline
user1 \TAB user3 \TAB unix_timestamp \TAB \newline
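
The expansion from one tweet with several mentions into one line per pair can be sketched as follows. The sketch assumes that the timestamp is the Twitter `created_at` string converted to whole seconds since the Unix epoch, and it reproduces the trailing tab shown in the layout above; the function names are hypothetical and not taken from the json_mention_matlab_singleFile* scripts.

```python
from datetime import datetime

# Layout of Twitter's "created_at" strings, e.g. "Wed Aug 27 13:08:45 +0000 2008".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_unix(created_at):
    """Convert a Twitter created_at string to integer Unix seconds."""
    return int(datetime.strptime(created_at, TWITTER_TIME_FORMAT).timestamp())


def mention_pair_lines(author, mentioned, created_at):
    """Return one 'author TAB mentioned TAB unix_timestamp TAB' line per mention."""
    timestamp = to_unix(created_at)
    return ["{}\t{}\t{}\t\n".format(author, user, timestamp) for user in mentioned]


pair_lines = mention_pair_lines(
    "alice", ["bob", "carol"], "Wed Aug 27 13:08:45 +0000 2008"
)
for pair_line in pair_lines:
    print(repr(pair_line))
```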

This folder contains 6 files:

The resulting authors_mentions_time.txt file should be added to the ../data/ folder.

Step2: Evolution Analysis (Matlab)

The Matlab code consists of 13 files which can work either as standalone scripts or as functions called from the main.m script. Switching between the two modes only requires commenting or uncommenting the function line in each of the m-files (instructions are included in the m-files).
These 13 files are:

There are also 4 helper functions, which are used to extract the position of each user in the adjacency matrix (pos_aloc_of_usrs.m), to create the adjacency matrix (adj_mat_creator.m), to perform the community detection (comm_detect_louvain.m) and to extract the centrality of each user using the PageRank algorithm (mypagerank.m).
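
To make the role of three of these helpers concrete, here is a small Python sketch of the same ideas: assigning each user a position, building an adjacency matrix from mention pairs, and running PageRank by power iteration. It is not a port of the Matlab files. The function names, the edge direction (from the mentioning user to the mentioned user), the damping factor of 0.85 and the stopping rule are all assumptions made for illustration.

```python
def user_positions(pairs):
    """Assign each user a row/column index in order of first appearance."""
    positions = {}
    for source, target in pairs:
        for user in (source, target):
            positions.setdefault(user, len(positions))
    return positions


def adjacency_matrix(pairs, positions):
    """Count mentions: entry [i][j] is how often user i mentioned user j."""
    size = len(positions)
    matrix = [[0] * size for _ in range(size)]
    for source, target in pairs:
        matrix[positions[source]][positions[target]] += 1
    return matrix


def pagerank(matrix, damping=0.85, iterations=200, tol=1e-9):
    """PageRank by power iteration on a dense weighted adjacency matrix."""
    size = len(matrix)
    ranks = [1.0 / size] * size
    out_weight = [sum(row) for row in matrix]
    for _ in range(iterations):
        # Users who mention nobody spread their rank evenly over all users.
        dangling = sum(ranks[i] for i in range(size) if out_weight[i] == 0)
        new_ranks = []
        for j in range(size):
            incoming = sum(
                ranks[i] * matrix[i][j] / out_weight[i]
                for i in range(size)
                if out_weight[i]
            )
            new_ranks.append(
                (1 - damping) / size + damping * (incoming + dangling / size)
            )
        change = sum(abs(new - old) for new, old in zip(new_ranks, ranks))
        ranks = new_ranks
        if change < tol:
            break
    return ranks


pairs = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]
positions = user_positions(pairs)
ranks = pagerank(adjacency_matrix(pairs, positions))
for user, index in positions.items():
    print(user, round(ranks[index], 4))
```

With this toy input, bob is mentioned by two users and ends up with the highest score, while carol, whom nobody mentions, keeps only the baseline share.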

Matlab Results

The final outcome is a cell array in ../data/mats/signifComms.mat containing the most significant dynamic communities, their users and the centrality of the users. .../data/mats/commEvol.mat and .../data/mats/commEvolSize.mat are also useful, as they present the evolution of the communities and the community sizes respectively. A heatmap presenting the evolution and size of all evolving communities is also produced (in ../data/figures/), giving the user an idea of the bigger picture.