
CS579: Online Social Network Analysis at the Illinois Institute of Technology

cluster.py question: keep data collected in our own trial checked in to GitHub? #384

Closed davidghiurco closed 7 years ago

davidghiurco commented 7 years ago

I'm using TwitterAPI, and due to rate limitations on REST calls, I make 1 request for 5000 tweets, which leaves me 14 requests to gather friend information for some randomly chosen users from those tweets.

14 users makes for some awfully uninteresting communities. My script can handle making more requests, but it will sleep for a long time. So, in order to have some kind of intelligent discussion about the results in the assignment summary, should we leave the data we collect over a couple of hours checked into our repository?
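For context, the sleep-and-retry logic is roughly this (a minimal sketch; robust_request and the placeholder credentials are illustrative names, not my exact code, and it assumes the TwitterAPI library's request() interface):

import time
from TwitterAPI import TwitterAPI

# placeholder credentials; the real keys are loaded elsewhere
api = TwitterAPI('CONSUMER_KEY', 'CONSUMER_SECRET',
                 'ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

def robust_request(resource, params, max_tries=5):
    # issue a request, sleeping ~15 minutes whenever the rate limit is hit
    for _ in range(max_tries):
        response = api.request(resource, params)
        if response.status_code == 200:
            return response
        print('rate limit exceeded; sleeping 15 minutes...')
        time.sleep(61 * 15)

response = robust_request('friends/ids', {'screen_name': 'some_user', 'count': 5000})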

Right now I have it set up so that upon running collect.py, all previously collected data is wiped and collection starts fresh.
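Concretely, the wipe step looks something like this (a sketch; the 'data' directory name is illustrative, not my actual layout):

import os
import shutil

DATA_DIR = 'data'  # illustrative path; my actual layout may differ

# remove any previously collected data so every run starts fresh
if os.path.exists(DATA_DIR):
    shutil.rmtree(DATA_DIR)
os.makedirs(DATA_DIR)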

aronwc commented 7 years ago

You can leave data checked in (assuming it's not too large; say < 50 MB). But it's also okay if your code sleeps in order to make more requests.

davidghiurco commented 7 years ago

So then I have a follow-up question.

This is the original graph that I'm trying to detect communities in. It depicts users along with their followers AND their friends (in this example, an equal number of requests was made to 'followers/ids' as to 'friends/ids'; I'm experimenting with changing those parameters around as well).

[image: graph of users with their followers and friends]

What often happens in my community detection is that a user is a bridge between 2 huge clusters, so my algorithm (Girvan-Newman, using edge betweenness centrality) removes the 2 edges surrounding that user and calls the result a new cluster. That happens because my Girvan-Newman loop does this (get_max_edge picks the edge with the highest betweenness):

import networkx as nx

# assumed helper: returns the edge with the highest betweenness centrality
def get_max_edge():
    betweenness = nx.edge_betweenness_centrality(graph_copy)
    return max(betweenness, key=betweenness.get)

components = list(nx.connected_component_subgraphs(graph_copy))
initial_num_components = len(components)
while len(components) == initial_num_components:
    max_edge = get_max_edge()
    print('removing edge: ' + str(max_edge))
    graph_copy.remove_edge(*max_edge)
    components = list(nx.connected_component_subgraphs(graph_copy))

Once that bridge user has its 2 edges removed, there's a 3rd component, so the loop terminates and yields some awfully uninteresting communities.
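One workaround I've been considering (just a sketch of my own, built on networkx's edge_betweenness_centrality and connected_components; min_size is a made-up parameter): keep removing edges until there are at least two components of non-trivial size, ignoring the singletons produced by bridge nodes:

import networkx as nx

def split_min_size(graph, min_size=2):
    # keep splitting until at least two components of size >= min_size exist
    graph_copy = graph.copy()
    while graph_copy.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(graph_copy)
        max_edge = max(betweenness, key=betweenness.get)
        graph_copy.remove_edge(*max_edge)
        components = [c for c in nx.connected_components(graph_copy)
                      if len(c) >= min_size]  # drop singleton "communities"
        if len(components) > 1:
            return components
    # fallback: the graph ran out of edges before the condition was met
    return list(nx.connected_components(graph_copy))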

I have yet to experiment with huge amounts of data (I'll have to leave the collection script running overnight); right now I'm working with the data I can get in 15 calls.

Is this normal/OK to happen? I'm kind of "scared" of 1-person communities.

vinaykai commented 7 years ago

I don't know if this is valid or not, but maybe you can remove the nodes that have only degree 2 (see the sketch below)? Then again, I guess you would automatically get a clustered graph, so my suggestion might be wrong. How about connecting friends of friends?
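Something like this, roughly (just a sketch of the idea; graph and the degree cutoff are placeholders):

import networkx as nx

def prune_low_degree(graph, max_degree=2):
    # drop nodes at or below the degree cutoff before clustering
    pruned = graph.copy()
    low_degree = [n for n, d in pruned.degree() if d <= max_degree]
    pruned.remove_nodes_from(low_degree)
    return pruned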

davidghiurco commented 7 years ago

Well, first of all, I'm using edge betweenness centrality, so I'm not removing any nodes; I'm removing edges in order of highest betweenness.

Second, I'm already removing edges from nodes of degree 2 or higher. The problem is that a node with degree exactly 2 is a bridge node (there are quite a few in the graph above): remove its 2 edges and you get more components, one of which is just 1 node, the bridge node.
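Here's a toy version of the failure mode (a self-contained sketch, not from my actual script):

import networkx as nx

# two triangles joined by a degree-2 bridge node (node 7)
G = nx.Graph([(1, 2), (2, 3), (3, 1),   # cluster A
              (4, 5), (5, 6), (6, 4),   # cluster B
              (1, 7), (7, 4)])          # node 7 bridges A and B

# the two edges incident to node 7 carry all inter-cluster shortest paths,
# so they have the highest betweenness and are removed first
betweenness = nx.edge_betweenness_centrality(G)
for edge in sorted(betweenness, key=betweenness.get, reverse=True)[:2]:
    G.remove_edge(*edge)

print(sorted(map(sorted, nx.connected_components(G))))
# [[1, 2, 3], [4, 5, 6], [7]] -- node 7 becomes a 1-person "community"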

davidghiurco commented 7 years ago

The professor answered this inquiry in class. Closing.