Request (or pointer how to build) - community algorithm using overlaps in graph

neo4j-contrib / neo4j-graph-algorithms

Efficient Graph Algorithms for Neo4j

https://github.com/neo4j/graph-data-science/

GNU General Public License v3.0

772 stars 194 forks source link

Request (or pointer how to build) - community algorithm using overlaps in graph #484

Open AndreSteenbergen opened 7 years ago

AndreSteenbergen commented 7 years ago

I would like to perform tasks on a webgraphs of websites I spider. There are however some issues with the current community detection algorithms. On most websites you can find structures quite easily as a human.

But in websites you will encounter strongly connected graphs everywhere, because all pages will link to a certain number of pages based on the design of a particular page. I would like to implement the RaRe algorithm described in the following document.

https://pdfs.semanticscholar.org/b57b/0b8f6483ba4f4f49ea6ee3c12f0f1062b74a.pdf

I haven't build neo4j procedures so I would't know where to start. Can anyone assist me in building such an algorithm, or provide pointers where to start?

My goal is to find the categories and/ or subcategories in websites. e.g. when I crawl a crafts websites I would see woodworking, knitting, painting etc. The community algorithms is this library all end up in one very large community and a small number very small communities.

jexp commented 7 years ago

Hi André, do you have any background in Java development?

If so, I could give you pointers where to look to familiarize yourself with neo4j-procedures (basically annotated methods)

but more so the implementation approach for neo4j-graph-algorithms (using low level APIs for parallel loading into an in-memory graph structure, building the algorithm on top of the Graph interfaces and writing results back in parallel).

In general you might be able to achieve what you want to do with cypher projections. By iterating over the categories, and having that drive running the categorical community detection.

AndreSteenbergen commented 7 years ago

No Java background, I have a c# background. The concept is the same and I have done some android programming (in Java). I should be able to pick it up again.

It's a relative large project to begin working in without any pointers where to start. I have tried something in cypher, but running for each loops and working with sets in cypher isn't all that pleasant, and not particularly fast. The described algorithm was also described in a lecture cs I found with promising results for web graphs.

jexp commented 7 years ago

If you want to we can have a call and we can look at the exisiting code.

The infrastructure for an algorithm is actually not big:

a procedure with user facing API
that loads the relevant data
initializes and runs the algorithm
the algorithm uses the graph api to access the relevant information efficiently
returns the data to the user or stores it in the graph

Can you provide the algorithm as a simple pseudocode or sketch so we could discuss how to implement it?

AndreSteenbergen commented 7 years ago

Yes I would like that. Somewhere next week, if that is possible

The algorithm is quite simple:

Initialize Set R with empty set
Create Set called ClusterCores
Create SCC
foreach H in SCC do
  ClusterComponent(H)

foreach node in R
  foreach cluster in clustercores
    add node to cluster if Weight of cluster with node > Weight of cluster without node

Function ClusterComponent(H)
  if (count(nodes) in cluster > max)
    remove highest ranking node in cluster H and add to R (using PageRank)
    foreach F in SCC(H)
       ClusterComponent(F)
  else
    add h to ClusterCores