twitter / cassovary

Cassovary is a simple big graph processing library for the JVM
http://twitter.com/cassovary
Apache License 2.0
1.05k stars 150 forks source link

[Bug?] PageRank sum is much less 1 #186

Closed alexloginov closed 9 years ago

alexloginov commented 9 years ago

Hi Everybody!

I'm trying to calculate PageRank for a graph toy_7nodes_adj_StringIds.txt using cassovary-core_2.11:5.0.0

My code is pretty simple:

package com.datanovo.pagerank

import com.twitter.cassovary.algorithms.{PageRankParams, PageRank}
import com.twitter.cassovary.util.SequentialNodeNumberer
import com.twitter.cassovary.util.io.{AdjacencyListGraphReader}

object FailingPR {
  def main(args: Array[String]): Unit = {
    val numberer = new SequentialNodeNumberer[String]()
    val graph = new AdjacencyListGraphReader[String](
      ".",
      "toy_7nodes_adj_StringIds.txt",
      numberer,
      idReader = identity
    ).toArrayBasedDirectedGraph()

    val result = PageRank(graph, PageRankParams(dampingFactor = 0.85))
    println (result.sum)
  }
}

But I was very surprised with results:

0.2810312946428572

So it is not 1 in total. After this i decided to print out ranks for each node separately:

a:0.03266071428571429 b:0.06042232142857144 c:0.047108058035714294 d:0.047108058035714294 e:0.021428571428571432 f:0.039642857142857146 g:0.03266071428571429

Than, i've tried next graph:

a 3 b c d

but result is still about 0.2.

Graph like:

a 1 b b 1 c c 1 a

Work perfectly and give a sum ~1 Also, I tried to generate RandomGraph and run PageRank, total sum of ranks about 1 as well. But with real graph it just not working (or may be i'm doing something wrong).

Could you please assist me?

bmckown commented 9 years ago

I wrote the implementation for pagerank. @pankajgupta I will take a look at this.

alexloginov commented 9 years ago

Note: I've tried to override GraphReader and put StoredGraphDir.BothInOut (assumed, that problem could be in missing InboundNodes references), but it didn't help.

bmckown commented 9 years ago

A quick note: release 5.1.0 has a new implementation of all link analysis algorithms including PageRank and Hits. However, when running the above graph on the new version, there is a divide by zero error occurring yielding infinity for nodes with no outbound neighbors. Good catch @alexloginov. I will fix the implementation and submit a patch.

alexloginov commented 9 years ago

@bmckown Thank you for a very quick response. I will be glad to get this patch!!!

pankajgupta commented 9 years ago

Looking forward to the PR, @bmckown :)

On Wed, Jun 17, 2015 at 7:53 AM, alexloginov notifications@github.com wrote:

@bmckown https://github.com/bmckown Thank you for a very quick response. I will be glad to get this patch!!!

— Reply to this email directly or view it on GitHub https://github.com/twitter/cassovary/issues/186#issuecomment-112830415.

pankajgupta commented 9 years ago

Can you verify post merge, it works for you @alexloginov ?

alexloginov commented 9 years ago

@bmckown, @pankajgupta Verified on 5 different small graph. Everything works correctly. Today i'm going to test it against real-world data (52M edges, 5M nodes) and will write results.

alexloginov commented 9 years ago

@bmckown, @pankajgupta Issue fixed. Checked on pretty big graph (52M edges, 5M nodes) - everything works like a charm!

Very good work guys, i'm very appreciated!

@pankajgupta when fixed version will be available in TWTR maven or maven-central? Right now i'm using self-packaged version, but hold 23M jar file as unmanaged dependency is a bit messy way.

Thank you, again! Alexander

pankajgupta commented 9 years ago

@alexloginov I have released 5.1.3 to Maven -- might take a day to appear in maven central.

pankajgupta commented 9 years ago

It is now in maven central. Closing this bug.

alexloginov commented 9 years ago

@pankajgupta Awesome! Thanks you very much!