joelgrus / data-science-from-scratch

code for Data Science From Scratch book
MIT License

Normalisation scheme in simplified PageRank algorithm #76

Open phonosync opened 5 years ago

phonosync commented 5 years ago

Hi Joel,

From my point of view, your implementation of the simplified PageRank algorithm does not follow the protocol outlined in the book. I only have the first edition at hand, where it says:

  1. There is a total of 1.0 PageRank in the network. (This should still be true at the end of the calculation, but your implementation violates it: the total PageRank at the end of the calculation in your script amounts to 1.0425.)

  2. Initially this PageRank is equally distributed among nodes.

  3. At each step, a large fraction of each node's PageRank is distributed evenly among its outgoing links.

  4. At each step, the remainder of each node's PageRank is distributed evenly among all nodes. (This point is missing in your implementation.)
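To make the four points concrete, here is a minimal sketch of how they could be coded. It is not the book's code and not the version proposed in this issue. The name `simple_pagerank`, the edge-list input, and the toy graph are all hypothetical. The list does not say what to do with a node that has no outgoing links, so the sketch assumes such a node shares all of its PageRank evenly among all nodes.

```python
from collections import defaultdict


def simple_pagerank(nodes, edges, damping=0.85, num_iters=100):
    """Sketch of the four steps; `edges` holds (source, target) pairs."""
    nodes = list(nodes)
    n = len(nodes)
    outgoing = defaultdict(list)
    for source, target in edges:
        outgoing[source].append(target)

    # Points 1 and 2: a total of 1.0, split equally among the nodes.
    pr = {node: 1.0 / n for node in nodes}

    for _ in range(num_iters):
        next_pr = {node: 0.0 for node in nodes}
        shared = 0.0  # PageRank to be spread evenly over all nodes
        for node in nodes:
            targets = outgoing[node]
            if targets:
                # Point 3: a large fraction goes evenly to outgoing links.
                for target in targets:
                    next_pr[target] += damping * pr[node] / len(targets)
                # Point 4: the remainder goes into the shared pool.
                shared += (1 - damping) * pr[node]
            else:
                # Assumption: a node without outgoing links shares everything.
                shared += pr[node]
        for node in nodes:
            next_pr[node] += shared / n
        pr = next_pr
    return pr


# Hypothetical toy graph; node 3 has no outgoing links.
toy_nodes = [0, 1, 2, 3]
toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
ranks = simple_pagerank(toy_nodes, toy_edges)
total = sum(ranks.values())
print(ranks)
print("total PageRank:", total)
```

Every unit of PageRank a node holds is either passed along its links or put into the shared pool, so the total stays at 1.0 (up to floating-point rounding) after every iteration.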

I am proposing a version which implements point 4 in a straightforward way. There might be more elegant ways, but this one is easy to understand. The results are numerically very close to those of the NetworkX implementation. See the numbers below.

Original results (user id: PageRank)

    0: 0.1
    1: 0.1
    2: 0.1
    3: 0.1
    4: 0.14250000000000002
    5: 0.1
    6: 0.1
    7: 0.1
    8: 0.1
    9: 0.1

NetworkX results

    import networkx as nx
    G = nx.DiGraph()
    G.add_nodes_from([user.id for user in users])
    G.add_edges_from(endorsements)
    pr_nx = nx.pagerank(G, 0.85)

user id: PageRank

    0: 0.09499151348469306
    1: 0.10547758964858775
    2: 0.10547758964858775
    3: 0.09499151348469306
    4: 0.1593177423515437
    5: 0.10200959185661473
    6: 0.07857495588955458
    7: 0.07857495588955458
    8: 0.10200959185661472
    9: 0.07857495588955458

New results (user id: PageRank)

    0: 0.0949906958425375
    1: 0.10547659652084887
    2: 0.10547659652084887
    3: 0.0949906958425375
    4: 0.1593168333463994
    5: 0.10201123958329422
    6: 0.07857536758674652
    7: 0.07857536758674652
    8: 0.10201123958329422
    9: 0.07857536758674652
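As a quick check of point 1, the three result sets can be summed and compared. The values in this short sketch are copied from the numbers in this comment. The variable names are only illustrative.

```python
# PageRank values for user ids 0..9, copied from the results in this comment.
original = [0.1, 0.1, 0.1, 0.1, 0.14250000000000002, 0.1, 0.1, 0.1, 0.1, 0.1]
networkx_pr = [
    0.09499151348469306, 0.10547758964858775, 0.10547758964858775,
    0.09499151348469306, 0.1593177423515437, 0.10200959185661473,
    0.07857495588955458, 0.07857495588955458, 0.10200959185661472,
    0.07857495588955458,
]
new = [
    0.0949906958425375, 0.10547659652084887, 0.10547659652084887,
    0.0949906958425375, 0.1593168333463994, 0.10201123958329422,
    0.07857536758674652, 0.07857536758674652, 0.10201123958329422,
    0.07857536758674652,
]

# Point 1 says each total should be 1.0.
for name, values in [("original", original), ("NetworkX", networkx_pr), ("new", new)]:
    print(f"{name}: total = {sum(values):.6f}")

# Largest per-node gap between the new results and NetworkX.
max_diff = max(abs(a - b) for a, b in zip(new, networkx_pr))
print(f"largest |new - NetworkX| = {max_diff:.2e}")
```

The original values sum to 1.0425, while the NetworkX and new values each sum to 1.0 within rounding. The largest per-node difference between the latter two is below 1e-5.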