sourcecred / sourcecred

a social algorithm for computing cred
https://sourcecred.io

Find an efficient way to represent PageRank results #1004

Closed. teamdandelion closed this issue 4 years ago.

teamdandelion commented 5 years ago

Currently, we represent the results of PageRank via the PagerankNodeDecomposition. The NodeDecomposition is a rich data type in that it allows us to answer the following questions:

Thus, the PagerankNodeDecomposition has all the data needed to run the cred analysis workflow of the explorer. (Although the explorer wants some additional plugin-specific metadata so that it can display node names prettily.)

Unfortunately, the PagerankNodeDecomposition is quite large. Now that we want to serialize these results to disk, we should find a better representation, lest we get blocked by the 100MB file size limit on GitHub pages.

Size of a PagerankNodeDecomposition

Why is the PagerankNodeDecomposition so large? Let's analyze it.

Let n be the number of nodes in the graph, e be the number of edges, A be the size of an address (for convenience we assume that node and edge addresses are the same length), F be the size of a float, and K be the size of a string constant ("SYNTHETIC_LOOP", "IN_EDGE", "OUT_EDGE").

Then the size of a PagerankNodeDecomposition is:

n * (K + 3F + A) + e * (2K + 8A + 4F)

Addresses are often quite long, so this expression is likely dominated by n*A + e*8A

(Derivation)

At the top level, we have:

```
export type PagerankNodeDecomposition = Map<
  NodeAddressT,
  {|
    +score: number,
    // Connections are sorted by `adjacencyScore` descending,
    // breaking ties in a deterministic (but unspecified) order.
    +scoredConnections: $ReadOnlyArray<ScoredConnection>,
  |}
>;
```

The size of a scoredConnection depends on whether it is a connection that came from an edge, or a connection that came from a synthetic self loop. However, every scoredConnection appears only once in the pagerankNodeDecomposition; there is exactly one synthetic loop connection per node, and two edge connections per edge. So at the top level, the size is:

`n*F + n * (loopConnection) + 2e * (edgeConnection)`

Now we analyze the connections:

```
export type ScoredConnection = {|
  +connection: Connection,
  +source: NodeAddressT,
  +connectionScore: number,
|};

export type Connection = {|
  +adjacency: Adjacency,
  // This `weight` is a conditional probability: given that you're at
  // the source of this connection's adjacency, what's the
  // probability that you travel along this connection to the target?
  +weight: Probability,
|};
```

So the size of a scoredConnection is always `A + 2F + (adjacency)`.

```
export type Adjacency =
  | {|+type: "SYNTHETIC_LOOP"|}
  | {|+type: "IN_EDGE", +edge: Edge|}
  | {|+type: "OUT_EDGE", +edge: Edge|};

export type Edge = {|
  +address: EdgeAddressT,
  +src: NodeAddressT,
  +dst: NodeAddressT,
|};
```

Then the loop adjacency has size `K` and the edge adjacency has size `K + 3A`. Putting it back together, every edge connection has size `4A + 2F + K`, and we have two edge connections per edge, which is how we get the really awful `8eA` term in the size of the decomposition.

(End Derivation)
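For a rough sense of scale, plugging in the figures discussed later in this thread (A ≈ 100 bytes for a JSON-stringified address, F ≈ 20 bytes for a float, and K on the order of 10–15 bytes for the short string constants) gives roughly `n*(K + 3F + A) ≈ 170n` plus `e*(2K + 8A + 4F) ≈ 900e` bytes, so on any graph where edges outnumber nodes the `e * 8A ≈ 800e` term alone accounts for most of the size.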

Size of a Graph

This is very inefficient compared to how we represent graphs.

When we store a graph, we store nA + e(A+2F). We achieve this because we are careful to serialize each address only once, and thereafter reference it by index.
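As a concrete illustration, here is a minimal sketch of that indexing trick (illustration only, not Graph's actual serialization code; plain strings stand in for the address types):

```
// Sketch of the integer-indexing trick (illustration only). Each node address
// is written exactly once, in sorted order; edges then refer to their src/dst
// by a small integer index instead of repeating a ~100-byte address.
function compactify(
  nodeAddresses: string[],
  edges: Array<{|+address: string, +src: string, +dst: string|}>
) {
  const nodes = nodeAddresses.slice().sort();
  const indexOf = new Map(nodes.map((a, i) => [a, i]));
  return {
    nodes,
    edges: edges.map((e) => ({
      address: e.address,
      srcIndex: indexOf.get(e.src), // integer, not a repeated address
      dstIndex: indexOf.get(e.dst),
    })),
  };
}
```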

Optimal Storage

Theoretically speaking, to encode the information in the PagerankNodeDecomposition, assuming that we already have the graph, we only need to store three kinds of information:

Thus, we should be able to store equivalent data in only n(A+F) + e(A+4F) + F

(Note: this excludes additional display metadata for the cred explorer, i.e. we are not storing the human-readable names for each node.)
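For illustration, here is a hypothetical on-disk shape that reaches that bound (field names are made up; GraphJSON is the graph's own serialized form):

```
// Hypothetical shape for the "optimal" representation (sketch only). Scores
// and weights are arrays aligned with the graph's node/edge ordering, so no
// address is repeated outside the graph itself.
type OptimalCredResultJSON = {|
  +graphJSON: GraphJSON, // n*A + e*(A + 2F), as in the graph analysis above
  +selfLoopWeight: number, // F
  +nodeScores: number[], // n*F, aligned with the graph's node order
  +edgeWeights: {|+toWeight: number, +froWeight: number|}[], // e*2F, aligned with the edge order
|};
// Total: n(A+F) + e(A+4F) + F
```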

However: how do we actually achieve this?

Less-Optimal But Convenient Implementation

A simple approach would be to store the extra data that we need in maps alongside the graph:

export type EdgeWeight = {|
  +toWeight: number, // weight from src to dst
  +froWeight: number, // weight from dst to src
|};

export type CredAnalysisResult = {|
  +graph: Graph,
  +selfLoopWeight: number,
  +edgeWeights: Map<EdgeAddressT, EdgeWeight>,
  +nodeScores: Map<NodeAddressT, number>,
|}

This does not leverage the integer indexing trick (except within the graph itself), so we pay F + nA + e(A+2F) + e(A+2F) + n(A+F), for a total of F + 2nA + nF + 2eA + 4eF, i.e. an extra nA + eA compared to optimal.

Comparison of Approaches

If we use the pND (the PagerankNodeDecomposition), then we have to store n + 8e addresses. If we implement the simple improvement, we store 2n + 2e addresses. If we implement the optimal approach, we store n + e addresses.

Given this information, I see 4 major courses of action:

  1. Implement optimal serialization from scratch.

We write entirely new code for serializing CredAnalysisResult as defined above, and reimplement the integer indexing trick. This has the advantage that it is local, but duplicating the integer serialization logic (and, implicitly, coming up with a whole new serialized format for Graph) is very unappealing.

  2. Peek into Graph's serialized format.

We could have a method that calls toJSON on the Graph, and then iterates over the index-ordered nodes and edges of the serialized graph, using that to produce an ordered array of node scores and edge weights. Then we can do the reverse on deserialization.

This has the advantage that it involves re-implementing less logic and is straightforward to implement (a rough sketch of this approach appears after the list of options below). However, depending on Graph's serialization format seems weird. It means that the serialized format becomes a part of Graph's public API. Right now this is not the case (GraphJSON is an opaque type).

  3. Modify Graph to allow payloads.

We could make Graph parameterizable by Node and Edge payload types. In this case, the node payload is a number, and the edge payload is an EdgeWeight. The Graph serialization logic could then efficiently store the payloads, and the CredAnalysisResult type above could be declared as:

type CredAnalysisResult = {|
  +scoredWeightedGraph: Graph<number, EdgeWeight>,
  +selfLoopWeight: number,
|}

This has the advantage that it doesn't re-implement graph's serialization, or subvert it. Also, once we have this, we may find other uses for it.

It has some disadvantages too:

  4. Don't worry about optimality

We can just implement the CredAnalysisResult with the maps alongside the graph. It's inefficient compared to optimal, but much better than using the PagerankNodeDecomposition. And if the 100MB size limit becomes an issue, we could further navigate around that by splitting it into separate files (for each top-level field) before needing to implement any complicated or hard-to-maintain solution.
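(Referring back to option 2: here is a hypothetical sketch of what peeking into the serialized graph might look like. The `nodeOrder` helper is invented purely for illustration; the real GraphJSON is opaque and exposes no such accessor today.)

```
// Sketch of option 2 (illustration only): align node scores with the node
// order implied by the graph's serialized form. `nodeOrder` is hypothetical.
function serializeScores(
  graph: Graph,
  nodeScores: Map<NodeAddressT, number>
): number[] {
  const orderedScores = [];
  for (const address of nodeOrder(graph.toJSON())) {
    const score = nodeScores.get(address);
    if (score == null) throw new Error("missing score");
    orderedScores.push(score);
  }
  return orderedScores; // deserialization walks the same order in reverse
}
```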

wchargin commented 5 years ago

Thanks for writing this up! Quick thoughts, will re-read later:

Addresses are often quite long

Yes: note that we are talking about JSON-stringified addresses, which render NUL characters as \u0000 (six bytes).

The average length of a JSON-stringified address in the SourceCred graph (including the quotes) is 102 bytes:

$ gzip -dc /tmp/sourcecred/data/sourcecred/sourcecred/github/view.json.gz |
> jq '.[1][] | keys | .[]' | awk '{ x += length($0) } END { print x / NR }'
102.385

so this expression is likely dominated by n*A + e*8A

It would be great to do some quick empirical analysis of decompositions of various sizes. Could you please post the JSON files for, say:

Then you or I can use some simple jq queries to investigate exactly where the bytes stack up. (Doing this on the post-gzip view is less well defined.)

For instance, while addresses may be about 100 bytes each, floats tend to be about 20 bytes each when JSON-stringified, and likely compress more poorly: address formats are extremely regular. (All GitHub addresses share a common 30-byte prefix, and all edges for a given repository have an extra ${owner}/${name} in every address. There’s little entropy.)

It means that the serialized format becomes a part of Graph's public API. Right now this is not the case (GraphJSON is an opaque type).

As soon as we have such a thing as a public API, the serialized format will be part of it. The type may be opaque, but there is a contract that data serialized with toJSON will be readable with fromJSON.

Modify Graph to allow payloads.

As you probably suspect, I really don’t want to do this.

Another way to achieve something similarly general without polluting Graph would be to have a dedicated compactGraph module whose sole purpose is to define a serialization format for a graph with payloads. You avoid parameterizing the graph and dealing with the issues that arise there (what type parameters do we use on plugin adapters? how about merge?), and you only have to deal with the extra complexity at the serialization boundaries.
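For concreteness, a possible shape for such a module (all names here are illustrative, not an existing SourceCred API):

```
// compactGraph.js (sketch): a serialization format for a graph plus node and
// edge payloads, using the indexing trick, without parameterizing Graph itself.
export opaque type CompactGraphJSON<N, E> = Object;

export function toCompactJSON<N, E>(
  graph: Graph,
  nodePayload: (NodeAddressT) => N,
  edgePayload: (EdgeAddressT) => E
): CompactGraphJSON<N, E> {
  throw new Error("Not implemented");
}

export function fromCompactJSON<N, E>(
  json: CompactGraphJSON<N, E>
): {|
  +graph: Graph,
  +nodePayloads: Map<NodeAddressT, N>,
  +edgePayloads: Map<EdgeAddressT, E>,
|} {
  throw new Error("Not implemented");
}
```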

Comparison of Approaches

I’ll add one:

5. Store the data in an SQLite database

With no indices, SQLite databases tend to be pretty space-efficient. They don’t have the gross byte inefficiencies of JSON (floats are actually stored in just 8 bytes!). We can also gzip them. See:

This simple schema should(?) suffice:

CREATE TABLE params (
    zero INTEGER NOT NULL PRIMARY KEY,  -- singleton table: value is 0
    self_loop_weight FLOAT NOT NULL
);
CREATE TABLE nodes (
    rowid INTEGER NOT NULL PRIMARY KEY,
    address BLOB UNIQUE NOT NULL,
    score FLOAT NOT NULL
);
CREATE TABLE edges (
    rowid INTEGER NOT NULL PRIMARY KEY,
    address BLOB NOT NULL,
    src INTEGER NOT NULL,
    dst INTEGER NOT NULL,
    weight FLOAT NOT NULL,
    FOREIGN KEY(src) REFERENCES nodes(rowid),
    FOREIGN KEY(dst) REFERENCES nodes(rowid)
);

To do this in production, we’d want to look into SQLite-in-browser. But it should be easy to prototype this on the server to see what the space looks like. Plus, SQLite-in-browser is a reasonable thing to look into anyway.
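A server-side prototype could be quite small; for example, a sketch using better-sqlite3 (not currently a dependency; names and details are illustrative only):

```
// Prototype sketch (illustration only): populate the schema above using
// better-sqlite3, a hypothetical new dependency. Assumes the CREATE TABLE
// statements above have already been executed against the database.
const Database = require("better-sqlite3");

function writePrototypeDb(path, selfLoopWeight, nodeScores) {
  const db = new Database(path);
  db.prepare("INSERT INTO params (zero, self_loop_weight) VALUES (0, ?)").run(
    selfLoopWeight
  );
  const insertNode = db.prepare("INSERT INTO nodes (address, score) VALUES (?, ?)");
  for (const [address, score] of nodeScores) {
    insertNode.run(address, score);
  }
  // Edges would be inserted analogously, resolving src/dst to node rowids.
}
```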

teamdandelion commented 5 years ago

Here are the results of a more detailed analysis. The tool used for reproducing these analyses can be found in the 1004-investigation branch. (Commit f2d39673c76caecb9eebbcb78f7e689637b01b91)

In each case, we show the sizes of each data structure (compressed and uncompressed), as computed by actually writing the data to disk. Also, we have a summary table showing the overall performance of three approaches described above:

On sourcecred/sourcecred

3.0M edgeAddressToEdgeWeightObject.json
172K edgeAddressToEdgeWeightObject.json.gz
2.9M graph.json
276K graph.json.gz
640K nodeAddressToScoreObject.json
104K nodeAddressToScoreObject.json.gz
 19M pagerankNodeDecomposition.json
956K pagerankNodeDecomposition.json.gz
408K unsortedEdgeWeightArray.json
 12K unsortedEdgeWeightArray.json.gz
104K unsortedNodeScoreArray.json
 48K unsortedNodeScoreArray.json.gz
| Name | Uncompressed | Compressed | Uncompressed/Optimal | Compressed/Optimal |
| --- | --- | --- | --- | --- |
| pND | 19.1M | 1.0M | 5.48x | 3.08x |
| easy | 6.7M | 0.5M | 1.92x | 1.71x |
| optimal | 3.5M | 0.3M | 1.00x | 1.00x |

On ipfs/go-ipfs

❯ ls -sh $SOURCECRED_DIRECTORY/data/ipfs/go-ipfs | grep json        
 48M edgeAddressToEdgeWeightObject.json
2.2M edgeAddressToEdgeWeightObject.json.gz
 43M graph.json
3.3M graph.json.gz
6.4M nodeAddressToScoreObject.json
992K nodeAddressToScoreObject.json.gz
261M pagerankNodeDecomposition.json
 13M pagerankNodeDecomposition.json.gz
5.7M unsortedEdgeWeightArray.json
 40K unsortedEdgeWeightArray.json.gz
1.1M unsortedNodeScoreArray.json
392K unsortedNodeScoreArray.json.gz
| Name | Uncompressed | Compressed | Uncompressed/Optimal | Compressed/Optimal |
| --- | --- | --- | --- | --- |
| pND | 272.9M | 13.2M | 5.25x | 3.48x |
| easy | 101.5M | 6.6M | 1.95x | 1.73x |
| optimal | 52.0M | 3.8M | 1.00x | 1.00x |

On twbs/bootstrap

❯ ls -sh $SOURCECRED_DIRECTORY/data/twbs/bootstrap | grep json
110M edgeAddressToEdgeWeightObject.json
5.2M edgeAddressToEdgeWeightObject.json.gz
104M graph.json
8.2M graph.json.gz
 21M nodeAddressToScoreObject.json
2.9M nodeAddressToScoreObject.json.gz
635M pagerankNodeDecomposition.json
 35M pagerankNodeDecomposition.json.gz
 14M unsortedEdgeWeightArray.json
 80K unsortedEdgeWeightArray.json.gz
3.5M unsortedNodeScoreArray.json
1.3M unsortedNodeScoreArray.json.gz
| Name | Uncompressed | Compressed | Uncompressed/Optimal | Compressed/Optimal |
| --- | --- | --- | --- | --- |
| pND | 665.0M | 35.7M | 5.28x | 3.61x |
| easy | 244.3M | 16.9M | 1.94x | 1.71x |
| optimal | 126.0M | 9.9M | 1.00x | 1.00x |

Not visible in these results, but definitely worth noting: computing the pagerankNodeDecomposition is very expensive, both in terms of compute and memory.

For twbs/bootstrap, running analyze takes 124s if I compute the pND, and 70s if I leave it out (still running PageRank, etc).

Also, if we compute the pND it blows the standard node heap size (and needs to be run with --max-old-space-size); if we leave the pND out it's fine.

teamdandelion commented 5 years ago

Yes: note that we are talking about JSON-stringified addresses, which render NUL characters as \u0000 (six bytes).

For the graph in particular, we actually store the ["array", "of", "parts"] without null separators.

It would be great to do some quick empirical analysis of decompositions of various sizes. Could you please post the JSON files for, say:

Sizes posted in the comment above; you can also get the tool from the linked commit and run it yourself to do jq investigations.

Use a SQLite database

Your schema looks pretty neat, although I'll note that the edge addresses should be unique, and the edge weight should be toWeight and froWeight separately.

I'm disinclined to make the switch to SQLite right now, since it complicates the contract between frontend and backend, and requires adding a new dependency, more research, etc. I'm more interested in making a reasonable and relatively easy choice here that is good enough to unblock #935.

With that said, either the "easy" or "optimal" approaches seem acceptable. The "easy" one is only 17M compressed for twbs/bootstrap, so we shouldn't get blocked on the 100M file size limit too soon. The "optimal" approach isn't actually that hard to implement either. If we are willing to pay for extra sorts when we serialize/deserialize, we could just serialize the arrays in address-sorted order, without needing to peek into Graph's JSON representation.
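To make that concrete, a minimal sketch of the sorted-order idea (helper names are illustrative):

```
// Sketch (illustration only): serialize scores in address-sorted order, and
// rebuild the Map on deserialization by re-sorting the graph's node addresses.
function scoresToJSON(scores: Map<NodeAddressT, number>): number[] {
  const addresses = Array.from(scores.keys()).sort();
  return addresses.map((a) => {
    const s = scores.get(a);
    if (s == null) throw new Error("missing score");
    return s;
  });
}

function scoresFromJSON(graph: Graph, json: number[]): Map<NodeAddressT, number> {
  const addresses = Array.from(graph.nodes()).sort();
  return new Map(addresses.map((a, i) => [a, json[i]]));
}
```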

Since "optimal" is an interface-compatible improvement to the serialization strategy for the data pipeline created in "easy", my inclination right now is to implement the easy approach. First I'll have pagerank return the new data structure rather than the decomposition, and update downstream consumers to compute node-wise decomposition when it's actually needed for the UI. Then draw the rest of the owl as described in #967 and #935. I can add a TODO to improve the serialization if/when we want to collect the extra 1.7x improvement in serialized size.

What do you think?

wchargin commented 5 years ago

For the graph in particular, we actually store the ["array", "of", "parts"] without null separators.

Right; the relational view stores raw addresses.

I'm disinclined to make the switch to SQLite right now, since it complicates the contract between frontend and backend, and requires adding a new dependency, more research, etc. I'm more interested in making a reasonable and relatively easy choice here that is good enough to unblock #935.

Sounds entirely reasonable to me. I don’t see how it “complicates the contract between the frontend and backend”, but everything else makes sense and the proposed path is unobjectionable to me.

Since "optimal" is an interface-compatible improvement

Again, it’s code-compatible, but data-incompatible, so this is fine with me only because we don’t have any public API guarantees.

What do you think?

Sounds great; go for it.

teamdandelion commented 5 years ago

In https://github.com/sourcecred/sourcecred/pull/1008, I add the WeightedGraph type, which wraps together a lot of the relevant information for running PageRank:

export type WeightedGraph = {|
  +graph: Graph,
  +edgeWeights: Map<EdgeAddressT, EdgeWeight>,
  +nodeTotalOutWeights: Map<NodeAddressT, number>,
  +syntheticLoopWeight: number,
|};

Then, in a comment, I proposed adding ScoredWeightedGraph (name pending improvement) with a signature like this:

export type ScoredWeightedGraph = {|
  +weightedGraph: WeightedGraph,
  +scores: NodeScore,
|}

In general, the scores are almost meaningless without the associated WeightedGraph, so in practice wherever we pass scores, we tend to pass the Graph or WeightedGraph as a sibling. Explicitly packaging them together in that type saves the trouble of passing two variables around. It also would allow us to enforce at the type level that you cannot mix and match the scores from one call to PageRank with a different graph. (By exporting the ScoredWeightedGraph as an opaque subtype of a structurally identical type.)
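For reference, a minimal sketch of that opaque-subtype trick in Flow (module and function names are illustrative):

```
// scoredWeightedGraph.js (sketch). Outside this module, ScoredWeightedGraph is
// readable through the structural supertype, but can't be constructed by hand,
// so callers can't pair scores from one PageRank run with an unrelated graph.
export opaque type ScoredWeightedGraph: {|
  +weightedGraph: WeightedGraph,
  +scores: NodeScore,
|} = {|
  +weightedGraph: WeightedGraph,
  +scores: NodeScore,
|};

// Hypothetical constructor: in practice, only the PageRank runner in this
// module would produce a ScoredWeightedGraph.
export function scoreWeightedGraph(
  weightedGraph: WeightedGraph,
  scores: NodeScore
): ScoredWeightedGraph {
  return {weightedGraph, scores};
}
```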

If we create a ScoredWeightedGraph, we could actually put the syntheticLoopWeight on the ScoredWeightedGraph, which accurately reflects the fact that for a given WeightedGraph, you could run PageRank on it with different syntheticLoopWeights and get different ScoredWeightedGraphs. (In practice, we never change the syntheticLoopWeight, so it's not a very weighty decision.)

As I mentioned in the comment, the disadvantage with this nested data structure is that it's somewhat ungainly to work with:

The disadvantage is that typing scoredWeightedGraph.weightedGraph.graph is fairly tedious. (Or swg.weightedGraph.graph in any case.) Maybe swg.wg.graph?

As I was falling asleep last night, I realized that one way around this would be to make the WeightedGraph and ScoredWeightedGraph into classes. Here's a sketch of how this would work:

export interface GraphAccessor{
  hasNode(a: NodeAddressT): boolean;
  nodes(options?: {|+prefix: NodeAddressT|}): Iterator<NodeAddressT>;
  hasEdge(address: EdgeAddressT): boolean;
  edge(address: EdgeAddressT): ?Edge;
  edges(options?: EdgesOptions): Iterator<Edge>;
  neighbors(node: NodeAddressT, options: NeighborsOptions): Iterator<Neighbor>;
}

export class Graph implements GraphAccessor{
  // same implementation as currently
}

export interface WeightedGraphAccessor extends GraphAccessor {
  allConnections(): NodeToConnections;
  nodeConnections(n: NodeAddressT): Connection[];
}

export class WeightedGraph implements WeightedGraphAccessor{
  _graph: Graph;
  _edgeWeights: Map<EdgeAddressT, EdgeWeight>;
  _syntheticLoopWeight: number;
  _nodeTotalOutWeight: Map<NodeAddressT, number>

  constructor(graph: Graph, edgeWeights: Map<EdgeAddressT, EdgeWeight>, syntheticLoopWeight: number) {
    this._graph = graph;
    this._edgeWeights = edgeWeights;
    this._syntheticLoopWeight = syntheticLoopWeight;
    // validate that there is a bijection between EdgeWeights and edges
    // compute _nodeTotalOutWeight
  }

  hasNode(a: NodeAddressT): boolean {
    return this._graph.hasNode(a);
  }

  edgeWeight(e: EdgeAddressT): EdgeWeight {
    return NullUtil.get(this._edgeWeights.get(e));
  }

  toJSON(): WeightedGraphJSON { }

  //...
}

export interface ScoredGraphAccessor extends WeightedGraphAccessor {
  score(n: NodeAddressT): number;
  scoredConnections(n: NodeAddressT): ScoredConnection[];
}

export class ScoredWeightedGraph implements ScoredGraphAccessor {
  // ...
}

I believe this has some very nice properties as compared to packaging the weighted graph and scored graph as structs:

teamdandelion commented 5 years ago

After another night thinking on this, I intend to make some changes:

  1. I think that having the ReadOnlyGraph interface is unnecessary. What it offers us is flexibility in cases where we aren't sure whether we'll get a Graph or a WeightedGraph or ScoredGraph, but in practice I don't think we will ever need that flexibility; clients can just require the precise kind of Graph that's suitable.

  2. We offer a richer API on WeightedGraph. E.g., we can offer the following APIs:

    export type WeightedEdge = {|+edge: Edge, +weights: EdgeWeight|};
    export class WeightedGraph {
      // ...
      weightedEdge(address: EdgeAddressT): ?WeightedEdge;
      addWeightedEdge(weightedEdge: WeightedEdge): this;
      neighbors(
        node: NodeAddressT,
        options: NeighborsOptions
      ): Iterator<{|+node: NodeAddressT, +weightedEdge: WeightedEdge|}>;
    }

Once we get around to creating a ScoredGraph, we can enhance neighbors even further, so that it gives the score decomposition information (i.e., for each neighbor, how much of a node's score came from that neighbor). I believe that will be a much more consistent and familiar API for external callers than providing the raw Markov chain connections (conservation of concepts: neighbors is already a familiar concept to API consumers). (One controversial "feature" of this API is that it would likely hide the synthetic loops, since a synthetic loop is not a neighbor. Perhaps we could still make that information retrievable via another method.)

Once we have a class that manages the ScoredGraph, that also becomes a place to put some extra logic around re-running PageRank. For example, we've long discussed that when we change the weights on a graph, we should be able to "resume" PageRank from its previous convergence, rather than re-running from scratch. If we have a ScoredGraph, then perhaps that can be the place where this behavior lives. Except, it cuts across the abstractions a bit inelegantly, because the weights are actually the purview of the WeightedGraph. So maybe re-weighting on the ScoredGraph mutates its contained WeightedGraph. Or maybe we could combine the WeightedGraph and ScoredGraph into a single abstraction, perhaps a PagerankGraph?

teamdandelion commented 5 years ago

Having slept on this another night, I'm quite intrigued by the idea of a PagerankGraph, with an interface like the following. (Note that some liberties were taken, e.g. exporting types right where they're used, so it doesn't actually pass Flow.)

export type {EdgeWeight} from "./attribution/graphToMarkovChain";
export type EdgeEvaluator = (Edge) => EdgeWeight;

export type ScoreNormalizationStrategy = ScoreByConstantTotal;
// Score so that the sum of all nodes matching `nodePrefix` is equal to `total`
export type ScoreByConstantTotal = {|
  +type: "SCORE_BY_CONSTANT_TOTAL",
  +total: number,
  +nodePrefix: NodeAddressT,
|};

/**
 * A PagerankGraph is a wrapper around a Graph that makes it convenient to
 * compute Pagerank scores on that graph. It supports stateful operations like
 * changing the weighting of the graph, and re-running PageRank using the
 * previous distribution as a starting point. It also supports decomposing a
 * node's score in terms of its neighbors, and serializing the weighted and
 * scored state.
 *
 * By default, PageRank scores represent a distribution over the nodes, i.e.
 * the scores sum to `1`. Optionally, you may provide a
 * ScoreNormalizationStrategy, which specifies a different way to normalize the
 * scores. For example, if you want the sum of all nodes matching a given
 * prefix to sum to 1000, you can specify that via a
 * ScoreNormalizationStrategy.
 */
export class PagerankGraph {
  constructor(
    // The Graph backing the PagerankGraph. Mutating the Graph invalidates
    // the PagerankGraph (can ensure this either by putting a .seal() method
    // on Graph, or exposing the modification count, or by having PagerankGraph
    // create a copy on construction).
    graph: Graph,
    scoreNormalizationStrategy: ?ScoreNormalizationStrategy,
    // The synthetic loop weight; uses a sane default if not provided
    syntheticLoopWeight: ?number,
    // The scores; defaults to uniform distribution if not provided
    scores: ?Map<NodeAddressT, number>,
    // The edge weights; defaults to {toWeight: 1, froWeight:1} for each edge if
    // not provided
    edgeWeights: ?Map<EdgeAddressT, EdgeWeight>
  ): void {
    throw new Error("Not yet implemented");
  }

  hasNode(a: NodeAddressT): boolean {
    throw new Error("Not implemented");
  }

  export type ScoredNode = {|
    +score: number,
    +node: NodeAddressT,
  |};

  *nodes(options?: {|+prefix: NodeAddressT|}): Iterator<ScoredNode> {
    throw new Error("Not implemented");
  }

  hasEdge(address: EdgeAddressT): boolean {
    throw new Error("Not implemented");
  }

  export type WeightedEdge = {|+edge: Edge, +weight: EdgeWeight|};
  edge(address: EdgeAddressT): ?WeightedEdge {
    throw new Error("Not implemented");
  }

  *edges(options?: EdgesOptions): Iterator<WeightedEdge> {
    throw new Error("Not implemented");
  }

  export type ScoredNeighbor = {|
    // Node adjacent to the target
    +node: NodeAddressT,
    // Edge connecting node to the target
    +edge: Edge,
    // Weight of `edge`
    +weight: EdgeWeight,
    // How much score (in absolute terms) was contributed
    // to target from node via `edge`
    +scoreContribution: number,
  |};

  *neighbors(
    node: NodeAddressT,
    options: NeighborsOptions
  ): Iterator<ScoredNeighbor> {
    // Note: if the graph hasn't been iterated to convergence, then the scores
    // may be inconsistent.
    throw new Error("Not implemented");
  }

  /**
   * Reweight the graph, using the given edge evaluator.
   * Note that after re-weighting the graph, you probably also want to
   * rerun PageRank.
   */
  reweight(evaluator: EdgeEvaluator): this {
    throw new Error("Not implemented");
  }

  export type ConvergenceOptions = {|
    +maxIterations: number,
    +convergenceThreshold: number,
  |};

  export type ConvergenceReport = {|
    +iterations: number,
    +finalDelta: number,
    +timeElapsedMs: number,
  |};

  /**
   * Run PageRank until the scores converge to within
   * `options.convergenceThreshold`, or until
   * `options.maxIterations` is exceeded.
   */
  runPagerank(options: ConvergenceOptions): ConvergenceReport {
    throw new Error("Not implemented");
  }

  syntheticLoopWeight(): number {
    throw new Error("Not implemented");
  }

  equals(that: PagerankGraph): boolean {
    throw new Error("Not implemented");
  }

  toJSON(): PagerankGraphJSON {
    throw new Error("Not implemented");
  }

  static fromJSON(json: PagerankGraphJSON): PagerankGraph {
    throw new Error("Not implemented");
  }
}
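To give a sense of how this would read at a call site, here is a hypothetical usage sketch (the graph, evaluator, node, options, and thresholds are all placeholders):

```
// Hypothetical usage of the proposed PagerankGraph API (values illustrative).
const prg = new PagerankGraph(graph); // defaults for normalization, loop weight, etc.
prg.reweight(myEdgeEvaluator); // assigns an EdgeWeight to every edge
const report = prg.runPagerank({maxIterations: 255, convergenceThreshold: 1e-7});
console.log(`converged after ${report.iterations} iterations`);
for (const {node, scoreContribution} of prg.neighbors(someNode, neighborsOptions)) {
  // the same decomposition info that pagerankNodeDecomposition exposes today
  console.log(node, scoreContribution);
}
```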

I believe this will effect a substantial overall improvement to SourceCred's core API. The "core" of SourceCred is running PageRank on weighted graphs. We've invested a lot of care and attention into building a robust Graph class that can serve as the basis for running PageRank. However, running PageRank, and then interpreting the results (i.e. score decomposition), is scattered across a number of other modules and ad-hoc data structures; API consumers need to learn new concepts like markov chain connections, and there is no first-class support for serializing the results of running PageRank, or for re-weighting and re-running the graph in an efficient fashion.

Implementing this class--or something like it--will give PageRank the first-class API support that it deserves, and make the core SourceCred PageRank infrastructure more accessible, both to new contributors to the codebase and to others that would like to re-use it for other purposes. Observe, for example, that PagerankGraph.neighbors returns the same information as is currently exposed via pagerankNodeDecomposition.decompose, except using the same conceptual building blocks as core.Graph instead of an entirely new set of data primitives.

teamdandelion commented 5 years ago

Friendly ping @wchargin

wchargin commented 5 years ago

I’ve seen a ton of notifications on my timeline and frankly have been a bit overwhelmed—is this a good place to start?

teamdandelion commented 5 years ago

Ah, yes, I can see that the notification blizzard is not very actionable. I'll take some time to clean this up and write a design doc for the PageRankGraph, along with alternatives considered. I'll ping you once that doc is ready for review.

wchargin commented 5 years ago

Great; I look forward to it.