Closed teamdandelion closed 4 years ago
Thanks for writing this up! Quick thoughts, will re-read later:
Addresses are often quite long
Yes: note that we are talking about JSON-stringified addresses, which
render NUL characters as \u0000
(six bytes).
The average length of a JSON-stringified address in the SourceCred graph (including the quotes) is 102 bytes:
$ gzip -dc /tmp/sourcecred/data/sourcecred/sourcecred/github/view.json.gz |
> jq '.[1][] | keys | .[]' | awk '{ x += length($0) } END { print x / NR }'
102.385
so this expression is likely dominated by
n*A + e*8A
It would be great to do some quick empirical analysis of decompositions of various sizes. Could you please post the JSON files for, say:
Then you or I can use some simple jq
queries to investigate exactly
where the bytes stack up. (Doing this on the post-gzip view is less well
defined.)
For instance, while addresses may be about 100 bytes each, floats tend
to be about 20 bytes each when JSON-stringified, and likely compress
more poorly: address formats are extremely regular. (All GitHub
addresses share a common 30-byte prefix, and all edges for a given
repository have an extra ${owner}/${name}
in every address. There’s
little entropy.)
It means that the serialized format becomes a part of Graph's public API. Right now this is not the case (
GraphJSON
is an opaque type).
As soon as we have such a thing as a public API, the serialized format
will be part of it. The type may be opaque, but there is a contract
that data serialized with toJSON
will be readable with fromJSON
.
Modify Graph to allow payloads.
As you probably suspect, I really don’t want to do this.
Another way to achieve something similarly general without polluting
Graph
would be to have a dedicated compactGraph
module whose sole
purpose is to define a serialization format for a graph with payloads.
You avoid parameterizing the graph and dealing with the issues that
arise there (what type parameters do we use on plugin adapters? how
about merge
?), and you only have to deal with the extra complexity at
the serialization boundaries.
Comparison of Approaches
I’ll add one:
\5. Store the data in an SQLite database
With no indices, SQLite databases tend to be pretty space-efficient. They don’t have the gross byte inefficiencies of JSON (floats are actually stored in just 8 bytes!). We can also gzip them. See:
This simple schema should(?) suffice:
CREATE TABLE params (
zero INTEGER NOT NULL PRIMARY KEY, -- singleton table: value is 0
self_loop_weight FLOAT NOT NULL
);
CREATE TABLE nodes (
rowid INTEGER NOT NULL PRIMARY KEY,
address BLOB UNIQUE NOT NULL,
score FLOAT NOT NULL
);
CREATE TABLE edges (
rowid INTEGER NOT NULL PRIMARY KEY,
address BLOB NOT NULL,
src INTEGER NOT NULL,
dst INTEGER NOT NULL,
weight FLOAT NOT NULL,
FOREIGN KEY(src) REFERENCES nodes(id),
FOREIGN KEY(dst) REFERENCES nodes(id)
);
To do this in production, we’d want to look into SQLite-in-browser. But it should be easy to prototype this on the server to see what the space looks like. Plus, SQLite-in-browser is a reasonable thing to look into anyway.
Here are the results of a more detailed analysis. The tool used for reproducing these analyses can be found in the 1004-investigation branch. (Commit f2d39673c76caecb9eebbcb78f7e689637b01b91)
In each case, we show the sizes of each data structure (compressed and uncompressed), as computed by actually writing the data to disk. Also, we have a summary table showing the overall performance of three approaches described above:
3.0M edgeAddressToEdgeWeightObject.json
172K edgeAddressToEdgeWeightObject.json.gz
2.9M graph.json
276K graph.json.gz
640K nodeAddressToScoreObject.json
104K nodeAddressToScoreObject.json.gz
19M pagerankNodeDecomposition.json
956K pagerankNodeDecomposition.json.gz
408K unsortedEdgeWeightArray.json
12K unsortedEdgeWeightArray.json.gz
104K unsortedNodeScoreArray.json
48K unsortedNodeScoreArray.json.gz
Name | Uncompressed | Compressed | Uncompressed/Optimal | Compressed/Optimal |
---|---|---|---|---|
pND | 19.1M | 1.0M | 5.48x | 3.08x |
easy | 6.7M | 0.5M | 1.92x | 1.71x |
optimal | 3.5M | 0.3M | 1.00x | 1.00x |
❯ ls -sh $SOURCECRED_DIRECTORY/data/ipfs/go-ipfs | grep json
48M edgeAddressToEdgeWeightObject.json
2.2M edgeAddressToEdgeWeightObject.json.gz
43M graph.json
3.3M graph.json.gz
6.4M nodeAddressToScoreObject.json
992K nodeAddressToScoreObject.json.gz
261M pagerankNodeDecomposition.json
13M pagerankNodeDecomposition.json.gz
5.7M unsortedEdgeWeightArray.json
40K unsortedEdgeWeightArray.json.gz
1.1M unsortedNodeScoreArray.json
392K unsortedNodeScoreArray.json.gz
Name | Uncompressed | Compressed | Uncompressed/Optimal | Compressed/Optimal |
---|---|---|---|---|
pND | 272.9M | 13.2M | 5.25x | 3.48x |
easy | 101.5M | 6.6M | 1.95x | 1.73x |
optimal | 52.0M | 3.8M | 1.00x | 1.00x |
❯ ls -sh $SOURCECRED_DIRECTORY/data/twbs/bootstrap | grep json
110M edgeAddressToEdgeWeightObject.json
5.2M edgeAddressToEdgeWeightObject.json.gz
104M graph.json
8.2M graph.json.gz
21M nodeAddressToScoreObject.json
2.9M nodeAddressToScoreObject.json.gz
635M pagerankNodeDecomposition.json
35M pagerankNodeDecomposition.json.gz
14M unsortedEdgeWeightArray.json
80K unsortedEdgeWeightArray.json.gz
3.5M unsortedNodeScoreArray.json
1.3M unsortedNodeScoreArray.json.gz
Name | Uncompressed | Compressed | Uncompressed/Optimal | Compressed/Optimal |
---|---|---|---|---|
pND | 665.0M | 35.7M | 5.28x | 3.61x |
easy | 244.3M | 16.9M | 1.94x | 1.71x |
optimal | 126.0M | 9.9M | 1.00x | 1.00x |
Not visible in these results, but definitely worth noting: computing the pagerankNodeDecomposition is very expensive, both in terms of compute and memory.
For twbs/bootstrap, running analyze
takes 124s if I compute the pND, and 70s if I leave it out (still running PageRank, etc).
Also, if we compute the pND it blows the standard node heap size (and needs to be run with --max-old-space-size); if we leave the pND out it's fine.
Yes: note that we are talking about JSON-stringified addresses, which render NUL characters as \u0000 (six bytes).
For the graph in particular, we actually store the ["array", "of", "parts"] without null separators.
It would be great to do some quick empirical analysis of decompositions of various sizes. Could you please post the JSON files for, say:
Sizes posted in the comment above; you can also get the tool from the linked commit and run it yourself to do jq investigations.
Use a SQLite database
Your schema looks pretty neat, although I'll note that the edge addresses should be unique, and the edge weight should be toWeight and froWeight separately.
I'm disinclined to make the switch to SQLite right now, since it complicates the contract between frontend and backend, and requires adding a new dependency, more research, etc. I'm more interested in making a reasonable and relatively easy choice here that is good enough to unblock #935.
With that said, either the "easy" or "optimal" approaches seem acceptable. The "easy" one is only 17M compressed for twbs/bootstrap, so we shouldn't get blocked on the 100M file size limit too soon. The "optimal" approach isn't actually that hard to implement either. If we are willing to pay for extra sorts when we serialize/deserialize, we could just serialize the arrays in address-sorted order, without needing to peek into Graph's JSON representation.
Since "optimal" is an interface-compatible improvement to the serialization strategy for the data pipeline created in "easy", my inclination right now is to implement the easy approach. First I'll have pagerank
return the new data structure rather than the decomposition, and update downstream consumers to compute node-wise decomposition when it's actually needed for the UI. Then draw the rest of the owl as described in #967 and #935. I can add a TODO to improve the serialization if/when we want to collect the extra 1.7x improvement in serialized size.
What do you think?
For the graph in particular, we actually store the ["array", "of", "parts"] without null separators.
Right; the relational view stores raw addresses.
I'm disinclined to make the switch to SQLite right now, since it complicates the contract between frontend and backend, and requires adding a new dependency, more research, etc. I'm more interested in making a reasonable and relatively easy choice here that is good enough to unblock #935.
Sounds entirely reasonable to me. I don’t see how it “complicates the contract between the frontend and backend”, but everything else makes sense and the proposed path is unobjectionable to me.
Since "optimal" is an interface-compatible improvement
Again, it’s code-compatible, but data-incompatible, so this is fine with me only because we don’t have any public API guarantees.
What do you think?
Sounds great; go for it.
In https://github.com/sourcecred/sourcecred/pull/1008, I add the WeightedGraph
type, which wraps together a lot of the relevant information for running PageRank:
export type WeightedGraph = {|
+graph: Graph,
+edgeWeights: Map<EdgeAddressT, EdgeWeight>,
+nodeTotalOutWeights: Map<NodeAddressT, number>,
+syntheticLoopWeight: number,
|};
Then, in a comment, I proposed adding ScoredWeightedGraph
(name pending improvement) with a signature like this:
export type ScoredWeightedGraph = {|
+weightedGraph: WeightedGraph,
+scores: NodeScore,
|}
In general, the scores are almost meaningless without the associated
WeightedGraph
, so in practice wherever we pass scores, we tend to pass the Graph or WeightedGraph as a sibling. Explicitly packaging them together in that type saves the trouble of passing two variables around. It also would allow us to enforce at the type level that you cannot mix and match the scores from one call to PageRank with a different graph. (By exporting the ScoredWeightedGraph as an opaque subtype of a structurally identical type.)
If we create a ScoredWeightedGraph
, we could actually put the syntheticLoopWeight on the scoredWeightedGraph, which accurately reflects the fact that for a given WeightedGraph, you could run Pagerank on it with different syntheticLoopWeights, and get different ScoredWeightedGraph. (In practice, we never change the synethicLoopWeight, so it's not a very weighty decision.
As I mentioned in the comment, the disadvantage with this nested data structure is that it's somewhat ungainly to work with:
The disadvantage is that typing
scoredWeightedGraph.weightedGraph.graph
is fairly tedious. (Orswg.weightedGraph.graph
in any case.) Maybeswg.wg.graph
?
As I was falling asleep last night, I realized that one way around this would be to make the WeightedGraph and ScoredWeightedGraph into classes. Here's a sketch of how this would work:
export interface GraphAccessor{
hasNode(a: NodeAddressT): boolean;
nodes(options?: {|+prefix: NodeAddressT|}): Iterator<NodeAddressT>;
hasEdge(address: EdgeAddressT): boolean;
edge(address: EdgeAddressT): ?Edge;
edges(options?: EdgesOptions): Iterator<Edge>;
neighbors(node: NodeAddressT, options: NeighborsOptions): Iterator<Neighbor>;
}
export class Graph implements GraphAccessor{
// same implementation as currently
}
export interface WeightedGraphAccessor extends GraphAccessor {
allConnections(): NodeToConnections;
nodeConnections(n: NodeAddressT): Connection[];
}
export class WeightedGraph implements WeightedGraphAccessor{
_graph: Graph;
_edgeWeights: Map<EdgeAddressT, EdgeWeight>;
_syntheticLoopWeight: number;
_nodeTotalOutWeight: Map<NodeAddressT, number>
constructor(graph: Graph, edgeWeights: Map<EdgeAddressT, EdgeWeight>, syntheticLoopWeight: number) {
this._graph = graph;
this._edgeWeights = edgeWeights;
this._syntheticLoopWeight = syntheticLoopWeight;
// validate that there is a bijection between EdgeWeights and edges
// compute _nodeTotalOutWeight
}
hasNode(a: NodeAddressT): boolean {
return this._graph.hasNode(a);
}
edgeWeight(e: EdgeAddressT): EdgeWeight {
return NullUtil.get(this.edgeWeights.get(e));
}
toJSON(): WeightedGraphJSON { }
//...
}
export interface ScoredGraphAccessor extends WeightedGraphAccessor {
score(n: NodeAddressT): number;
scoredConnections(n: NodeAddressT): ScoredConnection[];
}
export class ScoredWeightedGraph implements ScoredGraphAccessor {
// ...
}
I believe this has some very nice properties as compared to packaging the weighted graph and scored graph as structs:
createConnections
to the class rather than having it as a free-floating functionAfter another night thinking on this, I intend to make some changes:
I think that having the ReadOnlyGraph
interface is unnecessary. What it offers us is flexibility in cases where we aren't sure whether we'll get a Graph
or a WeightedGraph
or ScoredGraph
, but in practice I don't think we will ever need that flexibility; clients can just require the precise kind of Graph that's suitable.
We offer a richer API on WeightedGraph
. E.g., we can offer the following APIs:
export type WeightedEdge = {|+edge: Edge, +weights: EdgeWeights|};
export class WeightedGraph {
...
weightedEdge(EdgeAddressT): ?WeightedEdge;
addWeightedEdge(WeightedEdge);
neighbors(NodeAddressT, NeighborOptions): Iterator<{+NodeAddressT, +WeightedEdge|}>;
Once we get around to creating a ScoredGraph
, we can enhance neighbors
even further, so that it gives the score decomposition information (i.e., for each neighbor, how much of node's score came from that neighbor). I believe that will be a much more consistent and familiar API to external callers than providing the raw Markov Chain connections (conservation of concepts, neighbors
is already a familiar concept to API consumers). (One controversial "feature" of this API is that it will likely hide the synthetic loops, since those are not a neighbor. We can make it still retrievable via another method, perhaps?)
Once we have a class that manages the ScoredGraph
, that also becomes a place to put some extra logic around re-running PageRank. For example, we've long discussed that when we change the weights on a graph, we should be able to "resume" PageRank from its previous convergence, rather than re-running from scratch. If we have a ScoredGraph
, then perhaps that can be the place where this behavior lives. Except, it cuts across the abstractions a bit inelegantly, because the weights are actually the purview of the WeightedGraph
. So maybe re-weighting on the ScoredGraph
mutates its contained WeightedGraph
. Or maybe we could combine the WeightedGraph
and ScoredGraph
into a single abstraction, perhaps a PagerankGraph
?
Having slept on this another night,
I'm quite intrigued by the idea of PagerankGraph
, with an interface like
follows. (Note some liberties were taking, e.g. exporting types right where they're
used, so it doesn't actually pass flow.)
export type {EdgeWeight} from "./attribution/graphToMarkovChain";
export type EdgeEvaluator = (Edge) => EdgeWeight;
export type ScoreNormalizationStrategy = ScoreByConstantTotal;
// Score so that the sum of all nodes matching `nodePrefix` is equal to `total`
export type ScoreByConstantTotal = {|
+type: "SCORE_BY_CONSTANT_TOTAL",
+total: number,
+nodePrefix: NodeAddressT,
|};
/**
* A PagerankGraph is a wrapper around a Graph that makes it convenient to
* compute Pagerank scores on that graph. It supports stateful operations like
* changing the weighting of the graph, and re-running PageRank using the
* previous distribution as a starting point. It also supports decomposing a
* node's score in terms of its neighbors, and serializing the weighted and
* scored state.
*
* By default, PageRank scores represent a distribution over the nodes, i.e.
* the scores sum to `1`. Optionally, you may provide a
* ScoreNormalizationStrategy, which specifies a different way to normalize the
* scores. For example, if you want the sum of all nodes matching a given
* prefix to sum to 1000, you can specify that via a
* ScoreNormalizationStrategy.
*/
export class PagerankGraph {
constructor(
// The Graph backing the PagerankGraph. Mutating the Graph invalidates
// the PagerankGraph (can ensure this either by putting a .seal() method
// on Graph, or exposing the modification count, or by having PagerankGraph
// create a copy on construction).
graph: Graph,
scoreNormalizationStrategy: ?ScoreNormalizationStrategy,
// The synthetic loop weight; uses a sane default if not provided
syntheticLoopWeight: ?number,
// The scores; defaults to uniform distribution if not provided
scores: ?Map<NodeAddressT, number>,
// The edge weights; defaults to {toWeight: 1, froWeight:1} for each edge if
// not provided
edgeWeights: ?Map<EdgeAddressT, EdgeWeight>
): void {
throw new Error("Not yet implemented");
}
hasNode(a: NodeAddressT): boolean {
throw new Error("Not implemented");
}
export type ScoredNode = {|
+score: number,
+node: NodeAddressT,
|};
*nodes(options?: {|+prefix: NodeAddressT|}): Iterator<ScoredNode> {
throw new Error("Not implemented");
}
hasEdge(address: EdgeAddressT): boolean {
throw new Error("Not implemented");
}
export type WeightedEdge = {|+edge: Edge, +weight: EdgeWeight|};
edge(address: EdgeAddressT): ?WeightedEdge {
throw new Error("Not implemented");
}
*edges(options?: EdgesOptions): Iterator<WeightedEdge> {
throw new Error("Not implemented");
}
export type ScoredNeighbor = {|
// Node adjacent to the target
+node: NodeAddressT,
// Edge connecting node to the target
+edge: Edge,
// Weight of `edge`
+weight: EdgeWeight,
// How much score (in absolute terms) was contributed
// to target from node via `edge`
+scoreContribution: number,
|};
*neighbors(
node: NodeAddressT,
options: NeighborsOptions
): Iterator<ScoredNeighbor> {
// Note: if the graph hasn't been iterated to convergence, then the scores
// may be inconsistent.
throw new Error("Not implemented");
}
/**
* Reweight the graph, using the given edge evaluator.
* Note that after re-weighting the graph, you probably also want to
* rerun PageRank.
*/
reweight(evaluator: EdgeEvaluator): this {
throw new Error("Not implemented");
}
export type ConvergenceOptions = {|
+maxIterations: number,
+convergenceThreshold: number,
|};
export type ConvergenceReport = {|
+iterations: number,
+finalDelta: number,
+timeElapsedMs: number,
|};
/**
* Run PageRank until the graph is converged to within
* `options.convergenceThreshold`, or until
* `options.maxSteps` is exceeded.
*/
runPagerank(options: ConvergenceOptions): ConvergenceReport {
throw new Error("Not implemented");
}
syntheticLoopWeight(): number {
throw new Error("Not implemented");
}
equals(that: PagerankGraph): boolean {
throw new Error("Not implemented");
}
toJSON(): PagerankGraphJSON {
throw new Error("Not implemented");
}
static fromJSON(json: PagerankGraphJSON): PagerankGraph {
throw new Error("Not implemented");
}
}
I believe this will effect a substantial overall improvement to SourceCred's
core API. The "core" of SourceCred is running PageRank on weighted graphs.
We've invested a lot of care and attention into building a robust Graph
class
that can serve as the basis for running PageRank. However, running PageRank,
and then interpreting the results (i.e. score decomposition), is scattered
across a number of other modules and ad-hoc data structures; API consumers need
to learn new concepts like markov chain connections, and there is no
first-class support for serializing the results of running PageRank, or for
re-weighting and re-running the graph in an efficient fashion.
Implementing this class--or something like it--will give PageRank the
first-class API support that it deserves, and make the core SourceCred PageRank
infrastructure more accessible, both to new contributors to the codebase and to
others that would like to re-use it for other purposes. Observe, for example,
that PagerankGraph.neighbors
returns the same information as is currently
exposed via pagerankNodeDecomposition.decompose
, except using the same
conceptual building blocks as core.Graph
instead of an entirely new set of
data primitives.
Friendly ping @wchargin
I’ve seen a ton of notifications on my timeline and frankly have been a bit overwhelmed—is this a good place to start?
Ah, yes, I can see that the notification blizzard is not very actionable. I'll take some time to clean this up and write a design doc for the PageRankGraph, along with alternatives considered. I'll ping you once that doc is ready for review.
Great; I look forward to it.
Currently, we represent the results of PageRank via the
PagerankNodeDecomposition
. The NodeDecomposition is a rich data type in that it allows us to answer the following questions:Thus, the PagerankNodeDecomposition has all the data needed to run the cred analysis workflow of the explorer. (Although the explorer wants some additional plugin-specific metadata so that it can display node names prettily.)
Unfortunately, the PagerankNodeDecomposition is quite large. Now that we want to serialize these results to disk, we should find a better representation, lest we get blocked by the 100MB file size limit on GitHub pages.
Size of a PagerankNodeDecomposition
Why is the PagerankNodeDecomposition so large? Let's analyze it.
Let
n
be the number of nodes in the graph,e
be the number of edges,A
be the size of an address (for convenience we assume that node and edge addresses are the same length),F
be the size of a float, andK
be the size of a string constant ("SYNTHETIC_LOOP", "IN_EDGE", "OUT_EDGE"
).Then the size of a PagerankNodeDecomposition is:
n * (K + 3F + A) + e * (2K + 8A + 4F)
Addresses are often quite long, so this expression is likely dominated by
n*A + e*8A
(Derivation)
At a top level, we have: ``` export type PagerankNodeDecomposition = Map< NodeAddressT, {| +score: number, // Connections are sorted by `adjacencyScore` descending, // breaking ties in a deterministic (but unspecified) order. +scoredConnections: $ReadOnlyArraySize of a Graph
This is very inefficient compared to how we represent graphs.
When we store a graph, we store
nA + e(A+2F)
. We're able to get this performance because we are careful to only serialize each address once, and thereafter we reference it by index.Optimal Storage
Theoretically speaking, to encode the information in the PagerankNodeDecomposition, assuming that we already have the graph, we only need to store three kinds of information:
Thus, we should be able to store equivalent data in only
n(A+F) + e(A+4F) + F
(Note: this excludes additional display metadata for the cred explorer, i.e. we are not storing the human-readable names for each node.)
However: How to actually achieve this?
Less-Optimal But Convenient Implementation
A simple approach would be to store the extra data that we need in maps alongside the graph:
This does not leverage the integer indexing trick (except for in the graph), so we pay
F + nA + e(A+2F) + e(A+F) + n(A+F)
for a total ofF + 2nA + nF + 2eA + 3eF
or an extranA + eA
compared to optimal.Comparison of Approaches
If we use the pND, then we have to store
n+8e
addresses. If we implement a simple improvement, we store2n + 2e
addresses. If we implement the optimal approach, then we storen+e
addresses.Given this information, I see 4 major courses of action:
We write entirely new code for serializing
CredAnalysisResult
as defined above, and reimplement the integer indexing trick. This has the advantage that it is local, but duplicating the integer serialization logic (and, implicitly, coming up with a whole new serialized format for Graph) is very unappealing.We could have a method that calls toJSON on the Graph, and then iterates over the index-order nodes and edges of the serialized, using that to produce an ordered array of node scores and edge weights. Then we can do the reverse on deserialization.
This has the advantage that it involves re-implementing less logic, is straightforward to implement. However, depending on Graph's serialization format seems weird. It means that the serialized format becomes a part of Graph's public API. Right now this is not the case (
GraphJSON
is an opaque type).We could make Graph parameterizable by Node and Edge payload types. In this case, the node payload is a number, and the edge payload is an
EdgeWeight
. Then the Graph serialization logic can efficiently store the paylods. Then theCredAnalysisResult
type above could be declared as:This has the advantage that it doesn't re-implement graph's serialization, or subvert it. Also, once we have this, we may find other uses for it.
It has some disadvantages too:
Graph
class, and we're quite pleased that it's simple.toJSON/fromJSON
We can just implement the
CredAnalysisResult
with the maps alongside the graph. It's inefficient compared to optimal, but much better than using thePagerankNodeDecomposition
. And if the 100MB size limit becomes an issue, we could further navigate around that by splitting it into separate files (for each top-level field) before needing to implement any complicated or hard-to-maintain solution.