pfrazee opened this issue 8 years ago
In the first iteration (current), there's a `whois` to search feeds and a `whatis` to search blobs. Results are ranked by the number of links in matching documents that share a reference. Authorship is not yet ranked, and a single document referencing the same link N times will add N positive signals.

Regarding correctness: SSB's logs are typed & semantic, so it seems like a pretty natural choice to encode knowledge of the types and index accordingly. The question is, how much flexibility do we lose when we do that? I expect to use a scheme like...
The current pipeline:

1. log → TF-IDF processing → TF-IDF index
2. query TF-IDF index → extract links → sort links by freq of target
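A minimal sketch of the second stage, assuming the index returns matching documents that carry their extracted links (all names here are hypothetical, not the actual API):

```javascript
// Rank link targets by how many references to them appear in matching
// documents. Note: duplicates within one document each add a signal,
// which is exactly the N-positive-signals caveat described above.
function rankLinks(matches) {
  const freq = new Map()
  for (const doc of matches) {
    for (const link of doc.links) {
      freq.set(link, (freq.get(link) || 0) + 1)
    }
  }
  // sort descending by reference frequency
  return [...freq.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([link, count]) => ({ link, count }))
}
```

Deduplicating `doc.links` per document would collapse those N signals into one.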
Todo:
For semantic processing, I think the thing to do is normalize documents into a regular form for the TF-IDF index. This will enable field-boosting.
As an example, suppose we had the fields `title`, `type`, `text`, and `ref`, with the following boost ranks:

    { title: 10, type: 5, text: 1, ref: 0 }
Then let's consider this message:

    {
      type: 'post',
      text: 'hello @alice and @bob, how are you doing?',
      mentions: [{ name: 'alice', link: REF1 }, { name: 'bob', link: REF2 }]
    }
We can process this into 3 different normalized documents:

    { title: null, type: 'post', text: 'hello @alice and @bob, how are you doing?', ref: MSG_KEY }
    { title: 'alice', type: 'mention', text: 'hello @alice and @bob, how are you doing?', ref: REF1 }
    { title: 'bob', type: 'mention', text: 'hello @alice and @bob, how are you doing?', ref: REF2 }

Now, a query for 'alice' will match all 3, but `REF1` will rank much higher.
Open question: it'd be better if a query for 'alice' only hit `REF1` and `MSG_KEY`. Should the `mention` documents not include the text of the post? Would that degrade the results significantly?
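A sketch of that normalization step, using the field names from the example (hypothetical code, not an actual ssb module):

```javascript
// Expand one SSB message into normalized documents for the TF-IDF
// index: one doc for the post itself, plus one per mention link.
function normalize(msgKey, msg) {
  const docs = [{ title: null, type: msg.type, text: msg.text, ref: msgKey }]
  for (const m of msg.mentions || []) {
    docs.push({ title: m.name, type: 'mention', text: msg.text, ref: m.link })
  }
  return docs
}
```

Dropping `text: msg.text` from the mention docs would answer the open question in the restrictive direction; whether that degrades results is an empirical question.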
Another opportunity might be to use `attributes` in the normalized document, instead of `type`. This allows a list of attributes, almost like keywords, which are condensed from the document. For instance, using the same example:

    { title: 10, attributes: 5, text: 1, ref: 0 }

    { title: null, attributes: ['post', 'mentions alice', 'mentions bob'], text: 'hello @alice and @bob, how are you doing?', ref: MSG_KEY }
    { title: 'alice', attributes: 'mention', text: 'hello @alice and @bob, how are you doing?', ref: REF1 }
    { title: 'bob', attributes: 'mention', text: 'hello @alice and @bob, how are you doing?', ref: REF2 }
This way, a query of `post mentions alice` would rank `MSG_KEY` higher than `REF1`.
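A toy scorer to illustrate the boost arithmetic (made-up code; real TF-IDF weighting and tokenization are omitted):

```javascript
const boosts = { title: 10, attributes: 5, text: 1, ref: 0 }

// Score a normalized document against query terms: each term found
// in a field adds that field's boost.
function score(terms, doc) {
  let s = 0
  for (const [field, boost] of Object.entries(boosts)) {
    const value = [].concat(doc[field] || []).join(' ').toLowerCase()
    for (const t of terms) {
      if (value.includes(t)) s += boost
    }
  }
  return s
}
```

With the `attributes` variant, the MSG_KEY document matches 'post', 'mentions', and 'alice' in its attributes (3 × 5) plus 'alice' in its text (1), scoring 16, while the REF1 document only matches 'alice' in its title (10) and text (1), scoring 11.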
graph analysis...
Add a `scoreStream` for outputting the scores without sorting; use that to remove an unneeded sort.

Some signals, intuitively...
Some use-cases that I want to fulfill, maybe in disco, or in a separate tool --

(`about` messages and `mention` links.) But, this will also rely on general trust-graph knowledge.

`edit &SOMEBLOB.sha256`, where the blob ends up being a js file, should open Sublime.

Some of the stuff in http://pfraze.github.io/2015/03/26/web-of-trust-roundup.html and http://pfraze.github.io/2015/03/24/computing-trust.html relates to all this.
Some excerpts / thoughts, pulling from papers mentioned by *A survey of trust in computer science and the Semantic Web*.

On transitive trust:
> A common assumption of most authentication protocols is that trust is transitive, i.e. the assumption that (Alice trusts Bob) & (Bob trusts Cathy) ⇒ (Alice trusts Cathy). This is not generally true. We posit that transitivity may hold if certain conditions are met. We have termed this conditional transitivity and the conditions that may allow transitivity are (with reference to the example above): a) Bob explicitly communicates his trust in Cathy to Alice, as a 'recommendation'. b) Alice trusts Bob as a recommender, i.e. recommender trust exists in the system. c) Alice is allowed to make judgements about the 'quality' of Bob's recommendation (based on Alice's policies). d) Trust is not absolute, i.e. Alice may trust Cathy less than Bob does, based on Bob's recommendation.
PageRank is inferred from weblinks by interpreting them as transitive recommendation signals, in the domain of information-browsing. It would be incorrect to interpret SSB's links that way, because those links are not used to drive information-browsing, they are not used as recommendations, and they are not always transitive.
Proposes a model of Belief (in fact) statements and Trust (in user) statements. Uses real values (0-1). Rather than trying to get global knowledge of a trust graph, it uses local knowledge from neighbors. Also, it calculates trust from the perspective of an agent. Arguably, this is a generalization of PageRank.
Some good snippets:
> In this formulation, we consider a probabilistic interpretation of global belief combination. ... Imagine a random knowledge-surfer hopping from user to user in search of beliefs. At each step, the surfer probabilistically selects a neighbor to jump to according to the current user's distribution of trusts. Then, with probability equal to the current user's belief, it says "yes, I believe in the statement". Otherwise, it says "no". Further, when choosing which user to jump to, the random surfer will, with probability λi∈[0,1], ignore the trusts and instead jump directly back to the original user.
> ...
> The web of trust calculation is not susceptible to "link-spamming," a phenomenon in PageRank whereby a person may increase others' trust in him by generating hundreds of virtual personas which all trust him. In PageRank, the uniform random jump of the surfer means that each virtual persona is bestowed some small amount of PageRank, which they 'give' to the spammer, thus increasing her rank. With a web of trust, this technique gains nothing unless the user is able to convince others to trust her virtual personas, which we expect will only occur if the personas actually provide useful information.
> ...
> Kamvar et al.'s EigenTrust algorithm [21], which computes global trusts as a function of local trust values in a peer-to-peer network, is very similar to our probabilistic interpretation of trusts presented in section 4. One key difference is that we allow trusts and beliefs to vary; they are personalized for each user based on her personal trusts. In contrast, EigenTrust computes a global trust value (similar to PageRank) and emphasizes security against malicious peers who aim to disturb this calculation.
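The random knowledge-surfer quoted above can be simulated directly. This is a simplified Monte-Carlo reading of the model with a small seeded RNG for reproducibility; the graphs, weights, and λ handling are invented for illustration, not taken from the paper's code:

```javascript
// Estimate a merged belief by simulating the random knowledge-surfer:
// at each step, with probability lambda jump back to the origin,
// otherwise hop to a neighbor drawn from the current user's trust
// distribution; finally answer "yes" with probability equal to the
// landing user's belief. Returns the fraction of "yes" answers.
function surferBelief(trusts, beliefs, origin, lambda, steps, runs, seed = 1) {
  let s = seed // Lehmer LCG state; seed must be non-zero
  const rand = () => (s = (s * 48271) % 2147483647) / 2147483647
  const pickNeighbor = (user) => {
    let r = rand()
    for (const [neighbor, weight] of Object.entries(trusts[user])) {
      if ((r -= weight) <= 0) return neighbor
    }
    return user // numeric slack: stay put
  }
  let yes = 0
  for (let i = 0; i < runs; i++) {
    let user = origin
    for (let j = 0; j < steps; j++) {
      user = rand() < lambda ? origin : pickNeighbor(user)
    }
    if (rand() < beliefs[user]) yes++
  }
  return yes / runs
}
```

With λ = 1 the surfer never leaves the origin, so the estimate collapses to the origin's own belief; smaller λ blends in the beliefs of trusted neighbors.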
Both Advogato and the EigenTrust algorithm rely on a trusted seed-set to defeat malicious rings:
> A malicious collective is a group of malicious peers who know each other, who give each other high local trust values and give all other peers low local trust values in an attempt to subvert the system and gain high global trust values. We address this issue ... by having each peer place at least some trust in the peers P that are not part of a collective. Probabilistically, this is equivalent to saying that the agent that is crawling the network ... is less likely to get stuck crawling a malicious collective, because at each step, he has a certain probability of crawling to a pre-trusted peer.
Eigentrust uses "transaction ratings" as its signal. After downloading from a peer, the user rates the transaction.
> each time peer i downloads a file from peer j, it may rate the transaction as positive (1) or negative (-1). Peer i may rate a download as negative, for example, if the file downloaded is inauthentic or tampered with, or if the download is interrupted. Like in the eBay model, we may define a local trust value s_ij as the sum of the ratings of the individual transactions that peer i has downloaded from peer j.
This is a bit like reddit karma, but for download transactions instead of link submissions. The "karma" is then normalized to range 0-1 for each peer connection:
> In order to aggregate local trust values, it is necessary to normalize them in some manner. Otherwise, malicious peers can assign arbitrarily high local trust values to other malicious peers, and arbitrarily low local trust values to good peers, easily subverting the system. We define a normalized local trust value, c_ij, as follows:
>
> `c_ij = max(s_ij, 0) / Σ_j max(s_ij, 0)`
>
> This ensures that all values will be between 0 and 1.
This is pretty intuitive. Here's an example:

    Bob:     5 karma from me
    Alice:  10 karma from me
    Charlie: -1 karma from me

    Bob normalized     =  5 / (5 + 10 + 0) = 0.333333
    Alice normalized   = 10 / (5 + 10 + 0) = 0.666667
    Charlie normalized =  0 / (5 + 10 + 0) = 0
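The same normalization as code (a sketch; the function and field names are mine, not EigenTrust's):

```javascript
// EigenTrust-style normalization: clamp negative karma to zero, then
// divide each score by the sum of the clamped scores, so all values
// land in [0, 1] and sum to 1 (or are all 0 when nothing is positive).
function normalizeTrust(karma) {
  const clamped = {}
  for (const [peer, s] of Object.entries(karma)) {
    clamped[peer] = Math.max(s, 0)
  }
  const total = Object.values(clamped).reduce((a, b) => a + b, 0)
  const normalized = {}
  for (const peer of Object.keys(clamped)) {
    normalized[peer] = total > 0 ? clamped[peer] / total : 0
  }
  return normalized
}
```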
> Another work is that of Yu and Singh [8]. This model builds a social network among agents that supports participants' reputation both for expertise (providing service) and helpfulness (providing referrals). Every agent keeps a list of its neighbors, that can be changed over time, and computes the trustworthiness of other agents by updating the current values by testimonies obtained from reliable referral chains. After a bad experience with another agent every agent decreases the rating of the 'bad' agent and propagates this bad experience throughout the network so that other agents can update their ratings accordingly.
> ...
> to formulate the problem of managing trust in a decentralized information system we can partition it now, more precisely, into three subproblems that need to be studied:
>
> - Global trust model: What is the model that describes whether an agent is trustworthy? This model could be based on simple statistical methods, on experiences gained in economic and sociological sciences or on game-theoretic foundations.
> - Local algorithm to determine trust: What is the computational procedure that an agent can apply in order to determine trust under the limitations discussed above, which are the unreliability of the agents providing trust data both with respect to their trustworthiness themselves as well as their reachability over the network. To what extent does this local algorithm provide a good approximation of the global model?
> - Data and communication management: Can the local algorithm be implemented efficiently in a given data management infrastructure? In order to obtain scalable implementations the resources required by each agent must scale gracefully in the agent population size n, ideally as O(log n).
These guys handle reputation by assuming honesty, and then looking for complaints.
A good observation on what can make p2p harder than the Web:
...
Uses very simple good/bad ratings in a social graph, and then defines a function for computing reputation in a situation where no direct edge is present.
From the survey paper, a good point about locally-computed trust (TrustMail, "Trust management for a semantic Web") vs globally-computed (EigenTrust):
> In Massa and Avesani [56], the problem of controversial users (those who are both trusted and distrusted) is addressed. This work shows that the globally computed trust value (in a web of trust) for a controversial user may not be as accurate as a locally computed value due to the global disagreement on trust for that user. Golbeck and Hendler's TrustMail also performs a local computation of reputation within a web of trust.
They deepen the data model, (1) by separating beliefs into domains of knowledge, and (2) by creating different kinds of trust.
The papers often separate into three interlocking areas:
> Our major contribution to Semantic Web trust management through this work is twofold. First, we introduce a classification scheme for trust metrics along various axes and discuss advantages and drawbacks of existing approaches for Semantic Web scenarios. Hereby, we devise an advocacy for local group trust metrics, guiding us to the second part which presents Appleseed, our novel proposal for local group trust computation.
Their trust-metric classification is useful (their first contribution):
> 2.1.1. Network perspective. The first dimension influences semantics assigned to the values computed. Trust metrics may basically be subdivided into ones with global, and ones with local scope. Global trust metrics take into account all peers and trust links connecting them. Global trust ranks are assigned to an individual based upon complete trust graph information. Many global trust metrics, such as those presented in Kamvar, Schlosser and Garcia-Molina (2003), Guha (2003), and Richardson, Agrawal and Domingos (2003), borrow their ideas from the renowned PageRank algorithm Page et al. (1998) to compute web page reputation. The basic intuition behind the approach is that nodes should be ranked higher the better the rank of nodes pointing to them. Obviously, the latter approach works for trust and page reputation likewise.
>
> Trust metrics with local scope, on the other hand, take into account personal bias. Interestingly, some researchers claim that only local trust metrics are "true" trust metrics, since global ones compute overall reputation rather than personalized trust (Mui, Mohtashemi and Halberstadt, 2002). Local trust metrics take the agent for whom to compute trust as an additional input parameter and are able to operate on partial trust graph information. The rationale behind local trust metrics is that persons an agent a trusts may be completely different from the range of individuals that agent b deems trustworthy. Local trust metrics exploit structural information defined by personalized webs of trust. Hereby, the personal web of trust for individual a is given through the set of trust relationships emanating from a and passing through nodes it trusts either directly or indirectly, as well as the set of nodes reachable through these relationships. Merging all webs of trust engenders the global trust graph. Local trust metrics comprise Levien's Advogato trust metric (Levien and Aiken, 2000), metrics for modelling the Public Key Infrastructure (Beth, Borcherding and Klein, 1994; Maurer, 1996; Reiter and Stubblebine, 1997), Golbeck's metrics for Semantic Web trust (Golbeck, Parsia and Hendler, 2003), and Sun Microsystems's Poblano (Chen and Yeager, 2003). The latter work hereby strongly resembles Abdul-Rahman and Hailes (1997).
These guys advocate local computation. Their point above is that global computation can only rate reputation, not trust. Since trust is relative, that's intuitive. As I pointed out earlier, local computation is used in Advogato and EigenTrust to defeat malicious sets, and I think even PageRank (in practice at Google) uses its own trusted seed-set to do the same.
Section 2.1.2 (computation locus) characterizes distributed vs centralized computation. By their definition, we're the latter -- we collect data to a local node and run all the trust computation there.
> Partial trust graph exploration. Global metrics require a priori full knowledge of the entire trust network. Distributed metrics store trust values for all agents in the system, thus implying massive data storage demands. On the other hand, when computing trusted neighborhoods, the trust network only needs to be explored partially: originating from the trust source, one only follows those trust edges that seem promising, i.e., bearing high trust weights, and which are not too far away from the trust source. Inspection of personal, machine-readable homepages is thus performed in a just-in-time fashion. Hence, prefetching bulk trust information is not required.
Appleseed is a path-algebra approach based on real-valued trust ratings. It's presented as an improvement over Advogato's algorithm. They claim attack-resistance. They also suggest a mechanism for differentiating nodes as being trusted for their data versus trusted for their recommendations of other users.
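A heavily simplified spreading-activation sketch in the spirit of Appleseed. The real algorithm adds backward edges to the source, normalization, and other details this omits; the graph and constants here are invented:

```javascript
// Energy injected at the source flows along weighted trust edges and
// decays as it goes; each visited node keeps (1 - decay) of its inflow
// as accumulated trust rank. Flows below eps are dropped, so the loop
// terminates even on cyclic graphs.
function spread(graph, source, energy = 100, decay = 0.85, eps = 0.01) {
  const rank = {}
  let inflow = { [source]: energy }
  while (Object.keys(inflow).length > 0) {
    const next = {}
    for (const [node, e] of Object.entries(inflow)) {
      if (e < eps) continue
      rank[node] = (rank[node] || 0) + (1 - decay) * e
      const edges = graph[node] || {}
      const total = Object.values(edges).reduce((a, b) => a + b, 0)
      if (total > 0) {
        for (const [succ, w] of Object.entries(edges)) {
          next[succ] = (next[succ] || 0) + decay * e * (w / total)
        }
      }
    }
    inflow = next
  }
  return rank
}
```

Nodes reached over high-weight edges close to the source accumulate the most rank, which is the intuition behind local group trust metrics.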
> Abstract. TRELLIS provides an interactive environment that allows users to add their observations, opinions, and conclusions as they analyze information by making semantic annotations about on-line documents. TRELLIS includes a vocabulary and markup language for semantic annotations of decisions and tradeoffs, and allows users to extend this vocabulary with domain specific terms or constructs that are useful to their particular task. To date, we have used TRELLIS with a variety of scenarios to annotate tradeoffs and decisions (e.g., military planning), organize materials (e.g., search results), analyze disagreements and controversies on a topic (e.g., intelligence analysis), and handle incomplete and conflicting information (e.g., genealogy research).
/* going to add more here as I continue reading papers */
/* something else to explore is the relationship between users and authorities, when you introduce non-binary authorities. The HITS algorithm (https://en.wikipedia.org/wiki/HITS_algorithm) explores this, I think. A domain registrar or CA is an example authority -- what happens when you no longer give binary authority, but instead require consensus? */
A general search program for SSB needs to solve