pfrazee opened this issue 8 years ago
In the first iteration (current), there's a `whois` to search feeds and a `whatis` to search blobs. Results are ranked by the number of links in matching documents that share a reference. Authorship is not yet ranked, and a single document referencing the same link N times will add N positive signals.

Regarding correctness: SSB's logs are typed & semantic, so it seems like a pretty natural choice to encode knowledge of the types and index accordingly. The question is, how much flexibility do we lose when we do that? I expect to use a scheme like...
The current pipeline:

1. log → TF-IDF processing → TF-IDF index
2. query TF-IDF index → extract links → sort links by freq of target
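A minimal sketch of the second stage, assuming the index returns matching documents that carry their extracted links (all names here are hypothetical, not the actual API):

```javascript
// Rank link targets by how many references to them appear in matching
// documents. Note: duplicates within one document each add a signal,
// which is exactly the N-positive-signals caveat described above.
function rankLinks(matches) {
  const freq = new Map()
  for (const doc of matches) {
    for (const link of doc.links) {
      freq.set(link, (freq.get(link) || 0) + 1)
    }
  }
  // sort descending by reference frequency
  return [...freq.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([link, count]) => ({ link, count }))
}
```

Deduplicating `doc.links` per document would collapse those N signals into one.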
Todo:
For semantic processing, I think the thing to do is normalize documents into a regular form for the TF-IDF index. This will enable field-boosting.
As an example, suppose we had the fields `title`, `type`, `text`, and `ref`, with the following boost ranks:

    { title: 10, type: 5, text: 1, ref: 0 }
Then let's consider this message:

    {
      type: 'post',
      text: 'hello @alice and @bob, how are you doing?',
      mentions: [{ name: 'alice', link: REF1 }, { name: 'bob', link: REF2 }]
    }
We can process this into 3 different normalized documents:

    { title: null, type: 'post', text: 'hello @alice and @bob, how are you doing?', ref: MSG_KEY }
    { title: 'alice', type: 'mention', text: 'hello @alice and @bob, how are you doing?', ref: REF1 }
    { title: 'bob', type: 'mention', text: 'hello @alice and @bob, how are you doing?', ref: REF2 }

Now, a query for 'alice' will match all 3, but `REF1` will rank much higher.
Open question: it'd be better if a query for 'alice' only hit `REF1` and `MSG_KEY`. Should the `mention` documents not include the text of the post? Would that degrade the results significantly?
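A sketch of that normalization step, using the field names from the example (hypothetical code, not an actual ssb module):

```javascript
// Expand one SSB message into normalized documents for the TF-IDF
// index: one doc for the post itself, plus one per mention link.
function normalize(msgKey, msg) {
  const docs = [{ title: null, type: msg.type, text: msg.text, ref: msgKey }]
  for (const m of msg.mentions || []) {
    docs.push({ title: m.name, type: 'mention', text: msg.text, ref: m.link })
  }
  return docs
}
```

Dropping `text: msg.text` from the mention docs would answer the open question in the restrictive direction; whether that degrades results is an empirical question.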
Another opportunity might be to use `attributes` in the normalized document, instead of `type`. This allows a list of attributes, almost like keywords, which are condensed from the document. For instance, using the same example:

    { title: 10, attributes: 5, text: 1, ref: 0 }

    { title: null, attributes: ['post', 'mentions alice', 'mentions bob'], text: 'hello @alice and @bob, how are you doing?', ref: MSG_KEY }
    { title: 'alice', attributes: 'mention', text: 'hello @alice and @bob, how are you doing?', ref: REF1 }
    { title: 'bob', attributes: 'mention', text: 'hello @alice and @bob, how are you doing?', ref: REF2 }
This way, a query of `post mentions alice` would rank `MSG_KEY` higher than `REF1`.
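A toy scorer to illustrate the boost arithmetic (made-up code; real TF-IDF weighting and tokenization are omitted):

```javascript
const boosts = { title: 10, attributes: 5, text: 1, ref: 0 }

// Score a normalized document against query terms: each term found
// in a field adds that field's boost.
function score(terms, doc) {
  let s = 0
  for (const [field, boost] of Object.entries(boosts)) {
    const value = [].concat(doc[field] || []).join(' ').toLowerCase()
    for (const t of terms) {
      if (value.includes(t)) s += boost
    }
  }
  return s
}
```

With the `attributes` variant, the MSG_KEY document matches 'post', 'mentions', and 'alice' in its attributes (3 × 5) plus 'alice' in its text (1), scoring 16, while the REF1 document only matches 'alice' in its title (10) and text (1), scoring 11.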
graph analysis...
Add a `scoreStream` for outputting the scores without sorting; use that to remove an unneeded sort.

Some signals, intuitively...
Some use-cases that I want to fulfill, maybe in disco, or in a separate tool --

(`about` messages and `mention` links.) But, this will also rely on general trust-graph knowledge.

`edit &SOMEBLOB.sha256`, where the blob ends up being a js file, should open Sublime.

Some of the stuff in http://pfraze.github.io/2015/03/26/web-of-trust-roundup.html and http://pfraze.github.io/2015/03/24/computing-trust.html relates to all this.
Some excerpts / thoughts, pulling from papers mentioned by *A survey of trust in computer science and the Semantic Web*.

On transitive trust:
> A common assumption of most authentication protocols is that trust is transitive, i.e. the assumption that (Alice trusts Bob) & (Bob trusts Cathy) ⇒ (Alice trusts Cathy). This is not generally true. We posit that transitivity may hold if certain conditions are met. We have termed this conditional transitivity and the conditions that may allow transitivity are (with reference to the example above): a) Bob explicitly communicates his trust in Cathy to Alice, as a 'recommendation'. b) Alice trusts Bob as a recommender, i.e. recommender trust exists in the system. c) Alice is allowed to make judgements about the 'quality' of Bob's recommendation (based on Alice's policies). d) Trust is not absolute, i.e. Alice may trust Cathy less than Bob does, based on Bob's recommendation.
PageRank is inferred from weblinks by interpreting them as transitive recommendation signals, in the domain of information-browsing. It would be incorrect to interpret SSB's links that way, because those links are not used to drive information-browsing, they are not used as recommendations, and they are not always transitive.
Proposes a model of Belief (in fact) statements and Trust (in user) statements. Uses real values (0-1). Rather than trying to get global knowledge of a trust graph, it uses local knowledge from neighbors. Also, it calculates trust from the perspective of an agent. Arguably, this is a generalization of PageRank.
Some good snippets:
> In this formulation, we consider a probabilistic interpretation of global belief combination. ... Imagine a random knowledge-surfer hopping from user to user in search of beliefs. At each step, the surfer probabilistically selects a neighbor to jump to according to the current user's distribution of trusts. Then, with probability equal to the current user's belief, it says "yes, I believe in the statement". Otherwise, it says "no". Further, when choosing which user to jump to, the random surfer will, with probability λi∈[0,1], ignore the trusts and instead jump directly back to the original user.
> ...
> The web of trust calculation is not susceptible to "link-spamming," a phenomenon in PageRank whereby a person may increase others' trust in him by generating hundreds of virtual personas which all trust him. In PageRank, the uniform random jump of the surfer means that each virtual persona is bestowed some small amount of PageRank, which they 'give' to the spammer, thus increasing her rank. With a web of trust, this technique gains nothing unless the user is able to convince others to trust her virtual personas, which we expect will only occur if the personas actually provide useful information.
> ...
> Kamvar et al.'s EigenTrust algorithm [21], which computes global trusts as a function of local trust values in a peer-to-peer network, is very similar to our probabilistic interpretation of trusts presented in section 4. One key difference is that we allow trusts and beliefs to vary; they are personalized for each user based on her personal trusts. In contrast, EigenTrust computes a global trust value (similar to PageRank) and emphasizes security against malicious peers who aim to disturb this calculation.
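The random knowledge-surfer quoted above can be simulated directly. This is a simplified Monte-Carlo reading of the model with a small seeded RNG for reproducibility; the graphs, weights, and λ handling are invented for illustration, not taken from the paper's code:

```javascript
// Estimate a merged belief by simulating the random knowledge-surfer:
// at each step, with probability lambda jump back to the origin,
// otherwise hop to a neighbor drawn from the current user's trust
// distribution; finally answer "yes" with probability equal to the
// landing user's belief. Returns the fraction of "yes" answers.
function surferBelief(trusts, beliefs, origin, lambda, steps, runs, seed = 1) {
  let s = seed // Lehmer LCG state; seed must be non-zero
  const rand = () => (s = (s * 48271) % 2147483647) / 2147483647
  const pickNeighbor = (user) => {
    let r = rand()
    for (const [neighbor, weight] of Object.entries(trusts[user])) {
      if ((r -= weight) <= 0) return neighbor
    }
    return user // numeric slack: stay put
  }
  let yes = 0
  for (let i = 0; i < runs; i++) {
    let user = origin
    for (let j = 0; j < steps; j++) {
      user = rand() < lambda ? origin : pickNeighbor(user)
    }
    if (rand() < beliefs[user]) yes++
  }
  return yes / runs
}
```

With λ = 1 the surfer never leaves the origin, so the estimate collapses to the origin's own belief; smaller λ blends in the beliefs of trusted neighbors.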
Both Advogato and the EigenTrust algorithm rely on a trusted seed-set to defeat malicious rings:
> A malicious collective is a group of malicious peers who know each other, who give each other high local trust values and give all other peers low local trust values in an attempt to subvert the system and gain high global trust values. We address this issue ... by having each peer place at least some trust in the peers P that are not part of a collective. Probabilistically, this is equivalent to saying that the agent that is crawling the network ... is less likely to get stuck crawling a malicious collective, because at each step, he has a certain probability of crawling to a pre-trusted peer.
Eigentrust uses "transaction ratings" as its signal. After downloading from a peer, the user rates the transaction.
> each time peer i downloads a file from peer j, it may rate the transaction as positive (1) or negative (-1). Peer i may rate a download as negative, for example, if the file downloaded is inauthentic or tampered with, or if the download is interrupted. Like in the eBay model, we may define a local trust value s_ij as the sum of the ratings of the individual transactions that peer i has downloaded from peer j.
This is a bit like reddit karma, but for download transactions instead of link submissions. The "karma" is then normalized to range 0-1 for each peer connection:
> In order to aggregate local trust values, it is necessary to normalize them in some manner. Otherwise, malicious peers can assign arbitrarily high local trust values to other malicious peers, and arbitrarily low local trust values to good peers, easily subverting the system. We define a normalized local trust value, c_ij, as follows:
>
> `c_ij = max(s_ij, 0) / Σ_j max(s_ij, 0)`
>
> This ensures that all values will be between 0 and 1.
This is pretty intuitive. Here's an example:

    Bob:     5 karma from me
    Alice:  10 karma from me
    Charlie: -1 karma from me

    Bob normalized     =  5 / (5 + 10 + 0) = 0.333333
    Alice normalized   = 10 / (5 + 10 + 0) = 0.666667
    Charlie normalized =  0 / (5 + 10 + 0) = 0
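The same normalization as code (a sketch; the function and field names are mine, not EigenTrust's):

```javascript
// EigenTrust-style normalization: clamp negative karma to zero, then
// divide each score by the sum of the clamped scores, so all values
// land in [0, 1] and sum to 1 (or are all 0 when nothing is positive).
function normalizeTrust(karma) {
  const clamped = {}
  for (const [peer, s] of Object.entries(karma)) {
    clamped[peer] = Math.max(s, 0)
  }
  const total = Object.values(clamped).reduce((a, b) => a + b, 0)
  const normalized = {}
  for (const peer of Object.keys(clamped)) {
    normalized[peer] = total > 0 ? clamped[peer] / total : 0
  }
  return normalized
}
```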
> Another work is that of Yu and Singh [8]. This model builds a social network among agents that supports participants' reputation both for expertise (providing service) and helpfulness (providing referrals). Every agent keeps a list of its neighbors, that can be changed over time, and computes the trustworthiness of other agents by updating the current values by testimonies obtained from reliable referral chains. After a bad experience with another agent every agent decreases the rating of the 'bad' agent and propagates this bad experience throughout the network so that other agents can update their ratings accordingly.
> ...
> to formulate the problem of managing trust in a decentralized information system we can partition it now, more precisely, into three subproblems that need to be studied:
>
> - Global trust model: What is the model that describes whether an agent is trustworthy? This model could be based on simple statistical methods, on experiences gained in economic and sociological sciences or on game-theoretic foundations.
> - Local algorithm to determine trust: What is the computational procedure that an agent can apply in order to determine trust under the limitations discussed above, which are the unreliability of the agents providing trust data both with respect to their trustworthiness themselves as well as their reachability over the network. To what extent does this local algorithm provide a good approximation of the global model?
> - Data and communication management: Can the local algorithm be implemented efficiently in a given data management infrastructure? In order to obtain scalable implementations the resources required by each agent must scale gracefully in the agent population size n, ideally as O(log n).
These guys handle reputation by assuming honesty, and then looking for complaints.
A good observation on what can make p2p harder than the Web:
...
Uses very simple good/bad ratings in a social graph, and then defines a function for computing reputation in a situation where no direct edge is present.
From the survey paper, a good point about locally-computed trust (TrustMail, "Trust management for a semantic Web") vs globally-computed (EigenTrust):
> In Massa and Avesani [56], the problem of controversial users (those who are both trusted and distrusted) is addressed. This work shows that the globally computed trust value (in a web of trust) for a controversial user may not be as accurate as a locally computed value due to the global disagreement on trust for that user. Golbeck and Hendler's TrustMail also performs a local computation of reputation within a web of trust.
They deepen the data model, (1) by separating beliefs into domains of knowledge, and (2) by creating different kinds of trust.
The papers often separate into three interlocking areas:
> Our major contribution to Semantic Web trust management through this work is twofold. First, we introduce a classification scheme for trust metrics along various axes and discuss advantages and drawbacks of existing approaches for Semantic Web scenarios. Hereby, we devise an advocacy for local group trust metrics, guiding us to the second part which presents Appleseed, our novel proposal for local group trust computation.
Their trust-metric classification is useful (their first contribution):
> 2.1.1. Network perspective. The first dimension influences semantics assigned to the values computed. Trust metrics may basically be subdivided into ones with global, and ones with local scope. Global trust metrics take into account all peers and trust links connecting them. Global trust ranks are assigned to an individual based upon complete trust graph information. Many global trust metrics, such as those presented in Kamvar, Schlosser and Garcia-Molina (2003), Guha (2003), and Richardson, Agrawal and Domingos (2003), borrow their ideas from the renowned PageRank algorithm Page et al. (1998) to compute web page reputation. The basic intuition behind the approach is that nodes should be ranked higher the better the rank of nodes pointing to them. Obviously, the latter approach works for trust and page reputation likewise.
>
> Trust metrics with local scope, on the other hand, take into account personal bias. Interestingly, some researchers claim that only local trust metrics are "true" trust metrics, since global ones compute overall reputation rather than personalized trust (Mui, Mohtashemi and Halberstadt, 2002). Local trust metrics take the agent for whom to compute trust as an additional input parameter and are able to operate on partial trust graph information. The rationale behind local trust metrics is that persons an agent a trusts may be completely different from the range of individuals that agent b deems trustworthy. Local trust metrics exploit structural information defined by personalized webs of trust. Hereby, the personal web of trust for individual a is given through the set of trust relationships emanating from a and passing through nodes it trusts either directly or indirectly, as well as the set of nodes reachable through these relationships. Merging all webs of trust engenders the global trust graph. Local trust metrics comprise Levien's Advogato trust metric (Levien and Aiken, 2000), metrics for modelling the Public Key Infrastructure (Beth, Borcherding and Klein, 1994; Maurer, 1996; Reiter and Stubblebine, 1997), Golbeck's metrics for Semantic Web trust (Golbeck, Parsia and Hendler, 2003), and Sun Microsystems's Poblano (Chen and Yeager, 2003). The latter work hereby strongly resembles Abdul-Rahman and Hailes (1997).
These guys advocate local computation. Their point above is that global computation can only rate reputation, not trust. Since trust is relative, that's intuitive. As I pointed out earlier, local computation is used in Advogato and EigenTrust to defeat malicious sets, and I think even PageRank (in practice at Google) uses its own trusted seed-set to do the same.
Section 2.1.2 (computation locus) characterizes distributed vs centralized computation. By their definition, we're the latter -- we collect data to a local node and run all the trust computation there.
> Partial trust graph exploration. Global metrics require a priori full knowledge of the entire trust network. Distributed metrics store trust values for all agents in the system, thus implying massive data storage demands. On the other hand, when computing trusted neighborhoods, the trust network only needs to be explored partially: originating from the trust source, one only follows those trust edges that seem promising, i.e., bearing high trust weights, and which are not too far away from the trust source. Inspection of personal, machine-readable homepages is thus performed in a just-in-time fashion. Hence, prefetching bulk trust information is not required.
Appleseed is a path-algebra approach based on real-valued trust ratings. It's presented as an improvement over Advogato's algorithm. They claim attack-resistance. They also suggest a mechanism for differentiating nodes as being trusted for their data versus trusted for their recommendations of other users.
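A heavily simplified spreading-activation sketch in the spirit of Appleseed. The real algorithm adds backward edges to the source, normalization, and other details this omits; the graph and constants here are invented:

```javascript
// Energy injected at the source flows along weighted trust edges and
// decays as it goes; each visited node keeps (1 - decay) of its inflow
// as accumulated trust rank. Flows below eps are dropped, so the loop
// terminates even on cyclic graphs.
function spread(graph, source, energy = 100, decay = 0.85, eps = 0.01) {
  const rank = {}
  let inflow = { [source]: energy }
  while (Object.keys(inflow).length > 0) {
    const next = {}
    for (const [node, e] of Object.entries(inflow)) {
      if (e < eps) continue
      rank[node] = (rank[node] || 0) + (1 - decay) * e
      const edges = graph[node] || {}
      const total = Object.values(edges).reduce((a, b) => a + b, 0)
      if (total > 0) {
        for (const [succ, w] of Object.entries(edges)) {
          next[succ] = (next[succ] || 0) + decay * e * (w / total)
        }
      }
    }
    inflow = next
  }
  return rank
}
```

Nodes reached over high-weight edges close to the source accumulate the most rank, which is the intuition behind local group trust metrics.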
> Abstract. TRELLIS provides an interactive environment that allows users to add their observations, opinions, and conclusions as they analyze information by making semantic annotations about on-line documents. TRELLIS includes a vocabulary and markup language for semantic annotations of decisions and tradeoffs, and allows users to extend this vocabulary with domain specific terms or constructs that are useful to their particular task. To date, we have used TRELLIS with a variety of scenarios to annotate tradeoffs and decisions (e.g., military planning), organize materials (e.g., search results), analyze disagreements and controversies on a topic (e.g., intelligence analysis), and handle incomplete and conflicting information (e.g., genealogy research).
/* going to add more here as I continue reading papers */
/* something else to explore is the relationship between users and authorities, when you introduce non-binary authorities. The HITS algorithm (https://en.wikipedia.org/wiki/HITS_algorithm) explores this, I think. A domain registrar or CA is an example authority -- what happens when you no longer give binary authority, but instead require consensus? */
A general search program for SSB needs to solve