Ranking algorithm - Githubissues

Our aim? To create a good enough ranking algorithm for KhanaBot to rank (or sort, whichever verb you prefer) posts (that come up on newsfeed, submitted by users), reviews (that will by default appear only in the restaurant profile, and on the newsfeed if a review has enough "weight"), and ratings.

The below is my preliminary research on what I believe are the best in class ranking algorithms out there in public for our needs. The other ones that you want: Amazon, Tinder, Zomato algorithms - they are all kept secret and private by the companies for the very reason KhanaBot will need to keep it's own ranking algorithm private and will have to add upgrades and changes: people (specifically restaurants) will attempt to game the algorithm to increase their own visibility to users. With that in mind, let's move to the algorithms.

Let's take a look at Reddit's ranking algorithms in detail, and then look at how we can optimize KhanaBot's ranking algorithm given light of the info:

Reddit allows users to select different algorithms to sort their newsfeed. However, their default ranking algorithm for posts is different from the one they use for the comments on posts. Their default ranking algorithms for posts is called "hot".

Here is the algorithm for the Reddit hot ranking, written in half English, half pseudocode:

Input:
A, time the entry was posted, in seconds since UNIX epoch
U, number of upvotes
D, number of downvotes

Algorithm:
B = time at December 8, 2005, 07:46:43 AM, in seconds since UNIX epoch 
// an arbitrary comparison moment in time, selected probably because around that time they were reimplementing the redding hot ranking algorithm

timediff = abs(A - B) 
// the difference in seconds between the two points in time (which is actually what we need)

diff = U - D 
// difference between the upvotes and the downvotes

if (diff > 0) sign = 1;
if (diff == 0) sign = 0;
if (diff < 0) sign = -1;
// assign a value to sign variable, equal to sign of diff.

z = max {|diff|, 1}
// z is equal to the maximum value of the above two values
// this is so we don't add a number less than 1 to the log() function given below

weight = sign * log(z, 10)
// add weight to the rating based on the logarithm of the diff of the votes, and give sign to the weight so it is positive for U>D, and negative for U<D

time_weight = 45000
weighted_timediff = timediff / time_weight
// add weight to the timediff, divided by 45000 seconds (12.5 hours). The selection of the number 45000 is quite arbitrary, you can select something else.
// But know that increasing the value of time_weight increases the pressure on old posts to perform with exponential growth to compensate for its age to maintain its position. This is explained in more detail below.

rating_unrounded = weight + weighted_timediff
// The higher the timediff, the higher the ranking by default. So the newer the post, the higher its ranking, all other things being equal. And this score doesn't change (it doesn't decrease as time passes, but in comparision to newer posts, it has a lesser default score. So it needs to compensate for that using weight if it wants to stay up)

ranking = round(rating_unrounded, 7)
// round rating_unrounded to 7 decimal points

Output:
ranking, a floating point number rounded to the required precision (in the reference code, reddit used 7 decimal places as required precision)

It is quite a simple algorithm once you understand it. The understanding part takes the most time.

Anyway, let's see its implications:

timediff, aka submission time of the post, has the biggest impact on the ranking. The algorithm will rank newer posts higher than the older posts by default.
Newer posts get a higher timediff by default than older. This is a different approach than HN's algorithm, which decreases the ranking score as time passes. In both cases, you underweigh older posts and therefore force them to compensate for it using other ways. Here, it is weight.
In essence, the ranking increases as weight increases, which is the log of diff. So, to "keep up" with newer posts that start out with a higher default ranking because of higher timediff, the posts need to have diff increasing exponentially.
ifferent weights on timediff (plus rounding of the ranking) gives us different levels of pressure on the posts to perform.

The higher the weighing of the timediff (which is 45000 here), the slower the increase in the default value of timediff for newer posts. Here, what would happen is that after 24 hours, the default value of timediff increases by around 2 points. Which doesn't really make much difference in the storage size of ranking, as we are still storing 7 digit precision. What it makes a difference to, however, is how much weight is needed to overcome the "age disadvantage".

Say we have two posts, A and B. B is the new post, posted 45000 seconds later after A. This means that, if weight of A and B are equal, then ranking of B is 1 point more than ranking of A. So if A is to "keep up" its ranking, it needs to gain 1 point more to even the score between it and B. The only way to do that is to get 1 point from weight. As weight is a logarithmic function of diff to the base 10, the diff of A has to equal 10x the diff of B to have an equal score.

This means that, in general, to compensate for the elapsed time of 12.5 hours, a post has to 10x its current amount of diff to maintain its position compared to a newer post with its current diff score.

This has deeper implications. The higher the time_weight, the more the presure on posts to perform by increasing its diff exponentially.

Because we use log of diff, this has implications about what sorts of posts make it higher. Say there are two posts: A, a picture of a cat, and B, a 500 word criticism of something popular with people such as Steam gaming platform. Let's say A has 1000 upvotes and 900 downvotes. It is controversial. diff = 100. B has 100 upvotes, 0 downvotes. It is uncontroversial, but not as "useful" to the users as much as A is. If both were posted at the same time, both will have equal ranking, even though post A has benefited more people, even though it was more controversial. This means that posts that are lesser quality will come up to the top, as long as they are less controversial. One could theoretically correct for this (perhaps by adding more terms in the ranking function), but that will also increase the complexity of the function. It is a tradeoff.

Reference: [1] https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9 The main source of understanding, algorithm code and (incorrect) mathematical notation of the algorithm. Warning, the code and the mathematical notation of the hot algorithm differs in a crucial way. I selected the code because that seems to be the right implementation.

[2] https://news.ycombinator.com/item?id=231168 More understanding of the hot ranking algorithm.

The confidence sort is very interesting. So, why shouldn't you use other algorithms you know about for comment ranking?

Randall argues that using the naive ranking algorithm (ranking = upvotes - downvotes that reddit uses for top) for comments isn't that smart since it seems to be heavily biased towards comments posted early, because they are usually at the top, and people usually only read the comments at the top and the top comments' child comments, so their upvotes usually go to these comments only. The newer ones never see the light of day.

"reddit is heavily biased toward comments posted early. When a mediocre joke gets posted in the first hour a story is up, it will become the top comment if it's even slightly funny. [...] The reason for this bias is that once a comment gets a few early upvotes, it's moved to the top. The higher something is listed, the more likely it is to be read (and voted on), and the more votes a comment gets. It's a feedback loop that cements the comment's position, and a comment posted an hour later has little chance of overtaking it - even if people reading it are upvoting at a much higher rate." -- https://redditblog.com/2009/10/15/reddits-new-comment-sorting-system/

Using the hot algorithm for comments is not a good idea either. After all, comments don't need to "churn" over time and you don't need to have new comments at the top because people usually read a post's comments only once. We just want what we consider to be the "most relevant" or "best" commments on top, as decided by the community in question using their own feedback.

Sure, if you are talking about KhanaBot reviews, then you would find that using "hot" ranking may be better than "best", because you need to know the newer information. And yet, insightful reviews would fall down after a while.

"The idea was that it would make comments lose position after a certain time, but this led to even good comments dropping down to the bottom, and if you returned to the post in a day or two the ordering was completely nonsensical." -- https://redditblog.com/2009/10/15/reddits-new-comment-sorting-system/

"In a comment system you want ot rank the best comments highest regardless of their submission time." -- https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9

If the above line is true for your system, then the confidence ranking algorithm is for you.

And so we have what reddit calls the "best" sort. What you would better call the "confidence" ranking, since it is based on the "Wilson score confidence interval for a Bernoulli parameter".

So how does this confidence ranking algorithm work?

Let's start with the algorithm.

Input:
U, number of upvotes
D, number of downvotes

Algorithm:

// why?
if (U + D == 0):
  return 0

else:
  // n is total number of ratings
  n = U + D

  // This is the z-score
  // The value of z is calculated based on the confidence you want: so for 85% you have the given hardcoded value. For 95% you will have another value.
  z = 1.281551565545

  // ofpr is observed fraction of positive ratings
  ofpr = float(U) / n

  // first term of numerator of Wilson score confidence interval
  left = ofpr + (1/(2*n)) * (z ** 2)

  // second term of numerator of Wilson score confidence interval
  right = z * sqrt( (ofpr * (1 - ofpr) / n) + ((z**2) / 4 * (n ** 2)) )

  // denominator of Wilson score interval
  under = 1 + (1 / n * (z ** 2))

  // calculate the lower bound of the Wilson score confidence interval
  return (left - right) / under

Output:
confidence, a float from 0 to 1 that signifies the "ranking", that is, the lower bound of the fraction of positive ratings, that the algorithm is 85% sure it will get to eventually.

This formula is basically calculating the lower bound of the Wilson score confidence interval for a Bernoulli parameter.

You can't use this for a 5-star scale unless you convert those 5 star ratings into binary upvotes and downvotes.

There is obviously no effect of submission time here - it doesn't take any such input after all. Plus it is much better at responding to smaller samples of feedback which helps newer comments get to their "true" ranking (here defined by the fraction of positive ratings with respect to total ratings, eventually acheived) much faster.

The more votes, the closer the confidence ranking algorithm gets to more accurately estimating the actual ranking (as defined in the paragraph above).

There is a theory that a more accurate version of the confidence sort would be to also take into account all the people that declined to rate. I am unsure of this: how would you weight those non-ratings?

"Indeed, it may be more useful in a "top rated" list to display those items with the highest number of positive ratings per page view, download, or purchase, rather than positive ratings per rating. Many people who find something mediocre will not bother to rate it at all; the act of viewing or purchasing something and declining to rate it contains useful information about that item's quality." -- www.evanmiller.org/how-not-to-sort-by-average-rating.html

References for confidence ranking: www.evanmiller.org/how-not-to-sort-by-average-rating.html https://redditblog.com/2009/10/15/reddits-new-comment-sorting-system/ https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9

The consequence of a time based ranking algorithm (reddit, HN) is that there exist certain optimal submission times, where your article gets a boost compared to other articles.

It is easier to understand when you take into account the converse: "Wall-clock hours penalize an article even if no one is reading (overnight, for example). A time denominated in ticks of actual activity (such as views of the 'new' page, or even upvotes-to-all-submissions) might address this." -- https://news.ycombinator.com/item?id=1781013

"Without checking the actual numbers, consider a contrived example: Article A is submitted at midnight and 3 votes trickle in until 8am. Then at 8am article B is submitted. Over the next hour, B gets six votes and A gets 9 votes. (Perhaps many of those are duplicate-submissions that get turned into upvotes.) A has double the total votes, and 50% more votes even in the shared hour, but still may never rank above B, because of the drag of its first 8 hours." -- https://news.ycombinator.com/item?id=1781013

Another consequence of a time based ranking algorithm is that the flurry of activity right after the post is submitted is crucial.

"An article that misses its audience first time through - perhaps due to (1) [non optimal submission time] or a bad headline - may never recover, even with a later flurry of votes far beyond what new submissions are getting." -- https://news.ycombinator.com/item?id=1781013

The reddit inflation approach to create a decay is supposedly better than HN because it "plays nicely with database indexes", according to an HN user. I don't get what this means, but it seems important.

Ah, see, if you use an HN style decay over time of post ranking, you need to update the database with the new score of all the posts as time passes to decay it. Now that is inefficient.

iit2015143 / kbtestserver

Ranking algorithm #8