Open europe-asia-america opened 5 years ago
Report by tomorrow on the approach you will take and the algorithm you want to use for each (review and post).
Rankings must be stored in the database to reduce overhead.
The points we concluded on are:
1. Rank only new posts with respect to time.
2. Observe a significant number (S.N.) of upvotes, and do the required ranking only for those.
3. Use your intuition to set the shift constants.
Our aim? To create a good enough ranking algorithm for KhanaBot to rank (or sort, whichever verb you prefer) posts (submitted by users and shown on the newsfeed), reviews (which by default appear only on the restaurant profile, and on the newsfeed if a review has enough "weight"), and ratings.
Below is my preliminary research on what I believe are the best-in-class ranking algorithms publicly available for our needs. The other ones you want (the Amazon, Tinder, and Zomato algorithms) are all kept secret by those companies, for the very reason KhanaBot will need to keep its own ranking algorithm private and keep upgrading it: people (specifically restaurants) will attempt to game the algorithm to increase their visibility to users. With that in mind, let's move on to the algorithms.
Let's take a look at Reddit's ranking algorithms in detail, and then look at how we can optimize KhanaBot's ranking algorithm in light of that information.
Reddit allows users to select different algorithms to sort their newsfeed. However, the default ranking algorithm for posts is different from the one used for the comments on posts. The default ranking algorithm for posts is called "hot".
Here is the algorithm for the Reddit hot ranking.
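A Python sketch of it, adapted from the code published in the Medium article in reference [1] below (the variable names `diff`, `weight`, and `timediff` are chosen to match the discussion that follows, not the article's own names; Reddit's production implementation may differ):

```python
from datetime import datetime
from math import log

EPOCH = datetime(1970, 1, 1)

def epoch_seconds(date):
    """Seconds elapsed from the Unix epoch to `date`."""
    td = date - EPOCH
    return td.days * 86400 + td.seconds + td.microseconds / 1e6

def hot(ups, downs, date):
    """Reddit-style 'hot' score for a post submitted at `date` with the given votes."""
    diff = ups - downs                          # net vote score
    weight = log(max(abs(diff), 1), 10)         # log10 of the net score
    sign = 1 if diff > 0 else -1 if diff < 0 else 0
    # Age term: newer posts get a larger value; 1134028003 and 45000 are the
    # constants given in [1].
    timediff = (epoch_seconds(date) - 1134028003) / 45000
    return round(sign * weight + timediff, 7)
```

The key asymmetry: `timediff` grows linearly with submission time, while `weight` grows only logarithmically with the net vote score.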
It is quite a simple algorithm once you understand it. The understanding part takes the most time.
Anyway, let's see its implications:
- `timediff`, i.e. the submission time of the post, has the biggest impact on the ranking. The algorithm ranks newer posts higher than older posts by default: newer posts get a higher `timediff` than older ones. This is a different approach from HN's algorithm, which decreases the ranking score as time passes. In both cases you underweigh older posts and force them to compensate in other ways; here, that other way is `weight`.
- In essence, the ranking increases as `weight` increases, and `weight` is the log of `diff`. So, to "keep up" with newer posts that start out with a higher default ranking because of a higher `timediff`, older posts need their `diff` to increase exponentially.
- Different weightings of `timediff` (plus the rounding of the ranking) give us different levels of pressure on the posts to perform.
- The higher the weighting of `timediff` (which is 45000 here), the slower the increase in the default value of `timediff` for newer posts. With 45000, the default value of `timediff` increases by around 2 points after 24 hours. That doesn't make much difference to the storage size of the ranking, since we are still storing 7 digits of precision. What it does change is how much `weight` is needed to overcome the "age disadvantage".
- Say we have two posts, A and B, where B is the newer post, submitted 45000 seconds after A. If the `weight` of A and B are equal, then the `ranking` of B is 1 point higher than the `ranking` of A. So if A is to "keep up", it needs to gain 1 more point to even the score, and the only place to get it is `weight`. Since `weight` is a base-10 logarithm of `diff`, the `diff` of A has to be 10x the `diff` of B for the two scores to be equal.
- This means that, in general, to compensate for 12.5 hours of elapsed time, a post has to 10x its current `diff` to hold its position against a newer post with the same `diff` (see the worked check after the references below).
- This has deeper implications. The higher the `time_weight`, the more the pressure on posts to perform by increasing their `diff` exponentially.
- Because the ranking uses `diff` (upvotes minus downvotes) rather than the vote ratio, it also shapes what sorts of posts rise. Say there are two posts: A, a picture of a cat, and B, a 500-word criticism of something popular, such as the Steam gaming platform. A has 1000 upvotes and 900 downvotes; it is controversial, and its `diff` = 100. B has 100 upvotes and 0 downvotes; it is uncontroversial, but not as "useful" to users as A is. If both were posted at the same time, both will have an equal `ranking`, even though post A has benefited more people, and despite being more controversial. This means that lower-quality posts can rise to the top as long as they are less controversial. One could theoretically correct for this (perhaps by adding more terms to the `ranking` function), but that would also increase the complexity of the function. It is a tradeoff.

References:
[1] https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9 - the main source of understanding, with the algorithm code and (incorrect) mathematical notation. Warning: the code and the mathematical notation of the hot algorithm differ in a crucial way; I followed the code, since that seems to be the right implementation.
[2] https://news.ycombinator.com/item?id=231168 More understanding of the hot ranking algorithm.
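To make the 45000-second / 10x-`diff` tradeoff and the cat-picture example above concrete, here is a quick numeric check with the `hot()` sketch from earlier (the submission times are made up, chosen only so that the two posts are exactly 45000 seconds apart):

```python
from datetime import datetime, timedelta

t_a = datetime(2020, 1, 1)               # hypothetical submission time of post A
t_b = t_a + timedelta(seconds=45000)     # post B arrives 12.5 hours later

# At equal diff, B starts one ranking point ahead of A:
print(hot(100, 0, t_b) - hot(100, 0, t_a))       # ~1.0

# A needs ~10x B's diff to pull level again:
print(hot(1000, 0, t_a), hot(100, 0, t_b))       # effectively the same score

# The cat-picture example: only diff matters, not the vote ratio,
# so 1000 up / 900 down ranks exactly like 100 up / 0 down:
print(hot(1000, 900, t_a) == hot(100, 0, t_a))   # True
```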
The confidence sort is very interesting. But first: why shouldn't you use the other algorithms we already know about for comment ranking?
Randall (Munroe, of xkcd) argues that using the naive ranking algorithm (`ranking = upvotes - downvotes`, the one reddit uses for its `top` sort) for comments isn't that smart, since it is heavily biased towards comments posted early: early comments sit at the top, people usually read only the top comments and their child comments, so the upvotes usually go to those comments only. The newer ones never see the light of day.

"reddit is heavily biased toward comments posted early. When a mediocre joke gets posted in the first hour a story is up, it will become the top comment if it's even slightly funny. [...] The reason for this bias is that once a comment gets a few early upvotes, it's moved to the top. The higher something is listed, the more likely it is to be read (and voted on), and the more votes a comment gets. It's a feedback loop that cements the comment's position, and a comment posted an hour later has little chance of overtaking it - even if people reading it are upvoting at a much higher rate." -- https://redditblog.com/2009/10/15/reddits-new-comment-sorting-system/
Using the hot algorithm for comments is not a good idea either. Comments don't need to "churn" over time, and you don't need new comments at the top, because people usually read a post's comments only once. We just want what the community in question, using its own feedback, considers the "most relevant" or "best" comments on top.
Sure, if you are talking about KhanaBot reviews, then "hot" ranking may be better than "best", because users need the newer information. And yet, insightful reviews would fall down after a while.
"The idea was that it would make comments lose position after a certain time, but this led to even good comments dropping down to the bottom, and if you returned to the post in a day or two the ordering was completely nonsensical." -- https://redditblog.com/2009/10/15/reddits-new-comment-sorting-system/
"In a comment system you want ot rank the best comments highest regardless of their submission time." -- https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9
If the above line is true for your system, then the confidence ranking algorithm is for you.
And so we have what reddit calls the "best" sort, which is better called the "confidence" ranking, since it is based on the "Wilson score confidence interval for a Bernoulli parameter".
So how does this confidence ranking algorithm work?
Let's start with the algorithm.
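Below is a Python sketch of it, adapted from the code in the Medium article listed in the confidence-ranking references further down (the z value is the one used there, roughly an 80% confidence level; reddit's production code may differ):

```python
from math import sqrt

def confidence(ups, downs):
    """Lower bound of the Wilson score interval for the 'true' fraction of
    positive votes, given the observed upvotes and downvotes."""
    n = ups + downs
    if n == 0:
        return 0.0
    z = 1.281551565545               # z-score for ~80% confidence, as in the cited code
    p = ups / n                      # observed fraction of positive votes
    left = p + z * z / (2 * n)
    right = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    under = 1 + z * z / n
    return (left - right) / under
```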
This formula is basically calculating the lower bound of the Wilson score confidence interval for a Bernoulli parameter.
You can't use this for a 5-star scale unless you convert those 5 star ratings into binary upvotes and downvotes.
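One possible conversion, purely as an assumption of mine (none of the cited articles prescribes a mapping): count 4-5 stars as an upvote, 1-2 stars as a downvote, and drop 3-star ratings as neutral.

```python
def stars_to_vote(stars):
    """Hypothetical mapping from a 1-5 star rating to an (ups, downs) pair."""
    if stars >= 4:
        return (1, 0)        # 4-5 stars count as an upvote
    if stars <= 2:
        return (0, 1)        # 1-2 stars count as a downvote
    return (0, 0)            # 3 stars: neutral, ignored

# Aggregate a restaurant's (hypothetical) star ratings and rank with confidence():
ratings = [5, 4, 3, 2, 5, 4]
ups = sum(stars_to_vote(s)[0] for s in ratings)
downs = sum(stars_to_vote(s)[1] for s in ratings)
score = confidence(ups, downs)
```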
There is obviously no effect of submission time here - it doesn't take any such input, after all. Plus, it responds much better to small samples of feedback, which helps newer comments reach their "true" ranking (here defined as the fraction of positive ratings out of total ratings, eventually achieved) much faster.
The more votes, the closer the confidence ranking algorithm gets to more accurately estimating the actual ranking (as defined in the paragraph above).
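A quick illustration of that convergence, holding the positive fraction fixed at 90% while the number of votes grows (the outputs are approximate and depend on the z chosen above):

```python
for ups, downs in [(9, 1), (90, 10), (900, 100)]:
    print(ups, downs, round(confidence(ups, downs), 3))
# Roughly 0.72, 0.85, 0.89: the lower bound creeps up toward the true 0.9
# as the sample grows, so heavily-voted items are estimated more accurately.
```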
There is a theory that a more accurate version of the confidence sort would be to also take into account all the people that declined to rate. I am unsure of this: how would you weight those non-ratings?
"Indeed, it may be more useful in a "top rated" list to display those items with the highest number of positive ratings per page view, download, or purchase, rather than positive ratings per rating. Many people who find something mediocre will not bother to rate it at all; the act of viewing or purchasing something and declining to rate it contains useful information about that item's quality." -- www.evanmiller.org/how-not-to-sort-by-average-rating.html
References for confidence ranking:
- www.evanmiller.org/how-not-to-sort-by-average-rating.html
- https://redditblog.com/2009/10/15/reddits-new-comment-sorting-system/
- https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9
The consequence of a time based ranking algorithm (reddit, HN) is that there exist certain optimal submission times, where your article gets a boost compared to other articles.
It is easier to understand when you take into account the converse: "Wall-clock hours penalize an article even if no one is reading (overnight, for example). A time denominated in ticks of actual activity (such as views of the 'new' page, or even upvotes-to-all-submissions) might address this." -- https://news.ycombinator.com/item?id=1781013
"Without checking the actual numbers, consider a contrived example: Article A is submitted at midnight and 3 votes trickle in until 8am. Then at 8am article B is submitted. Over the next hour, B gets six votes and A gets 9 votes. (Perhaps many of those are duplicate-submissions that get turned into upvotes.) A has double the total votes, and 50% more votes even in the shared hour, but still may never rank above B, because of the drag of its first 8 hours." -- https://news.ycombinator.com/item?id=1781013
Another consequence of a time based ranking algorithm is that the flurry of activity right after the post is submitted is crucial.
"An article that misses its audience first time through - perhaps due to (1) [non optimal submission time] or a bad headline - may never recover, even with a later flurry of votes far beyond what new submissions are getting." -- https://news.ycombinator.com/item?id=1781013
The reddit inflation approach to creating decay is supposedly better than HN's because it "plays nicely with database indexes", according to an HN user. I don't get what this means, but it seems important.
Ah, see: if you use an HN-style decay of post rankings over time, you need to keep updating the database with new scores for all posts as time passes. That is inefficient.
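Here is a small sketch of what that means in practice, under an assumed schema (not KhanaBot's actual one): with the Reddit-style `hot()` score, a post's stored ranking only changes when someone votes on it, so a plain index on the score column can serve the newsfeed query; with an HN-style decay the score changes as the clock ticks, so you would have to either rewrite every row periodically or recompute the score on every query.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    ups INTEGER,
    downs INTEGER,
    hot_score REAL)""")
conn.execute("CREATE INDEX idx_posts_hot ON posts (hot_score DESC)")

def record_vote(post_id, ups, downs, submitted_at):
    """Recompute only this post's score at vote time; it never decays afterwards."""
    conn.execute(
        "UPDATE posts SET ups = ?, downs = ?, hot_score = ? WHERE id = ?",
        (ups, downs, hot(ups, downs, submitted_at), post_id))  # hot() from the sketch above

# A post is inserted once, then rescored only when votes come in:
conn.execute("INSERT INTO posts VALUES (1, 0, 0, 0.0)")
record_vote(1, 10, 2, datetime(2020, 1, 1))   # hypothetical votes and submission time

# Newsfeed query: the index does the sorting; nothing needs recomputing over time.
top_posts = conn.execute(
    "SELECT id FROM posts ORDER BY hot_score DESC LIMIT 20").fetchall()
```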