
Rework global user accuracy metric #4680

Closed: Slayer95 closed this issue 3 years ago

Slayer95 commented 5 years ago

Motivation

As per my understanding, the accuracy of a user is currently calculated as a weighted average of the accuracies of their top plays, using the same weighting system pp does. This approach has several issues:

  1. The average is not a robust measure of central tendency.

This means that a single outlier accuracy reaching the top plays would heavily affect the resulting value, either boosting or reducing it sharply.

  2. The weighting system heavily biases the result towards the accuracy of the very top pp plays. However, the relationship between the pp gained from a play and the difficulty of the map is non-trivial, which makes the weights used questionable.

After successfully passing a map beyond the player's current comfort zone, it could turn out to be a top-pp play with low accuracy. In this situation, which is supposed to be joyful for the player, their global accuracy will drop sharply, resulting in a bad™ user experience.

  3. It's hard to use when comparing the skill of multiple players.

Example: if player A has 98% accuracy and player B has 95% accuracy, who is better, given that player A is at 500pp and player B at 800pp? The higher accuracy of player A casts a shadow over the higher ranking of player B, because player A could have "sacrificed" accuracy by playing harder maps in order to get pp. In the end, though, this interpretation of the higher accuracy is just a suspicion, not a reliable conclusion.
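For reference, a minimal sketch of the current scheme as I understand it (assuming accuracy reuses the 0.95^i decay that pp weighting applies to the ranked top plays):

```python
def weighted_accuracy(accs_by_pp_rank: list[float]) -> float:
    """Weighted average of top-play accuracies, best pp play first,
    normalized so the result stays a percentage."""
    weights = [0.95 ** i for i in range(len(accs_by_pp_rank))]
    total = sum(acc * w for acc, w in zip(accs_by_pp_rank, weights))
    return total / sum(weights)
```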

Proposal

I have three proposals to fix these three issues. Each of them more or less builds upon the previous one.

  1. Switch the calculation from a weighted average to the median of the relevant accuracies from the top pp plays (the median is a robust measure of central tendency).

  2. The displayed user accuracy should be a compound metric of 2 (or more) values which represent the performance of the user over different map difficulty brackets.

In particular, I am proposing to split the top pp plays into two sets, according to the maximum pp obtainable (or star rating?) on each map with the mods used.

Hence, the accuracy would be displayed as a ➜ b, where a is the median of the accuracies of the plays on the easier maps, and b is the median of the accuracies of the plays on the harder maps. Note that a would represent the accuracy for maps within the player's comfort zone, while b would represent the accuracy for maps which challenge the player's skill.

In this context, a ➜ b is to be read roughly as "a accuracy in comfortable maps, with b accuracy in harder maps".

  3. Establish manageable pp cutoffs, such as 100 * 2^n. Then, the accuracy becomes a list of medians, one for each pp range associated with map difficulty. In most places, the accuracy would be shown as just the two medians of the highest-difficulty sets; however, there would be a place in the user profile where the full list is shown, in order to compare several players. (A sketch of proposals 1 and 2 follows this list.)
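To make proposals 1 and 2 concrete, here is a minimal sketch, assuming plays are given as (map max pp, accuracy) pairs; the function names are illustrative, not taken from osu-accuracy:

```python
from statistics import median

def median_accuracy(plays: list[tuple[float, float]]) -> float:
    """Proposal 1: plain median of the top plays' accuracies."""
    return median(acc for _, acc in plays)

def range_accuracy(plays: list[tuple[float, float]]) -> tuple[float, float]:
    """Proposal 2: split the top plays at the median map difficulty
    (max obtainable pp with the mods used) and report the median
    accuracy of each half, displayed as a ➜ b."""
    by_difficulty = sorted(plays)  # sorts by map max pp, easiest first
    half = len(by_difficulty) // 2
    easier, harder = by_difficulty[:half], by_difficulty[half:]
    return (median(acc for _, acc in easier),
            median(acc for _, acc in harder))
```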
Slayer95 commented 5 years ago

So I demoed proposals 1 and 2 in osu-accuracy, and I queried what the results would be for a few osu! players (*1).

| osu! username | pp | diff. range (*4) | Stable acc. | Median acc. | Range acc. |
| --- | --- | --- | --- | --- | --- |
| BeowulF97 | 110 | [6, 48] | 81.46% | 78.74% | 77.02% -> 79.32% |
| Ego MS | 314 | [6, 60] | 90.02% | 91.35% | 94.72% -> 88.25% |
| gianluca01 | 752 | [20, 135] | 91.15% | 91.82% | 93.72% -> 89.66% |
| Moonbeam (*3) | 905 | [23, 69] | 99.24% | 99.55% | 100.00% -> 98.62% |
| IceSandslash (*2) | 1201 | [44, 197] | 92.44% | 93.27% | 95.93% -> 90.90% |
| Jiandae | 1553 | [31, 219] | 92.95% | 93.84% | 95.68% -> 91.69% |
| iRedak- | 3369 | [108, 166] | 99.99% | 100.00% | 100.00% -> 100.00% |
| AngelJuega1 | 3578 | [122, 455] | 93.87% | 94.36% | 95.38% -> 92.74% |
| XinCrin | 3791 | [142, 291] | 98.20% | 98.20% | 98.99% -> 97.50% |
| Angelxx12 | 4597 | [180, 716] | 92.63% | 93.99% | 95.22% -> 92.51% |
| nathan on osu | 14692 | [609, 1068] | 98.81% | 99.23% | 99.51% -> 98.69% |

Observations

| osu! username | Observation |
| --- | --- |
| BeowulF97 | Stable value is outside the range. They have very few scores above 80%; the fact that stable yields 81.5% showcases how the current system is biased. |
| Ego MS | Range accuracy suggests a single representative accuracy of 91.485% (average of both values), which is quite far from stable's value. |
| gianluca01 | Here stable isn't as bad as in the previous case, but median acc. still beats it. |
| Moonbeam (*3) | Range accuracy clearly shows how this player often gets 100% plays in their comfort zone. |
| IceSandslash (*2) | Ditto. |
| Jiandae | Stable is terribly biased: the upper bound is twice as far from it as the lower bound. |
| iRedak- | Yeah, this will happen for those who often play with "Perfect"... 0.01% is not really a big deal (and stable may round up anyway?). |
| AngelJuega1 | Stable is actually closer to the average of the bounds, but I still trust the median more. |
| XinCrin | Neat. |
| Angelxx12 | The bias is terrible: (hi - stable)/(stable - lo) > 20. |
| nathan on osu | Median value is 0.42% higher: nathan deserves this. |

(*1) This is not a representative sample; it's just meant to showcase the proposal.
(*2) Full disclosure: this is myself.
(*3) EDIT: Added Moonbeam.
(*4) EDIT2: Added the approximate difficulty ranges of their top plays, considering mods, as their max obtainable pp.

camellirite commented 5 years ago

I like the idea of median; it's simple and makes more sense than weighting accuracy. "Range acc" is something I don't quite understand. Would it make accuracy always have 2 values? Having 2 different numbers for accuracy is kinda weird; I'm not sure I agree with that.

Is there an issue with using the average without weighting? That's another way I'd consider, though I haven't thought about it too much; it could be flawed.

Would be cool if other people chipped in with their opinions too, to see if anyone wants to keep the weighted acc or exchange it for a different system.

holly-hacker commented 5 years ago

Why median over average? I understand why you want to get rid of the weighted average, but just lowering the weight of top plays, or making them all have the same weight, seems better to me. Median gives an objectively wrong result for players like iRedak-; it literally ignores everything but the middle score (you could have 51% of your top plays be SS and the rest be 70% passes, and still have 100% accuracy).

Slayer95 commented 5 years ago

> Is there an issue with using the average without weighting? That's another way I'd consider, though I haven't thought about it too much; it could be flawed.

> Why median over average? I understand why you want to get rid of the weighted average [...]

The arithmetic mean is too susceptible to outlier values. Let's show this with an example:

Example: A new player installs osu! and plays a low-end Hard map, getting 95%. "Too ez," he says, "I'll play Insane." He then plays 5 Insane maps, getting accuracies of ~70%. Later on, on social networks:

> New Player: Hey! I finally installed the game you told me about, osu!
> Old Player: Oh, really? That's great!
> Old Player: So, how good are you? Your accuracy?
> New Player: Well, I've been getting around 70%.
> (Old Player: lol that's so bad, well whatever)
> Old Player: Let's have some multiplayer matches

*(Old Player gets online, searches for their friend)*
*(Old Player reads in their profile: `Accuracy: 74.17%`. Oh, I guess they are not that bad?)*
*Insane games played together. Accuracies: 70%, 71%, 68%, 69%, 72%, 70%.*

The value of 74.17% above was calculated with an unweighted average, (5*70 + 95)/6. For comparison, the median would be 70%. When they play together, the median turns out to be a much better predictor of their typical accuracy. This is the large effect of a single outlier: the 95% accuracy from the first, easy game.
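The same arithmetic in code (a minimal sketch, taking the five Insane plays as exactly 70%):

```python
from statistics import mean, median

accs = [95, 70, 70, 70, 70, 70]  # one easy Hard map, then five Insane maps
print(f"{mean(accs):.2f}%")      # 74.17% -> what an unweighted average displays
print(f"{median(accs):.2f}%")    # 70.00% -> much closer to their typical accuracy
```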

tl;dr: the average is too vulnerable to outlier values. This issue is exacerbated by the current weighting system, but in the end it's intrinsic to the arithmetic mean.

Slayer95 commented 5 years ago

> but just lowering the weight of top plays, or making them all have the same weight, seems better to me. Median gives an objectively wrong result for players like iRedak-; it literally ignores everything but the middle score (you could have 51% of your top plays be SS and the rest be 70% passes, and still have 100% accuracy).

I understand what you are talking about, but first allow me to note that iRedak- is not an example of that. See the distribution of their plays:

*(image: Perfect mod plays graph)*

In contrast, the players that are really concerning are those with this kind of play distribution:

*(image: dual difficulty plays graph)*

iRedak- is someone I'd categorize as a "Perfect mod player", while the second dataset corresponds to what I call a "dual difficulty player".

Arguably, the median provides a better predictor for the typical accuracy of a "Perfect mod player" (100%), though some may disagree; feel free to do so over a value difference of 0.01%.

As for "dual difficulty players", my proposal number 2 exists in order to provide a better tool for them. We need a better estimate because of another of the issues the arithmetic mean has: the result may not be one of the data points. In the dataset for the dual difficulty player above, the average unweighted accuracy is 86.12%, which is absolutely not a typical accuracy for their plays.

So how does proposal 2 fix this? (A visual example should help, so that @camellirite understands too.)

> The displayed user accuracy should be a compound metric of 2 (or more) values which represent the performance of the user over different map difficulty brackets.

What are these "different map difficulty brackets"? See:

*(image: dual difficulty plays graph, split at the median difficulty)*

So we take the median of the lower difficulty set (i.e. lower map max pp) and the median of the higher difficulty set. Those turn out to be 98.34% and 74.38%. Displaying this as 98.34% ➜ 74.38% conveys that, as they progress to higher difficulties, their typical accuracy becomes the second value.

tl;dr: while the median yields a bad value for EXTREME dual difficulty players, the average does too; proposal 2, "range accuracy", fixes this issue, while also improving the analysis of common players, who are partly dual-difficulty players too.

holly-hacker commented 5 years ago

> A new player installs osu! and plays a low-end Hard map, getting 95%. "Too ez," he says, "I'll play Insane." He then plays 5 Insane maps, getting accuracies of ~70%.

You generally don't pass Insane difficulties the day you install osu!. After the player has been playing for a while, they will find the difficulty range they consistently set their top scores in (which in the current pp meta are exclusively FCs or near-FCs), and the accuracy of their top scores won't fluctuate as much. There is no value in accounting for a player that has only played 6 times.

My argument for iRedak- was that his profile would display a 100% accuracy rating, which is simply wrong. The example I gave with 51% SSes was to illustrate it with an extreme case. As I said, the median ignores everything except the middle value. That is a lot of lost information.

Splitting accuracy up into 2 values is simply a bad idea. Not many players will have their top plays neatly separated into 2 partitions. As for the people who do, their natural growth will screw up these partitions anyway, as they set new scores that push their old ones down. Then you will be mixing old low-acc plays with new high-acc plays, and your accuracy value for the high-acc plays would lose its intended meaning.

So in the end, I don't think the median is a good metric to replace (weighted) accuracy. There may be other metrics that are better (perhaps an unweighted average, or an average weighted by the age of the score), but it has to stay simple, intuitive and correct.

However, I do like the idea of displaying accuracy in more detail. One way could be showing the values of a boxplot (the values at the 0%, 25%, 50%, 75% and 100% percentiles, ignoring statistical outliers) so a visitor can tell the distribution of your accuracy. This would probably be hidden in a tooltip, though, as it is pretty verbose.
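A rough sketch of how those values could be computed (using `statistics.quantiles` for the quartiles; the outlier filtering is left out):

```python
import statistics

def acc_distribution(accs: list[float]) -> list[float]:
    """Five-number summary of top-play accuracies: min, Q1, median, Q3, max."""
    q1, q2, q3 = statistics.quantiles(accs, n=4)
    return [min(accs), q1, q2, q3, max(accs)]
```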

These are just my opinions, others may disagree with me.

Slayer95 commented 5 years ago

@HoLLy-HaCKeR

> You generally don't pass Insane difficulties the day you install osu!. After the player has been playing for a while, they will find the difficulty range they consistently set their top scores in (which in the current pp meta are exclusively FCs or near-FCs), and the accuracy of their top scores won't fluctuate as much. There is no value in accounting for a player that has only played 6 times.

That's a fair objection to the example I used. However, it still stands that averages are not robust measures of location, weighted or not.

> My argument for iRedak- was that his profile would display a 100% accuracy rating, which is simply wrong.

> Arguably, the median provides a better predictor for the typical accuracy of a "Perfect mod player" (100%), though some may disagree; feel free to do so over a value difference of 0.01%.

I believe that claiming that the accuracy of a dedicated Perfect mod player is 100% is not wrong, but very right. It ignores that one time they got 99.7% because they forgot to turn the mod on, but so what? That's better, because that value is a clear outlier.

Also, please allow me to respectfully point out that you may be affected by a cognitive bias, rejecting a result of 100% because it's the maximum possible value and would imply some sort of perfection from an (accidentally) almost perfect record.

> Splitting accuracy up into 2 values is simply a bad idea. Not many players will have their top plays neatly separated into 2 partitions. As for the people who do, their natural growth will screw up these partitions anyway, as they set new scores that push their old ones down. Then you will be mixing old low-acc plays with new high-acc plays, and your accuracy value for the high-acc plays would lose its intended meaning.

I am not sure I follow you. Of course, top plays will often form a single aggregate of accuracy values. However, it will always be possible to split them into two sets according to whether they are lower or higher than the median difficulty value (map max pp).

> However, I do like the idea of displaying accuracy in more detail. One way could be showing the values of a boxplot (the values at the 0%, 25%, 50%, 75% and 100% percentiles, ignoring statistical outliers) so a visitor can tell the distribution of your accuracy. This would probably be hidden in a tooltip, though, as it is pretty verbose.

Boxplots are nice, though they will be hard to understand in countries where basic education in statistics doesn't cover them (like mine). However, while they are useful in many cases, we must consider this exact use case: what information would a boxplot provide us?

Based on these fundamentals, it's also possible to build a boxplot from the "range accuracy" parameter.

Incidentally, I have already calculated what the quartiles would be for your boxplot proposal.

| Median acc. | Range acc. | Quartiles |
| --- | --- | --- |
| 80.60% | 80.35% -> 80.74% | 62.07%, 76.26%, 80.60%, 84.52%, 91.82% |
| 91.49% | 94.72% -> 88.25% | 77.04%, 88.25%, 91.49%, 94.97%, 100.00% |
| 91.82% | 93.72% -> 89.66% | 75.56%, 88.37%, 91.82%, 94.19%, 100.00% |
| 99.55% | 100.00% -> 98.62% | 95.26%, 98.58%, 99.55%, 100.00%, 100.00% |
| 93.27% | 95.93% -> 90.90% | 60.20%, 90.22%, 93.27%, 96.25%, 98.19% |
| 93.84% | 95.68% -> 91.69% | 81.55%, 90.30%, 93.84%, 95.73%, 100.00% |
| 100.00% | 100.00% -> 100.00% | 99.50%, 100.00%, 100.00%, 100.00%, 100.00% |
| 94.36% | 95.38% -> 92.74% | 80.08%, 92.46%, 94.36%, 96.36%, 99.69% |
| 98.20% | 98.99% -> 97.50% | 93.59%, 97.37%, 98.20%, 99.15%, 100.00% |
| 93.99% | 95.22% -> 92.51% | 74.43%, 91.15%, 93.99%, 95.65%, 98.57% |
| 99.23% | 99.51% -> 98.69% | 88.38%, 98.39%, 99.23%, 99.63%, 100.00% |

As you can see, the ranges I calculate have a smaller width than the interquartile range (the IQR always spans the middle 50% of the values), therefore providing a better sense of location.

EDIT: Added Moonbeam to the table.

holly-hacker commented 5 years ago

> That's a fair objection to the example I used. However, it still stands that averages are not robust measures of location, weighted or not.

Why does it matter whether it is robust or not? If a player sets a score with a really low or really high accuracy, can they not see their accuracy rise or drop by a bit? Keep in mind that we're only expecting to see values between about 80% and 100% in a user's top plays. The outliers this may produce are not so significant that the global accuracy becomes inaccurate.

> I believe that claiming that the accuracy of a dedicated Perfect mod player is 100% is not wrong, but very right. It ignores that one time they got 99.7% because they forgot to turn the mod on, but so what? That's better, because that value is a clear outlier.
>
> Also, please allow me to respectfully point out that you may be affected by a cognitive bias, rejecting a result of 100% because it's the maximum possible value and would imply some sort of perfection from an (accidentally) almost perfect record.

The accuracy display does not exist to show what the player wants it to be, but rather what the player actually achieves. If a PF player sets a score that's not 100%, their accuracy will drop. They will have to work to get it back to 100% by either setting better scores so it rounds up, or by improving the play itself. That's how the life of a PF player is. I don't believe PF players would want to change it.

And this has nothing to do with cognitive biases. 100% accuracy implies that it is higher than 99.995%, not that 51% of your top plays are SSes.

> I am not sure I follow you. Of course, top plays will often form a single aggregate of accuracy values. However, it will always be possible to split them into two sets according to whether they are lower or higher than the median difficulty value (map max pp).

I meant to say that if you just split your plays in 2, the accuracy values don't really have any meaning; they just show the upper and lower half of your plays. If you're going to split them up, you may as well split them into multiple values, like I said with the boxplot idea.

As for that idea, that's exactly what it was: just an idea. Of course Q0 and Q4 are useless in this case, but you can replace them with the 15% and 85% percentiles. I wouldn't use the midhinge; it doesn't add any value, and the median seems fine in this case.

Slayer95 commented 5 years ago

> Why does it matter whether it is robust or not? If a player sets a score with a really low or really high accuracy, can they not see their accuracy rise or drop by a bit?

Since those scores would be outliers, they ought to be ignored. This issue is a matter of principle, so if you don't concur, let's just agree to disagree.

> Keep in mind that we're only expecting to see values between about 80% and 100% in a user's top plays. The outliers this may produce are not so significant that the global accuracy becomes inaccurate.

That's a good point once the top plays list enters a length=100 steady state. However, going back to my earlier example, those outliers will be significant for at least some new players.

Keep in mind that the health of a game is inextricably linked to the influx of new players, and, in general, neglecting them may be dangerous. (Yes, this is probably an overstatement when applied specifically to the accuracy metric, but it shows my disagreement with your claim that there is no value in accounting for a player that has only played 6 times, given that the population is the user's plays, not the plays of the whole playerbase.)

Since we can provide a better metric for these players, and in the remaining (steady state) cases the average and the median won't be far apart, I'd say we should provide it rather than not.

> The accuracy display does not exist to show what the player wants it to be, but rather what the player actually achieves.

Implying that using the median is "what the player wants it to be" is a mischaracterization. Good metrics, i.e. "statistics", should work as estimators. In this case, the metric used should be able to predict the accuracy of future plays. That's conceptually very different from "what the player wants it to be"; it merely coincides with it because osu! provides the Perfect mod.

Due to its better fitness for that purpose, the median is arguably superior to the arithmetic mean.

> And this has nothing to do with cognitive biases.

Since you insist, I retract that statement.

> 100% accuracy implies that it is higher than 99.995%, not that 51% of your top plays are SSes.

I could be very thick-skinned and claim that both of those thresholds seem equally arbitrary to me, but let's not do that.

As I have already mentioned, the range accuracy metric is intended to fix the "51% issue". Since it's composed of the medians of two partitions of the dataset, theory states that the threshold for a result of 100% ➜ 100% lies between 50% and 75%. If the partition split is near the median of the original dataset, said threshold will approach 75%. In practice, that would be the case for players who are extreme dual-difficulty players and have many SS scores.

Hence, with range accuracy, such a 100% ➜ 100% result would require 75% of the plays to be SS. In my opinion, that's (very barely) acceptable, though you or others may very rightfully disagree.

> I meant to say that if you just split your plays in 2, the accuracy values don't really have any meaning; they just show the upper and lower half of your plays.

You just stated what the meaning of the accuracy values is.

> If you're going to split them up, you may as well split them into multiple values, like I said with the boxplot idea.

I am no stranger to that idea. In fact, that's proposal number 3. However, I am not really sold on it, because using many parameters is:

  1. Overkill in the treatment of accuracy.
  2. Costly in server performance and storage.
  3. Costly in development / maintenance time, especially considering support in the osu! API.
  4. Possibly not even a source of deeper insight into the user's performance. We already have pp!

> I wouldn't use the midhinge; it doesn't add any value, and the median seems fine in this case.

Yeah. I wasn't proposing to arbitrarily change the standard boxplot by using the midhinge rather than the median for its center. What I meant was to explain how a different graph could be made by averaging the range accuracy bounds.


With this, I think we should be more or less clear on what our disagreements are. Therefore, I will (probably) refrain from discussing these points further to avoid noise in this thread.

Thanks for the attention so far!

holly-hacker commented 5 years ago

> Implying that using the median is "what the player wants it to be" is a mischaracterization. Good metrics, i.e. "statistics", should work as estimators. In this case, the metric used should be able to predict the accuracy of future plays. That's conceptually very different from "what the player wants it to be"; it merely coincides with it because osu! provides the Perfect mod.

To me, the purpose of the display is to show the accuracy of the player in past scores (like top plays), not how they will perform in the future. For this reason the median is not a good fit, since it ignores outliers. In my opinion, they should not be ignored.

> I am no stranger to that idea. In fact, that's proposal number 3. However, I am not really sold on it, because using many parameters is:
>
>   1. Overkill in the treatment of accuracy.
>   2. Costly in server performance and storage.
>   3. Costly in development / maintenance time, especially considering support in the osu! API.
>   4. Possibly not even a source of deeper insight into the user's performance. We already have pp!

The boxplot (or other extra metrics) can be calculated on the client from the top plays of the profile, and calculating it isn't expensive at all if only done on the top 100 plays. A userscript could probably do it. Takes maybe 5 minutes to implement.

There is indeed not much else to discuss, I believe all this comes down to preference in the end. Let's see what input others have to give. I personally don't see a need for any change.

abraker95 commented 5 years ago

Thoughts

Edge cases are not likely to occur, but they are still cases we expect some kind of preferred result for. I don't care how little you think they might matter, whether you think the system need not be that robust, or whether it's overkill. A system that functions as desired under anything you throw at it is what we should aim for, nothing less.

The median is prone to not changing if you replace values, or if there is a run of identical values where the median happens to sit. I do expect the value representing the player's series of accuracies to change in response to those cases. Therefore, I do not think the median is a good way to represent the accs the player has. While the average solves both issues the median has, it is prone to an undesired response to extreme outliers, which I believe should be trimmed.

With personal bias out of the way, since the argument ended on a matter of preference, how about averaging a certain percentage of the values around the median? The percentage itself controls how median-like or average-like the representative acc value is, where 0% is equivalent to the median of the entire dataset and 100% is equivalent to the average of the entire dataset.

Code:

```python
from math import floor
from statistics import mean

accs = []      # list of the player's top-play accs (filled from the demo's data)
percent = 0.0  # fraction of the dataset around the median to average over (0 = median-like, 1 = average-like)

accs = sorted(accs)  # sort so the slice below really is the values nearest the median
half_window = 0.5 * percent * len(accs)
lower_idx = floor(0.5 * len(accs) - half_window)
upper_idx = floor(0.5 * len(accs) + half_window) + 1  # +1 keeps the single middle value at percent = 0
representative_acc = mean(accs[lower_idx:upper_idx])
```

Demo here: https://repl.it/@_abrakerabraker/Accs-demo

Some numbers:

( p = percent the representative acc is average-like )

| Player | p = 0% | p = 20% | p = 40% | p = 60% | p = 80% | p = 100% |
| --- | --- | --- | --- | --- | --- | --- |
| BeowulF97 | 82.14% | 82.28% | 82.36% | 82.40% | 81.91% | 82.26% |
| Ego MS | 91.92% | 92.00% | 92.12% | 92.21% | 92.26% | 92.20% |
| gianluca01 | 92.20% | 92.04% | 92.07% | 91.97% | 91.83% | 91.65% |
| Moonbeam | 99.62% | 99.62% | 99.59% | 99.51% | 99.43% | 99.23% |
| IceSandslash | 93.65% | 93.69% | 93.66% | 93.64% | 93.25% | 92.58% |
| Jiandae | 94.37% | 94.38% | 94.37% | 94.31% | 94.25% | 94.03% |
| iRedak- | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 99.99% |
| AngelJuega1 | 94.85% | 94.7% | 94.67% | 94.69% | 94.7% | 94.47% |
| XinCrin | 98.34% | 98.32% | 98.34% | 98.33% | 98.28% | 98.12% |
| Angelxx12 | 94.49% | 94.5% | 94.42% | 94.29% | 94.14% | 93.88% |
| nathan on osu | 99.27% | 99.27% | 99.22% | 99.15% | 99.02% | 98.76% |

Notes/Comments:

Some representative accs fluctuate up and down as they become more average-like. This is totally expected, since more values are included as p grows: the lower or upper ends can "tip the scale", so to speak (the median being the center).

Acc values from players' top 100 pp scores are used. You can confirm the median in the provided demo simply by looking at the 50th value in the acc list (the values are printed in rows of 10).

The code does not fully implement the median (p = 0%): instead of taking the average of the two middle values, it takes the one located at floor(len(dataset)/2). That should be a non-issue as per the demonstration, and if it is an issue, it's probably due to value choosing, in which case that is all the more reason not to use the median.
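For reference, the textbook median averages the two middle values on even-length data, while the demo picks a single index:

```python
from statistics import median

vals = sorted([70, 80, 90, 100])
print(median(vals))          # 85.0 -> textbook median averages the middle pair
print(vals[len(vals) // 2])  # 90   -> the single value the demo picks instead
```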

That said, while most of @Slayer95's median values match what I got, the median for BeowulF97's acc left me confused. The median value is around 81% to 82%, a far cry from the 78.74% he reports. There are some discrepancies in other player values, but this one has the greatest difference between what I got and what he got.

Slayer95 commented 5 years ago

@abraker95, your data doesn't seem right. See e.g. https://i.imgur.com/WNiuX2D.png

Furthermore, that user has played more matches since I posted my own results above. Now I get a median of 80.60%.

Aergwyn commented 5 years ago

> As per my understanding, the accuracy of a user is currently calculated as a weighted average of the accuracies of their top plays, using the same weighting system pp does.

I couldn't find anything substantial about how the user accuracy is calculated in a quick search (except this forum post). If that is true, however, it explains why my accuracy seems to change arbitrarily (warning: personal opinion/bias/observation/etc.).

I'd love to see a change towards something more understandable. Even if nothing happens, the minimum should be a wiki article explaining what is going on with the user accuracy.

Slayer95 commented 5 years ago

Something that I cannot neglect to point out, before others start throwing their own proposals here, is that my proposal number 2, "range accuracy", in fact performs bivariate analysis to explore the relationship between map difficulty and accuracy. Because it uses a median split, it's a very simple computation as well as a simple concept to understand, with just 2 output values! Linear regression doesn't do better (you must report R^2... and you are out of luck if it's low!).


> As per my understanding, the accuracy of a user is currently calculated as a weighted average of the accuracies of their top plays, using the same weighting system pp does.

> I couldn't find anything substantial about how the user accuracy is calculated in a quick search (except this forum post). If that is true, however, it explains why my accuracy seems to change arbitrarily (warning: personal opinion/bias/observation/etc.).

I have verified that empirically, and it does seem to be a bad approach, which is the whole point of this issue. (Note that personal opinions are the way decisions get reached, supported by facts as much as possible.)

abraker95 commented 5 years ago

> @abraker95, your data doesn't seem right. See e.g. https://i.imgur.com/WNiuX2D.png

> Furthermore, that user has played more matches since I posted my own results above. Now I get a median of 80.60%.

Yup, I forgot to factor in misses. I updated the code in the demo linked, but fixing the table will have to wait until tomorrow.