Software Paper Review: Suggestions for Clarity and Completeness

Naeemkh commented 11 months ago

Please consider the following in drafting the software manuscript:

Provide more contextual information about the problem you are solving in the paper and software, targeting software engineers and researchers who may not have specialized knowledge in the domain.
There is a typo in the summary: know --> known.
For claims about the software being "Faster" and "accurate", please offer supporting evidence like benchmarks, examples, or descriptions of the steps taken to achieve these qualities.
Some of the limitations of the software are mentioned in the FAQ section of the documentation. It would be beneficial to dedicate a discussion section in the paper that covers what the package can and cannot do, as well as future development plans.
The paper should give appropriate credit to the OpenSkills.js package.
Please incorporate at least one example in the paper that demonstrates how to use the package, enabling readers to start using it quickly.

This issue is related to this submission: https://github.com/openjournals/joss-reviews/issues/5901

vivekjoshy commented 11 months ago

Thank you. I will make these changes as soon as possible.

matt-graham commented 10 months ago

Further to @Naeemkh's comments above, here are some additional comments and suggestions on the updated version of the paper in #116.

Summary

The summary section needs to be made more suitable for a diverse and non-specialist audience, with some context about the problem. What specifically do you mean by 'Online ranking', and what specifically is a rank in this context? I would add a couple of sentences to introduce the problem being considered for readers who may have no prior knowledge of online gaming, for example:

Online gaming communities will typically assign ranks to players based on the outcomes of the games they play, with higher ranking players expected to exhibit higher skill in games. These ranks are used when matching up players and teams for new games, with an aim of ensuring games remain competitive with not too large a disparity in player or team skills.

Ideally avoid short forms like 1v1 which will not necessarily be clear to all readers.

There are several vague, unreferenced or otherwise unsupported claims that should either be better justified or made more specific:

'most ranking systems are designed for 1v1 games' - what do you mean by 'most' here? I would suspect it would be very difficult to catalogue all proposed ranking systems and so make an assertion either way with regards to how many support games with more than two players. If you mean there are a few well established and commonly used ranking systems that are only suitable for two-player games then it would be better to explicitly name and reference these.
'do not scale well to multiplayer games' - scale in what regard? That the computation required to update the skill ratings / ranks becomes infeasible? That the memory required to store and compute the rankings becomes too large? That the rankings become non-informative / unreflective of perceived player skill?
'If they are designed for that purpose' - this is difficult to parse because of the use of the demonstratives 'they' and 'that' - it would be clearer if you phrased this more explicitly, for example: 'If ranking systems are designed for the purpose of ranking multiplayer games'
'the algorithms available are inefficient' - similar to the comment about scaling above, what specifically is meant by inefficient here? Computationally inefficient in terms of processing time / memory requirements? Statistically inefficient in terms of giving poor estimates of player and team skills for a given amount of observed game outcome data? And if referring to computational efficiency, do you mean algorithms or implementations of those algorithms?
'faster than and just as accurate as proprietary algorithms' - 'faster than' and 'accurate as', while a bit more specific than the previous comparatives, are still vague - faster on what task (computing rankings from a set of game outcome data, making predictions and/or matching players given a set of estimated rankings?) and using which implementations and what hardware? Accurate in what metric and what is being used as ground truth? Which proprietary algorithms? Aside from the vagueness, this is quite a strong claim and needs some evidence supporting it.

Statement of need

'MMO' needs defining on first usage.

'Similar to TrueSkill' - @herbrich2006trueskill reference should probably come here at first mention. Ideally you should also give a very brief (one or two sentences) overview of what TrueSkill is for context.

'OpenSkill offers a pure Python implementation of their models' - from a brief read through, it seems the main contribution of Weng and Lin (2011) is a methodology / algorithm for approximating the Bayesian updates to player skill estimates given outcome data for a series of previously proposed probabilistic models for ranked data. I think it would therefore be clearer to say something like 'OpenSkill offers a pure Python implementation of their Bayesian approximation method for probabilistic models of ranked data' or 'approximate Bayesian inference algorithm for estimating the parameters of probabilistic models of ranked data'.

'designed for asymmetric multi-faction multiplayer games' - I would say defining what is meant by asymmetric and multi-faction here would be helpful.

'However OpenSkill boasts several advantages over proprietary models like TrueSkill' - 'TrueSkill model' is not entirely clear here, as the TrueSkill framework combines both a specific probabilistic model and an expectation propagation based approximate Bayesian inference algorithm for updating the model parameters. Likewise, OpenSkill is an implementation of a specific algorithm for estimating the parameters of several different probabilistic models. Ideally you should make clear whether a claim is with regards to the models used, the algorithm or the specific implementation.

For the claim of 'significantly faster rating updates, with performance gains of up to 150%' - what is being compared here and on what data? A specific implementation of TrueSkill (the Python trueskill package?) against OpenSkill(.py) with a specific model? If so, how can we be sure any performance differences are not due to inefficiencies in the particular implementation of TrueSkill used? The specific figure of 'up to 150%' is also not that clear - is that a speed-up by a factor of 1.5? The benchmark results included in the updated paper suggest more like a factor of 3 speedup.
For the claim 'OpenSkill features a more permissive license, ' - again what is being compared to here? The code of the Python trueskill package documented at https://trueskill.org/ for example is released under a BSD license which is similarly permissive to the MIT license of OpenSkill. While I realise that the TrueSkill name itself is trademarked, and there is a patent for the corresponding system which limits its use to non-commercial applications, that is not made clear from what is written in the paper.
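
To illustrate why the 'up to 150%' figure is ambiguous, here is a minimal sketch of the two common readings of such a percentage and how each maps to a runtime ratio. The function names and interpretation labels are hypothetical and are not part of OpenSkill or the trueskill package.

```python
def speedup_factor_from_percent(percent: float, interpretation: str) -> float:
    """Convert a quoted percentage into a runtime ratio (baseline time / new time)."""
    if interpretation == "percent_of_baseline_speed":
        # "150%" read as "runs at 150% of the baseline speed" gives a factor of 1.5.
        return percent / 100.0
    if interpretation == "percent_faster":
        # "150%" read as "150% faster than the baseline" gives a factor of 2.5.
        return 1.0 + percent / 100.0
    raise ValueError(f"unknown interpretation: {interpretation!r}")


def percent_faster_from_times(baseline_seconds: float, new_seconds: float) -> float:
    """Express a measured pair of runtimes as a 'percent faster' figure."""
    return (baseline_seconds / new_seconds - 1.0) * 100.0
```

Under either reading, a measured factor of 3 speedup would not be described as 150%: it corresponds to '200% faster' or '300% of the baseline speed', so stating the ratio of runtimes directly avoids the confusion.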

Benchmarks

This section should explicitly state which TrueSkill implementation is being compared to (from the code I believe this is the Python trueskill package available on PyPI). It would also be worth at least mentioning (as already discussed above) that the differences in performance seen here may be partly down to the relative efficiency of the implementations.
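
As a rough sketch of how such a comparison could be reported reproducibly, the snippet below times two arbitrary callables with the standard library `timeit` module and returns the ratio of their best per-call times. The helper names `best_time` and `relative_speedup` are hypothetical, and the callables passed in would need to wrap whichever rating-update calls are actually being compared; nothing here assumes a particular OpenSkill or trueskill API.

```python
import timeit
from typing import Callable


def best_time(fn: Callable[[], object], number: int = 1000, repeat: int = 5) -> float:
    """Return the best observed per-call time in seconds over several repeats."""
    # timeit.repeat runs `fn` `number` times per repeat and returns total times.
    timings = timeit.repeat(fn, number=number, repeat=repeat)
    return min(timings) / number


def relative_speedup(
    baseline: Callable[[], object],
    candidate: Callable[[], object],
    number: int = 1000,
    repeat: int = 5,
) -> float:
    """Ratio of baseline time to candidate time; values above 1 favour the candidate."""
    return best_time(baseline, number, repeat) / best_time(candidate, number, repeat)
```

Reporting the package versions, the model used, the hardware and this ratio together would make clear that the comparison is between two specific implementations rather than between the underlying algorithms.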

'Using a dataset of overwatch matches and player info' - 'overwatch' should be capitalized to make clear it is a proper noun, and either a citation to a URL explaining what it is or a footnote / inline comment giving some context should be included. 'info' should be written in full as 'information':

'Using a dataset of Overwatch [citation to a URL or document explaining what Overwatch is] matches and player information'

'predicts the same number of matches as TrueSkill' - I think something like 'gives a similar predictive performance to TrueSkill' would be better here.
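
To make 'predictive performance' concrete, one simple option is the fraction of held-out matches whose winner is predicted correctly. The sketch below assumes each match has a single winning team identified by an integer index; the function name and data layout are hypothetical and are not taken from the paper's benchmark code.

```python
from typing import Sequence


def prediction_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of matches where the predicted winner equals the actual winner."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one match is required")
    # Count matches where the predicted winning team index is correct.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Reporting a rate like this for each system, along with the number of matches evaluated, is more informative than saying the two systems predict 'the same number of matches', since equal counts do not imply the same matches were predicted correctly.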

Comparison to related packages

You currently do not directly mention any other commonly used software packages for online player rating estimation, or state how OpenSkill.py compares to them in the main paper (there is a brief mention of openskill.js in the Acknowledgements of the updated paper). Given the acknowledgement in the project README that 'this project is originally based off the openskill.js package', this should also be made clear in the paper, along with some overview of the relative advantages of the implementation here (even if this is just being available in Python as opposed to JavaScript). Similarly, some of the implementations in other languages referenced in the README could be mentioned in the paper, as could the Python trueskill package (https://trueskill.org/) and any other similar packages you are aware of.

References

Two instances of 'bayesian' need to be capitalised by adding braces {} in paper.bib.
