documentation worker bee?

ellisp commented 7 years ago

I've been struck by several recent threads on Twitter about poor documentation in R packages, including some well-known ones (I don't have examples to hand, but they wouldn't be hard to find). For example, examples don't give much coverage of arguments, descriptions of purpose are vague, argument descriptions are cryptic, there are no vignettes, etc. One quote stuck with me: "most R documentation isn't for users, it's for passing CRAN checks".

Could we do something about this? Either (or both)

pick some important packages with sub-optimal documentation, whose owners are likely to accept a pull request, and fill in some of the missing documentation?
if functionality that helps assess the coverage of documentation is missing (I haven't checked what's already out there), we could develop it (obviously this would help with the other step too).

jonocarroll commented 7 years ago

A package which analyses the current state/coverage of documentation would be very useful, and could perhaps find a home in something like https://github.com/MangoTheCat/goodpractice

For Examples coverage, I don't think it would be too hard to grab the formals for each function and we could think about how to parse an input roxygen or generated Rd file. Figuring out which possible options are available is tricky in the cases where developers don't use the list-default mechanism, but in those cases at least we could show how many have been demonstrated.
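A minimal sketch of the formals-based idea, assuming the examples are already available as strings of R code (parsing them out of roxygen or Rd is not attempted here). The names `example_arg_coverage` and `summarise_scores`, and the example strings, are hypothetical and only illustrate the approach. It only sees direct calls by bare function name, so `pkg::fn()` calls and primitives would need extra handling.

```r
# Which formal arguments of a function are exercised by some example code?
# `fn_name` is the function's name, `examples` a character vector of R code.
example_arg_coverage <- function(fn_name, examples, envir = parent.frame()) {
  fn <- get(fn_name, envir = envir, mode = "function")
  all_args <- setdiff(names(formals(fn)), "...")
  used <- character(0)

  # Recursively walk an expression, collecting arguments of calls to fn_name
  walk <- function(expr) {
    if (!is.call(expr)) return(invisible(NULL))
    if (identical(expr[[1]], as.name(fn_name))) {
      # match.call() resolves positional and partially matched arguments
      matched <- match.call(fn, expr)
      used <<- union(used, names(matched)[-1])
    }
    for (i in seq_along(expr)) {
      # skip empty arguments such as the second index in x[1, ]
      if (!identical(expr[[i]], quote(expr = ))) walk(expr[[i]])
    }
  }
  for (expr in parse(text = examples)) walk(expr)

  demonstrated <- intersect(all_args, used)
  list(
    demonstrated = demonstrated,
    missing      = setdiff(all_args, used),
    coverage     = length(demonstrated) / length(all_args)
  )
}

# Hypothetical function and examples, used only to illustrate the idea
summarise_scores <- function(x, na.rm = FALSE, digits = 2,
                             method = c("mean", "median")) {
  method <- match.arg(method)
  stat <- if (method == "mean") mean else median
  round(stat(x, na.rm = na.rm), digits)
}

examples <- c(
  "summarise_scores(c(1, 2, NA), TRUE)",
  "summarise_scores(c(1, 2, 3), method = 'median')"
)

res <- example_arg_coverage("summarise_scores", examples)
res$missing   # arguments that no example demonstrates

# Where the list-default mechanism is used, the options can be read off:
eval(formals(summarise_scores)$method)
```

Here `digits` is the only argument never demonstrated. Checking which of the `method` options appear in the examples would be a natural next step, but it needs the argument values as well as their names.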

stephstammel commented 7 years ago

This would be of huge assistance to the 'emerging R user' segment - people with no dev background that find it hard to literally decode the implied knowledge in some documentation.

timchurches commented 7 years ago

The idea of filling in documentation gaps is attractive, but with > 10k packages on CRAN and goodness knows how many more on GitHub etc, where does one start?

Perhaps that rhetorical question can be answered, quantitatively?

The goodpractice package mentioned by @jonocarroll is pretty nifty, and it reports on missing documentation inter alia as a nice queryable data object, which when rendered looks like this:

✖ fix this R CMD check WARNING: Undocumented code objects:
    ‘print.sm_response’ ‘print.sm_survey’ ‘surveypreview’
  All user-level objects in a package should have documentation entries.
  See chapter ‘Writing R documentation files’ in the ‘Writing R Extensions’ manual.
✖ fix this R CMD check WARNING: Codoc mismatches from documentation object 'surveyresponses':
  surveyresponses
    Code: function(survey, response_format)
    Docs: function(survey)
    Argument names in code not in docs:
      response_format

So, it would be possible to walk through the packages on CRAN (or some subset thereof) or GitHub or rOpenSci, and compile metrics on missing documentation (and other issues) for each package. Even the number of lines in all the manual entries for each package would be useful, as well as the number of functions, classes etc in each package, so that some rough metric of lines of documentation normalised by package "size" or complexity could be compiled for each package. With such a normalised metric in hand, it would be possible to rank packages for attention (to documentation etc). Weighting the ranking by popularity of packages would arguably be a useful refinement (maybe) for focussing effort.
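As a rough sketch of what such a normalised metric might look like, the following counts lines of Rd source per exported object for a few installed packages using only base R. The function name `doc_metrics` and the choice of "lines per export" as the normalisation are assumptions for illustration, not an established measure. The undocumented-object and codoc checks shown above are available separately through `tools::undoc()` and `tools::codoc()`.

```r
# Rough documentation-volume metrics for one installed package
doc_metrics <- function(pkg) {
  rd_db   <- tools::Rd_db(pkg)          # parsed Rd help pages
  exports <- getNamespaceExports(pkg)   # names of exported objects

  # Count lines of Rd source across all help pages
  rd_lines <- sum(vapply(rd_db, function(rd) {
    src <- paste(as.character(rd), collapse = "")
    length(strsplit(src, "\n", fixed = TRUE)[[1]])
  }, integer(1)))

  data.frame(
    package          = pkg,
    n_exports        = length(exports),
    n_rd_files       = length(rd_db),
    rd_lines         = rd_lines,
    lines_per_export = rd_lines / length(exports)
  )
}

# Compare a few packages that ship with R, least documented first
metrics <- do.call(rbind, lapply(c("stats", "utils", "tools"), doc_metrics))
metrics[order(metrics$lines_per_export), ]
```

Weighting by popularity would mean joining download counts onto `metrics` before ranking. Line counts say nothing about quality, so this would only be a first filter for where to look.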

Has this already been done?

There's packagemetrics from rOpenSci17, which has similar but broader aims, but is focussed on the common and often vexing problem of comparing a number of packages which are candidates for use for a specific task or range of tasks.

Then there are also discussions around navigating the R package universe from the 2017 useR! conference (and see also https://juliasilge.com/blog/navigating-packages/ and https://juliasilge.com/blog/package-guidance/ ).

None of these seem to mention the systematic collection of metrics to focus effort on improving documentation. Maybe such metrics make for invidious comparisons, although the intent would not be to criticise or shame packages (or package owners), but rather to motivate, focus and inform effort on improving their documentation? Maybe such metrics don't even have face-validity? Maybe evaluating 10k packages on CRAN is too large a task, computationally (although CRAN is mirrored on AARnet in at least two places, and thus pulling the source code versions of packages and evaluating them should be quite fast, and it is a task which is trivially parallelisable...).

@stephdesilva wrote:

"This would be of huge assistance to the 'emerging R user' segment - people with no dev background that find it hard to literally decode the implied knowledge in some documentation."

I agree, some of the R documentation (particularly the older, core stuff) seems almost wilfully and perversely obscure, even obfuscated.

I wonder if some sort of readability metric could be developed for R documentation? It might even be possible to develop a useful ML model to assign a documentation readability metric to each package, although that would require a lot of labelled training data. If a sufficient number of documentation readability raters could be recruited from amongst the international R community, then a crowd-sourced set of ensemble readability metrics for enough packages might be able to be assembled to train such a model? Random assignment of packages to volunteers would be needed to prevent gaming. Then all 10K+ CRAN packages etc could be assessed for documentation readability, perhaps with continuous model refinement based on feedback on the derived metrics (with guards against gaming of the feedback system)?
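Short of a trained model, a classic formula-based score could serve as a baseline. The sketch below is not the ML approach described above: it swaps in the Flesch Reading Ease formula with a crude vowel-run syllable count. The function names and the two sample sentences are hypothetical, written for illustration rather than quoted from any help page.

```r
# Crude syllable heuristic: each run of vowels counts once, minimum one
count_syllables <- function(words) {
  runs <- gregexpr("[aeiouy]+", tolower(words))
  pmax(1L, vapply(runs, function(m) sum(m > 0), integer(1)))
}

# Flesch Reading Ease: higher scores mean easier text
flesch_reading_ease <- function(text) {
  text <- paste(text, collapse = " ")
  sentences <- unlist(strsplit(text, "[.!?]+"))
  sentences <- sentences[nzchar(trimws(sentences))]
  words <- unlist(strsplit(text, "[^A-Za-z']+"))
  words <- words[nzchar(words)]
  206.835 -
    1.015 * (length(words) / length(sentences)) -
    84.6 * (sum(count_syllables(words)) / length(words))
}

# Hypothetical argument descriptions, for illustration only
plain <- "Fit a straight line to your data. Give it a formula and a data frame."
dense <- paste(
  "An object of class formula: a symbolic description of the model to be",
  "fitted, the details of model specification being given under Details."
)

flesch_reading_ease(plain) > flesch_reading_ease(dense)
```

General-purpose readability formulas were not designed for text full of code identifiers, so scores like this would need validating against human ratings before being used to rank packages.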

njtierney commented 7 years ago

Great topic!

There has been some discussion of package documentation at this year's unconf in the USA as well. @gaborcsardi discussed two ways to auto-identify bad doco: http://wooorm.com/readability/ and http://alexjs.com/

But in general, the way I see it there are two steps to this:

Find the documentation that needs fixing
Fixing the actual documentation

Finding the documentation that needs fixing

automatigically

Automatically finding the doco that needs fixing is a great idea. I think that there is definitely some possibility of flagging documentation using ML, http://wooorm.com/readability/, or http://alexjs.com/ , or perhaps even looking at the grammar with the gramr package by Jasmin Dumas @jasdumas. So, one outcome of this is some R package that has functions that perform some modelling to flag / identify the packages that need documentation improvement. Perhaps giving it a score.

manually

We can collate together some examples we've listed from our own experience. My favourite go-to example is prop.table.

Fixing the documentation itself

I see three pieces here:

The package is on github

If an R package is maintained on GitHub, it is relatively straightforward to fork it, update the doco, and submit a pull request. So, an outcome here is to submit a pull request to a github repo, perhaps improving the wording, grammar, or adding a vignette.

The package is on R-Forge or just on CRAN

This involves contacting the maintainer directly over email and sending them the updated documentation. Not sure if R-Forge has something fancy here. An outcome could be updated documentation and an email to the maintainer.

The package is built into base or a base package

For example, prop.table is part of base R.

Here, you would need to submit a patch, which I haven't done, but I believe @earowang has done this. A key outcome here would be to document the process of doing this, describing the good, the bad, and the ugly. Perhaps a blog post could then be written about how best to do this.

Next steps?

Overall, I think that this is a super important thing, and something that a lot of new users definitely struggle with. We can also start thinking about looking ahead and even submitting an R Consortium grant along the lines of improving the documentation in general.

stephstammel commented 7 years ago

I think the segment of the R community we want to fix the documentation for is quite relevant here - our 'emerging R users' or 'wanting to be former excel users' (certain level of overlap there) have different needs to those who are firmly entrenched in R. (Nothing wrong with wanting to fix it for everyone - but a segmentation approach may create smaller projects that can be handled individually.)

For those early-stage users, the simple fact that there may not be a vignette or enough examples may be enough to push the documentation of the package into the red zone for them, but they use fewer critical packages. For this segment, really detailed vignettes with lots of examples or linked outside documentation (e.g. blog posts) would be very useful.

Whereas those more firmly embedded in the ecosystem use a huge array of packages that would need coverage, but have a higher tolerance for missing vignettes/vague documentation.

njtierney commented 7 years ago

Tagging @rdpeng here, one idea could be to focus on those packages that are most used and are most stable? E.g, some base rstats packages.

Another thing to consider is that some functions that are widely used might not have vignettes - for example, lm and glm are in the stats package, which is large, and in some sense it makes sense that it doesn't have a vignette - there would be a lot! What can we do there?

stephstammel commented 7 years ago

I'd suggest for the 'mega packages' we could produce a series of examples. For instance, there are two specific classes of 'new users' to the lm and glm functions: new to models, new to R. There's the usual intersections and so on there.

'New to R' would benefit from a different set of examples to those who are 'New to models', but maybe have a good background in compsci. The aim here shouldn't necessarily be to teach modelling per se, but these are two different sets of needs. The latter might be well served by simply adding a couple of 'find out more about...' links into the examples.

rdpeng commented 7 years ago

I have two thoughts here:

We could go around and have everyone name a function that they use a lot but needs improved documentation (e.g. my vote would be glm()). I bet we'd get a diverse set of functions there. I think if someone were to write a vignette titled "The ins and outs of lm() and glm() in R" it would be well-appreciated.
Look at CRAN package download stats and find the most downloaded "user-facing" packages (i.e. maybe skip Rcpp) and go through them to pick some high priority cases.

kimnewzealand commented 7 years ago

You can also break this package documentation issue down a bit further:

A. Existing functions or packages

Finding

@rdpeng Another suggestion for the finding is to use twitter and reach out either before or during the unconference. This also generates some fodder for the twitter storm, issue #34. https://github.com/ropensci/ozunconf17/issues/34.

Looking at the documentation as a function and package user, without much knowledge of how they are created and CRAN check requirements (yet :) ), the "Examples" as mentioned by @stephdesilva are really key to using it versus abandoning it. The first "Example" could be the simplest form with the mandatory arguments, for the new package/function user, e.g. plot(cars), which quickly generates a plot. This is the form that you would use first to see if this is what you need, and if so, you could go on to add parameters by looking through the rest of the documentation, "Examples" or vignettes.

My other comment: in scripts it seems to be expected that code comes with line-by-line commentary, so could there be more commentary in the "Examples"?
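To make that concrete, here is one way an "Examples" section might be laid out, starting from the simplest call and adding arguments with a comment on each step. This is only a sketch of the layout being suggested, not how the plot help page is actually written.

```r
# 1. Simplest call: only the mandatory argument, defaults for everything else
plot(cars)

# 2. The same plot with labelled axes and a title
plot(cars,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Stopping distance against speed")

# 3. Add a straight-line fit on top of the existing plot
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "red")
```

A new user can run step 1, see whether the function does what they need, and only then read on.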

Fixing

Prioritise, then Plan and Fix such as above @njtierney @timchurches @jonocarroll

B. New packages

For the new function and package creators going forward, what will they use as their reference point for their documentation: the mega packages? the CRAN requirements? online tutorials? Is this reference point and precedent something to think about while working through A?

And for new function / package users I had an idea for a new dynamic tutorial package that is a cross between a youtube, a shiny app and something else.

smwindecker commented 7 years ago

Even with some programming under your belt, let alone for more novice users, some package documentation is relatively incomprehensible. I love the idea of writing up some vignettes for common functions such as lm(), but perhaps, as @kimnewzealand mentioned, in order to make this contribution a bit more lasting (as fixing a few will only be a drop in the bucket) we could also aim to have an output be a 'best practice guide' of sorts, establishing levels of detail for new package examples and vignettes.

stefaniebutland commented 7 years ago

Love the ideas here. Depending on whether this is pursued and what progress is made, rOpenSci might be interested in hosting a blog post about this. We definitely like to promote the importance of good documentation. Ping me here after the ozunconf if you're interested.

Flavours of unconf blog posts here: https://ropensci.org/tags/unconf/

MilesMcBain commented 7 years ago

One problem I see with R's documentation more generally is that it doesn't follow established norms in software programming. So if you've done programming in any other language, you're going to have to learn the way R does documentation before you can use R. That's an annoying bit of resistance to new users' learning at a crucial point in the learning curve. An example of something 'normal' would be like: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array - The key thing this is doing better than R, for me, is it shows you all the related methods for the 'package' or class in this case. The convention is to list these down the left hand side of the page. Java, Ruby, C++ etc all have docs that do this.

I think the lack of wider context in R help plays into other quirky R conventions like the pack_ prefix so that an autocomplete list can be used to list all a package's methods. Another response to this strangeness is pkgdown: with this you can get much closer to traditional software help, but you don't have an easy way to discover it or access it from within R. I've dipped my toe into this problem recently with mmmisc::webdocs(<pkgname>) which will navigate you immediately to the online help for a package if it exists. A much nicer interface would be something like webhelp(<pkg_func>) to take you to the online help for a specific function.

I am wondering if we can sidestep the whole getting R-Core to update documentation thing by creating a modern web-help-driven API/workflow. So we reproduce core documentation online, where it is easily augmented with additional information to add clarity. You could think of @jonocarroll's Back to the Source as a proof of concept for this.

njtierney commented 7 years ago

Awesome points, Miles! I think that the idea of creating core documentation online is a good one, but would require some careful thought to ensure it can be trusted and extended easily.

A related effort: @ColinFay has converted official R documentation into bookdown format. You can see them here.

jonocarroll commented 7 years ago

Some related background: I was looped into the R Documentation Task Force (an R Consortium funded project) meetings where the goal was to improve the documentation system by making documentation a linked/stored part of the object. This may or may not have stalled a little, and I'm not entirely sure I was keen on the direction in which it was going, but it sounds like R Core may have a bigger role to play in making documentation easier/better e.g.

https://lists.r-consortium.org/pipermail/rconsortium-wg-dtf/2017-July/000092.html

in the form of annotations. Martin's comment in the mailing list refers to Jan Vitek's presentation here: https://www.r-project.org/dsc/2017/slides/Annotations_for_R.pdf where he details a proposal for annotations which would allow type contracts and documentation (and presumably checking of the overlap between documentation and formals). It can be seen in action here: https://github.com/aviralg/r-3.4.0.git if you're willing to install a full new R source.

I mention all of this because I've seen too many "I'll just do the documentation myself" examples

https://www.rdocumentation.org/ (which seems to be what @MilesMcBain is proposing)
https://mran.microsoft.com/packages/
http://finzi.psych.upenn.edu/
https://rdrr.io/r/
(I hadn't realised they shut it down, but...) https://stackoverflow.com/documentation

I'm all for a goodpractice-esque project which helps one see what coverage their own documentation has as they write it. You can lead a horse to water and all that, but there will always be packages with CRAN-compliant-but-terrible documentation.

My BTTS package was aimed at closing the gap between documentation and source, but that's only useful if you can parse the source better than the documentation.

MilesMcBain commented 7 years ago

Yeah but they're no good, MY documentation solution will....

Yeah okay. I take your point. The fact there are so many of these suggests my feelings are widespread. So maybe the sidestep is a dead end. Can we make it really easy to submit a documentation change?

jonocarroll commented 7 years ago

Obligatory XKCD:

standards

ropensci / ozunconf17