numpy / numpydoc

Numpy's Sphinx extensions
https://numpydoc.readthedocs.io/
Other
294 stars 160 forks source link

Review and document conventions that are in use #419

Open OriolAbril opened 2 years ago

OriolAbril commented 2 years ago

I am not sure if here is a good place of if it would be better to open a topic on https://discuss.scientific-python.org/. Happy to move it somewhere else if that isn't the best venue.

I am under the impression that most projects that are part of scientific-python and/or pydata ecosystem follow numpydoc without significant deviations. What varies I think is more the level of success especially if you move outside the core projects. I think one of the reasons for that is that the numpydoc style guide is not a comprehensive collection of all the rules in convention that are actually in use throughout the ecosystem (or even within numpy itself). I think missing conventions are more common, but some of the guidance is also not used or realistic. I am also not sure about how extended or de facto part of numpydoc or numpy/scipy codebase some of these conventions are which makes it hard to know if these are indeed conventions that should be in numpydoc or things that look the same by chance/contributor overlap. I'll add some examples below, but the main question is:

Would an effort to check the current style guide and find common ground with other projects be welcome?

Context: I have been trying for a while to improve the documentation of ArviZ and PyMC libraries and trying to follow numpydoc and rest of conventions used in the pydata ecosystem. I would love to simply link numpydoc, not write a whole new doc extending/reprating numpydoc (with the exception of 2-3 paragraphs at most about project specific htings like aliases defined in numpydoc_xref_aliases) and forcing contributors to navigate multiple docs to write docstrings. I also think that is numpydoc's goal, not to be only numpy/scipy specific, but I am not sure about this and if so it would be great to review it and to make sure it is up to date and as comprehensive as possible. I have added examples that I have gathered during that process (some of them below), but as I said in the beginning, the goal is not so much about the specific examples but about making sure I am on the right page about numpydoc's goals and scope and if so we can find the best way to go about it.


Some examples

Short summary

Its description is:

A one-line summary that does not use variable names or the function name

which is followed by an example using function add(a, b) described as "The sum of two numbers.".

The not use of variable names is generally followed, not using the function name however is not used and I don't think is realistic. e.g. most of the short summary sections in the linear algebra module do use the function name, virtually all methods in the random generator class have a short summary that is "Draw samples from the distribution.". And not using the function/method name would probably be less clear than using it, they refer to technical terms without any synonym available, and they can't be defined in the short summary either.

Application/scope of the doc

I was able to find project specific docstring convention pages for pandas or matplotlib but not for numpy nor scipy, they only had links to numpydoc. Am I right to assume that numpydoc alone is the official convention description for all of numpy and scipy?

There are a couple places where only numpy is mentioned such as the deprecation warning section, but most importantly, there also seems to be numpy infrastructure information (I think) in the style guide like the last couple sentences in the examples section. If I understand correctly, omitting the numpy import is only an option in numpy but not scipy docstrings for example, but I don't really know what to make of the auto-use of the plot directive when matplotlib is imported, is that part of numpydoc?

The parameter section also says "Enclose variables in single backticks" but the behaviour of single backtick enclosure in sphinx is defined by the value of default_role in conf.py. Is that intended? In general this will default to italics, but it would become code formatting for projects using "code" as default role and what might be even worse, could easily be rendered as links to the glossary if using "autolink" as default role. e.g. axis is a common parameter name and a term in numpy's glossary which is probably added to intersphinx by most projects and therefore a valid autolink key; that would mean that in the descriptions, all parameters would be rendered as italics except the axis one that would be a link (I have no background, but the fact that all parameters in https://numpy.org/devdocs/reference/generated/numpy.average.html are correctly enclosed within single backticks except for axis might point to this).

Parameters The array_like alias is used in passing only in the parameters section and explained at the very bottom of the page in the "other points to keep in mind". However, this "type" is key to many docstrings and it seems to have multiple extra conventions associated to it (shape, dtype) which are not documented in the parameters nor other points sections. The page in matplotlib's docs explains some of these conventions which seem to be the ones as used by numpy (in the linalg module for example, the input of the cholesky function is described as a : (…, M, M) array_like). Is this something that should be part of numpydoc or something that is technically matplotlib specific and happened due to maintainer overlap or something like this? Other modules like the polynomial use slightly different convention: array_like, shape (M,), are/should both be part of numpydoc? a bit like the multiple default options allowed

It might also help to add further type examples inside numpydoc itself (maybe they can go inside a dropdown to not take too much space. numpydoc says:

For the parameter types, be as precise as possible.

but there are cases I am not sure what should be the way to go when writing a docstring or don't know what to answer when someone else asks me about it. pandas for example has some extra examples that have been helpful in such situations:

list of int
dict of {str : int}
tuple of (str, int, int)
tuple of (str,)
set of str

and so does matplotlib, with some extra conventions too such as "Use (float, float) to [...] the parentheses should be included to make the tuple-ness more obvious.". Which is also used for example in numpy.histogram to describe the range parameter. Do you know if this parentheses -> tuple-ness link is more extended in numpy/scipy docs? Should it be part of numpydoc or a point of divergence by matplotlib? Is the tuple of (float, float) more common?

namurphy commented 2 years ago

Context: I have been trying for a while to improve the documentation of ArviZ and PyMC libraries and trying to follow numpydoc and rest of conventions used in the pydata ecosystem. I would love to simply link numpydoc, not write a whole new doc extending/reprating numpydoc

This sounds similar to our experience too! We have a documentation guide which includes (or will likely soon include) some special cases that aren't covered in the numpydoc standard. Having more detail in the community standard would also greatly help with linters and auto-formatters.

The parameter section also says "Enclose variables in single backticks" but the behaviour of single backtick enclosure in sphinx is defined by the value of default_role in conf.py.

Good point — I think I'm going to add this to our documentation guide as well. When we do a nitpicky doc build, enclosing variables in single backticks leads to a warning which then will cause our doc build to fail.

Parameters The array_like alias is used in passing only in the parameters section and explained at the very bottom of the page in the "other points to keep in mind".

We do :term:`numpy:array_like` to generate a link to the glossary entry in NumPy's documentation, presumably through the magic of intersphinx.

However, this "type" is key to many docstrings and it seems to have multiple extra conventions associated to it (shape, dtype) which are not documented in the parameters nor other points sections.

It would be incredibly helpful for the numpydoc standard to include how to describe shape information. I was actually about to raise an issue on this very topic when I saw this issue, and will probably need to add this to our doc guide as well.

It might also help to add further type examples inside numpydoc itself (maybe they can go inside a dropdown to not take too much space.

Agreed again: more examples for rare cases would be great!

I wonder if it would be helpful to create a separate issue for each of the topics here, since it'd make it easier to work on them individually.

Thank you for bringing all of this up!

rgommers commented 2 years ago

Would an effort to check the current style guide and find common ground with other projects be welcome?

Yes, I think that would be quite useful. And this repo is the right place for it I'd say.

Am I right to assume that numpydoc alone is the official convention description for all of numpy and scipy?

That is correct.

The parameter section also says "Enclose variables in single backticks" but the behaviour of single backtick enclosure in sphinx is defined by the value of default_role in conf.py. Is that intended?

I think that the person who wrote that used code as the default. Single backticks in numpydoc are first of all used for cross-linking; you'd also use it to link to other functions in the namespace or with intersphinx. I believe that was the intent here, but the machinery was never completed so we ended up with italics in numpy and scipy, which is not great. Improvements welcome; ideally without large-scale reformatting in numpy and scipy though.

The array_like alias is used in passing only in the parameters section and explained at the very bottom of the page in the "other points to keep in mind". However, this "type" is key to many docstrings and it seems to have multiple extra conventions associated to it (shape, dtype) which are not documented in the parameters nor other points sections. The page in matplotlib's docs explains some of these conventions which seem to be the ones as used by numpy (in the linalg module for example, the input of the cholesky function is described as a : (…, M, M) array_like). Is this something that should be part of numpydoc or something that is technically matplotlib specific and happened due to maintainer overlap or something like this? Other modules like the polynomial use slightly different convention: array_like, shape (M,), are/should both be part of numpydoc? a bit like the multiple default options allowed

Problems here include:

It might also help to add further type examples inside numpydoc itself

Yes, this would be super helpful.

rossbar commented 2 years ago

I think one of the reasons for that is that the numpydoc style guide is not a comprehensive collection of all the rules in convention that are actually in use throughout the ecosystem (or even within numpy itself). I think missing conventions are more common, but some of the guidance is also not used or realistic. I am also not sure about how extended or de facto part of numpydoc or numpy/scipy codebase some of these conventions are which makes it hard to know if these are indeed conventions that should be in numpydoc or things that look the same by chance/contributor overlap.

The mental model that I've developed from both working directly on numpydoc and with numpydoc across multiple projects is that the numpydoc style guide comprises both a docstring standard and a style guide. Some of the information in the style guide defines a standard - for example, the docstring sections must have the exact names as prescribed, or else the docstring will not be parsed/rendered properly. However there are other suggestions that are more related to style and have nothing to do with the successful parsing/rendering of a docstring (e.g. "The length of docstring lines should be kept to 75 characters to facilitate reading the docstrings in text terminals." - n.b. this advice is not really followed in any project IME, including numpy and scipy themselves).

To this I would add a third category, which is advice that has become the de facto standard in large scientific Python libraries, but which the numpydoc extension itself does not implement correctly. The most obvious example here is the "Enclose variables in single backticks" advice. As noted in the above discussion and in documentation issues throughout the ecosystem (see e.g. numpy/numpy#17714), this directly collides with the default role in sphinx which leads to excessive broken links among other things.

To summarize, the numpydoc style guide is really (IMO) a confusing mix of standard and style guide and I agree that this is a hindrance to docstring standardization across libraries.

The situation gets even more confusing when you consider the numpydoc.validation module, which implements validation checks for a variety of patterns. Many of these checks are not actually explicitly mentioned in the style guide, nor do they necessarily have a material effect on the style/quality of the docstring (e.g. "SS03", which checks that the short summary ends in a period). OTOH there are some which are clearly valuable for docstring standardization and that are explicitly mentioned in the style guide (e.g. "GL07", which checks that the docstring sections are in the prescribed order). IMO the docstring validation mechanism as a whole can be quite useful (especially now that it can be enabled as part of the doc building process); the checks themselves include many that seem arbitrary and are not necessarily community-vetted --- so much so that coming up with a set of "recommended" validation checks has proven quite difficult.

One proposal for moving forward

I think it would be very useful to review the numpydoc style guide, especially with input from consumer libraries that are outside of numpy/scipy where the standard is largely applied by default, even parts that are not necessarily documented. I'd advocate for:

OriolAbril commented 2 years ago

One proposal for moving forward

I think it would be very useful to review the numpydoc style guide, especially with input from consumer libraries that are outside of numpy/scipy where the standard is largely applied by default, even parts that are not necessarily documented. I'd advocate for:

  • Separating the components that constitute the docstring standard from those that are related to docstring style. IMO the former should be the minimal set of rules required to have compliant docstrings that parse & render correctly. I suspect that this disambiguation would also help other projects that implement the numpydoc standard, e.g. napoleon.
  • Review the recommended documentation practices from consumer libraries to find a concrete set of best-practices, and work to formalize them - with inclusion either in the standard or style guide as appropriate. I think the validation mechanism has a big role to play here as that's a nice way to add programmatic checking of additional rules/patterns.

This sounds great, could you @rossbar or someone who is already familiar with numpydoc internals and validation get started on the first bullet point? I am not sure I'd be of much use here. In the meantime I'll continue gathering info on the styles used where I think I'll be able to help.

For example, regarding single backtings, double backticks (code formatting) or asterisks (italics) both Python and matplotlib use asterisks and it is documented (not sure how enforced though). Therefore, if there aren't significant parts of the numpy codebase using double backticks (maybe from expecting code formatting as @rgommers mentioned but seeing it wasn't working) it might be possible to start using * on all new docstrings and "broken" ones like the changes in the PR you linked and progressively move away from single backticks in this role. I can open specific issues to discuss each of the potential changes/things not being followed and then PRs once there is some consensus

rossbar commented 2 years ago

I'd be happy to review things - @Carreau 's insight would also be very relevant here as I know he has been running into (and fixing) corner cases in many libraries.

progressively move away from single backticks in this role.

Given that the "single backticks for parameters" rule has been extensively adopted, I'd be much more in favor of fixing this in numpydoc rather than moving away from it.

OriolAbril commented 2 years ago

Given that the "single backticks for parameters" rule has been extensively adopted, I'd be much more in favor of fixing this in numpydoc rather than moving away from it.

I agree that using single backticks is quite common, but if that were to become part of numpydoc then we'd need to fix the default_role, otherwise the convention would only be valid for raw docstrings. And the value set for the default role is very heterogeneous. Even taking into account only the SPEC core projects, 4 different values for default_role are used.

autolink or any

NumPy, SciPy and scikit-image use autolink. This renders the text between the single backticks as italics provided sphinx finds no target with the same name. The html class that ends up being used is <em class="xref py py-obj">sum_of_weights</em>. If the argument name matches a sphinx known target, it is instead rendered as code+link formatting, the specific html is:

<a class="reference internal" href="#numpy.average" title="numpy.average">
  <code class="xref py py-obj docutils literal notranslate">
    <span class="pre">
      average
    </span>
  </code>
</a>

which doesn't seem like the expected output as differs from the previous case and in addition links to things that are potentially unrelated such as function names (linking to itself when referencing the return value is common for example) or glossary terms (which can even be on an intersphinx glossary). As I mentioned above using the numpy.average docstring, this looks like an open issue. As you mentioned, the use of single backticks when referencing function arguments is very extended, and all references to function arguments in that docstring use it, with one exception: references to axis argument.

Single backticks are also used to reference the output values. The description of returned for example is Default is `False`. If `True`, the tuple (`average`, `sum_of_weights`) is returned, otherwise only the average is returned.. This is how it looks like in the rendered docs:

image

So in addition to forcing a value for the default_role, if that value is autolink I think an alternative to parameters that clash with existing sphinx targets is needed. And adding new functions or terms on the glossary could force going back to some docstrings to update their text in order to switch from the single backticks to the alternative way.

obj

NetworkX uses obj. This behaves very similarly to the autolink/any option above but is a bit more restrictive. It will only link to python objects documented in the API pages. That would for example not have the issue mentioned above with axis but it would still have the issue with average. That also restricts the use of the single backtics. If the default_role="any" was set to do things like `manual_target` or `glossary term` skipping the :role: part it would not be possible anymore.

Note: matplotlib also uses obj as default role while at the same time using * for referencing arguments:

Function arguments and keywords within docstrings should be referred to using the *emphasis* role. This will keep Matplotlib's documentation consistent with Python's documentation:

If *linestyles* is *None*, the default is 'solid'.

Do not use the default role or the literal role:

Neither `argument` nor ``argument`` should be used.

None

Pandas doesn't set the default_role so the behaviour is the default set by docutils: :title-reference:. This renders as italics, but the html used is now <cite>names</cite>, so this formatting matching the non-reference case above is dependent on the theme and css used.

code or literal

ipython and scikit-learn (since 2019, before they used any and had problems with sphinx targets clashing). This renders as a literal block, using the monospace font. The html used is max_fpr`.

Note: ArviZ also uses literal as default role. We know we are not taking full advantage of sphinx/rst features but most of our team and contributors are much more familiar with markdown than rst, so we do this to match markdown and rst behaviour and lower the friction when contributing docs to both core devs and contributors.

rossbar commented 2 years ago

but if that were to become part of numpydoc

The thing is, it is already recommended by numpydoc and has generally been adopted by numpy, scipy, etc. so is a de-facto standard.

we'd need to fix the default_role, otherwise the convention would only be valid for raw docstrings

I think there's a way to add the necessary functionality to the numpydoc extension without clashing too severely with the default role. numpydoc can be modified to check text enclosed in single backticks against the listed parameters. If the enclosed phrase matches with a parameter from the signature (or attribute/method in the case of classes) then the role for that specific instance can be switched to a numpydoc-specific role for parameters (which would at least provide syntax highlighting, and eventually would be great to include intra-page anchor links back to the parameter description itself). This should be a relatively light-weight check, and can also be introduced via configuration, allowing users to adopt the new behavior/keep the old behavior, or even configure it on a per-docstring basis.

This is something I've been thinking about for a while and I think is a nice alternative to #303 that fixes all the issues and is flexible enough to not cause any major problems/breaks. I haven't had the time to actually implement/test the idea though.

OriolAbril commented 2 years ago

numpydoc-specific role for parameters (which would at least provide syntax highlighting, and eventually would be great to include intra-page anchor links back to the parameter description itself)

This sounds amazing!

I am guessing it would then also be possible to customize the formatting so for example, projects that want so can continue to use code formatting for parameters.

Would that be in use throughout all the docstring or only for parameter descriptions?

And would it be possible for this to work on other pages? (context: I was very confused when first reading https://numpy.org/doc/stable/reference/ufuncs.html and I think it would benefit from something like this even if not technically a docstring)

The thing is, it is already recommended by numpydoc and has generally been adopted by numpy, scipy, etc. so is a de-facto standard.

True, sorry. I am a bit too focused on the edge cases.

Carreau commented 2 years ago

I have to take more time to re-read all of that, but I have been working on trying to auto-fix some of the numpydoc syntax (https://github.com/Carreau/velin), and in my experience, numpydoc allow a bit too much flexibility in its syntax, which lead to a lot of confusion.

Numpydoc is not only used by Sphinx; https://github.com/Jupyter/papyri understand it as well, and things like backtick to link variable works. So I'd love for this discussion to not take only sphinx in consideration. IMHO this should be decomposed into two steps: numpydoc to some semantic repr, and then how sphinx renders this repr. But we should not take into consideration that *...* is better than `...` just because of a specific appearance

One other thing to keep in mind with any change, is that currently there are still a lot of tools that display the raw docstring when users ask for help on a function.

I personally would be happy for the "core" of the numpydoc style guide to be moved to a SPEC, so that each project can link to it.

adam-grant-hendry commented 2 years ago

One thing I would like to change is the convention that the short summary end in a period (I see this in Google-style docstrings too). Many times the summary is a fragment rather than a complete sentence. From a technical writing perspective, this feels sloppy, but grammar tools also complain about this.

Examples:

"""The sum of two numbers."""
"""Example NumPy style docstrings."""

The above read like headings more than complete sentences. To make them complete, one would have to add qualifiers such as

"""This method returns the sum of two numbers."""

and

"""These are example NumPy style docstrings."""

both of which feel redundant (or may get you over the dreaded 80-character line limit).

Since it's required to separate the summary from the body with a new line anyway, do we have to use a period? Or should the convention be that the short summary is a complete sentence? I'm fine with either option, but I would prefer the guidelines recommend something to establish consistency.

rossbar commented 2 years ago

One thing I would like to change is the convention that the short summary end in a period

Agreed. AFAICT this isn't recommended in the style guide, but there is a validation check for it. However I would treat the validation check as something users can adopt if they want, but not necessarily recommended.

adam-grant-hendry commented 2 years ago

A second thing I would like to gain consensus on is whether docstring guides should use python typing types and corresponding annotations because I see a lot of misuse and inconsistent descriptions for types. For instance, I see this all over in pandas:

"""
Parameters
------------
param: description instead of type
"""

e.g. pandas/compat/pickle_compat.py

def load(fh, encoding: str | None = None, is_verbose: bool = False):
    """
    Load a pickle, with a provided encoding,

    Parameters
    ----------
    fh : a filelike object
    encoding : an optional encoding
    is_verbose : show exception output
    """

should instead be

def load(fh, encoding: str | None = None, is_verbose: bool = False):
    """
    Load a pickle, with a provided encoding,

    Parameters
    ----------
    fh : Pathlike
        a filelike object
    encoding : str or None, default=None
        an optional encoding
    is_verbose : bool, default=False
        If ``True``, show exception output
    """

the above is still not fully type hinted (e.g. fh should have fh: Pathlike and the method actually returns an Unpickler.load(), but there is no return type in the method signature or docstring.

Inconsistent uses like this make linting challenging (see pylint PR #7360), but as a user I don't know what to put in for "a filelike object" (in the latter case, I can at least look up what a Pathlike object is).

I also think we should allow using:

param1: list[str]
param2: int | bool | None
param3: tuple[int, int, int]

in addition to

param1: list of str
param2: int, bool, or None
param3: tuple of int

In the last case (param3), its hard in the current notation to express things like "param3 is a 3-tuple". With the advent of the many various new PEPs for type annotations, we are beginning to build up a typing syntax that has a very explicit meaning, so it seems we should be using that so all end users know what is being talked about.