pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.48k stars 17.88k forks source link

DISC: Validating internal docstring presence, completeness and style #32773

Open simonjayhawkins opened 4 years ago

simonjayhawkins commented 4 years ago

We currently have guidelines/checks for docstrings https://pandas.pydata.org/docs/development/contributing_docstring.html for the api facing methods/functions following the numpydoc docstring guide https://numpydoc.readthedocs.io/en/latest/format.html. since we also use sphinx for building the on-line documentation.

Many of our internal functions don't have docstrings and this could be a barrier to new contributions. For the more complex/undocumented internals, I am concerned that this may also affect the sustainability of the project if only a handful of contributors can grok the code.

The pandas docstring guidelines could be considered too heavyweight for internal functions and therefore may be factor in the incomplete coverage of docstrings for internal functions.

simonjayhawkins commented 4 years ago

To start the discussion, I propose that we adopt google style guide for internal docstrings/code comments. http://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings

We could also considering using pydocstyle http://www.pydocstyle.org/en/latest/ to validate docstrings against this style. (although may involve technically difficulties having two standards)

jorisvandenbossche commented 4 years ago

What aspect of the google style guide do you propose to adopt? Or the full section about docstring comments? If that is the case, I don't really see why we would start using a different format to document parameters for internal functions vs public functions.

simonjayhawkins commented 4 years ago

The driver for adopting a different format, would be primarily to make it easier to have code checks for docstring coverage/quality for all of the codebase not just the public api. Is it feasible to use the current validation on all docstrings?

The google style also includes styles for module docstrings and pydocstyle allows these to be validated.

With the introduction of type annotations, having the types in the docstring is redundant (for internal functions) and duplication can allow inconsistencies if there is no checks to keep them in-sync. In the google style for Args, "List each parameter by name. A description should follow the name, and be separated by a colon and a space." so no need to repeat types.

To enforce a style, we need validation. Feasibility on using our current validation across the codebase could either be discussed here or raise a new issue. There is #15580 but "this issue focused on public functions" https://github.com/pandas-dev/pandas/issues/15580#issuecomment-284344237

simonjayhawkins commented 4 years ago

I would like the outcome of this discussion to be either the current "pandas docstring guide" renamed "pandas api docstring guide" with a second, much more lightweight guide.. "pandas internal docstring guide" or rewrite the current guide to acknowledge the distinction. https://github.com/pandas-dev/pandas/pull/32033#issuecomment-592517783

simonjayhawkins commented 4 years ago

Or the full section about docstring comments?

adopting "section 3.8 Comments and Docstrings" in it's entirety means that "pandas internal docstring guide" is effectively already written with pydocstyle providing the validation.

jbrockmendel commented 4 years ago

Would it be viable to maintain a single standard and gradually expand it to more files/functions?

simonjayhawkins commented 4 years ago

I opened this issue primarily in order to improve coverage, so my motivation is to achieve greater breadth rather than depth. I therefore feel that having a more lightweight standard is more likely to achieve that goal.

I do see issues with having a different standard and therefore opened the discussion with the google suggestion for feedback. I see some middle ground with the single standard approach, by just adopting a subset of the current standard for the internal docstrings.

Potentially, we could then modify and use the current validation on the subset.

We could take the subset approach a bit further changing say the requirement for types where type annotations exist. However, too may exceptions and we are in danger of creating "yet another" standard. In this were to be the case, adopting an exisiting standard, say the google standard may be more preferable.

jbrockmendel commented 4 years ago

so my motivation is to achieve greater breadth rather than depth

I'm not sure I understand the distinction. Can you give an example?

We could take the subset approach a bit further changing say the requirement for types where type annotations exist.

This relates to a topic I've been thinking about: should we treat annotations as "this is what you should pass" or "you can safely assume that this is what is passed"? e.g. in core.indexing we do the latter

simonjayhawkins commented 4 years ago

I'm not sure I understand the distinction. Can you give an example?

def sanitize_array(
data, index, dtype=None, copy: bool = False, raise_cast_failure: bool = False
):
"""
Sanitize input data to an ndarray, copy if specified, coerce to the
dtype if specified.
"""

vs

def is_sequence(obj) -> bool:
    """
    Check if the object is a sequence of objects.
    String types are not included as sequences here.

    Parameters
    ----------
    obj : The object to check

    Returns
    -------
    is_sequence : bool
        Whether `obj` is a sequence of objects.

    Examples
    --------
    >>> l = [1, 2, 3]
    >>>
    >>> is_sequence(l)
    True
    >>> is_sequence(iter(l))
    False
    """

I think that time/effort creating/reviewing docstings for functions that are less complex/more obvious would be better spent on more complex/less obvious functions.

A more lightweight standard would hopefully reduce the time/effort creating/reviewing docstrings.

simonjayhawkins commented 4 years ago

This relates to a topic I've been thinking about: should we treat annotations as "this is what you should pass" or "you can safely assume that this is what is passed"?

There's also another.. "this is what you could pass".

When I'm adding annotations, i go though the function and the inputs of the functions called to try and establish what types are acceptable.

I think there maybe some merit in being more restrictive on the type accepted, i.e. should pass.

This is not the place for that discussion though. If there is not already an issue for this, maybe we should have one.

simonjayhawkins commented 4 years ago

"you can safely assume that this is what is passed"

that is effectively what MonkeyType does. https://pypi.org/project/MonkeyType/

jorisvandenbossche commented 4 years ago

First, I am -1 on using a different (conflicting) style guide for internal docstrings than for external ones. I personally don't see the advantage of that.

I think that time/effort creating/reviewing docstings for functions that are less complex/more obvious would be better spent on more complex/less obvious functions.

That's a valid concern, but how does that relate to the styling of the docstrings? Because you have the feeling that with numpydoc style too much time goes to formatting details? And with google style this would be less?

Note that with numpydoc standard, we can also have a subset of the rules to check for internal docstrings. We don't need to check every detail of it if we don't want to.

And insn't that basically what your PR https://github.com/pandas-dev/pandas/pull/32033 did? It checked some general formatting issues (using pydocstyle) that didn't conflict with numpydoc.

simonjayhawkins commented 4 years ago

That's a valid concern, but how does that relate to the styling of the docstrings? Because you have the feeling that with numpydoc style too much time goes to formatting details? And with google style this would be less?

mainly just against the formatting of args requiring types and conforming to the numpydoc type notation which is a different standard to the python typing notation used for type annotations.

Note that with numpydoc standard, we can also have a subset of the rules to check for internal docstrings. We don't need to check every detail of it if we don't want to.

That's basically the aim of this issue and what i'd like to get consensus agreement on. A more lightweight standard. i.e. not to need to meet all the requirements of numpydoc for an internal docstring. (to apply when reviewing/creating)

And insn't that basically what your PR #32033 did? It checked some general formatting issues (using pydocstyle) that didn't conflict with numpydoc.

basically, yes. using a subset of error codes in the first instance that are compatible with numpydoc.

If you recall there was a review request to update the docs 😉 Hopefully this discussion can provide the content, see https://github.com/pandas-dev/pandas/issues/32773#issuecomment-600089934

However, #32033 didn't cover the type annotation/docstring redundancy issue.

jorisvandenbossche commented 4 years ago

mainly just against the formatting of args requiring types and conforming to the numpydoc type notation

Numpydoc doesn't require types (see https://numpydoc.readthedocs.io/en/latest/format.html#sections, it has an example of a parameter without type), it's only our internal validation that requires this. We could perfectly well decide to relax this.

That's basically the aim of this issue and what i'd like to get consensus agreement on. A more lightweight standard. i.e. not to need to meet all the requirements of numpydoc for an internal docstring. (to apply when reviewing/creating)

I think what would be interesting for the discussion then is to do a concrete proposal of which rules you want to relax or which rules ot keep.

WillAyd commented 4 years ago

First, I am -1 on using a different (conflicting) style guide for internal docstrings than for external ones. I personally don't see the advantage of that.

I share this same concern. There's already a rather large cognitive load to contributions in terms of guidelines and checks that is a lot to remember as a maintainer, and probably off-putting to new contributors.

So OK with validating internal docstrings, but I think should play by the same rules as all other docstrings in the code base

simonjayhawkins commented 4 years ago

There's already a rather large cognitive load to contributions in terms of guidelines and checks that is a lot to remember as a maintainer

Agreed. This issue is intended to address that. whether a solution is to have a more lightweight standard or use the existing standard for internal docstrings, adding automated validation of internal docstrings is key to this discussion and the checks need to be part of the CI so that maintainers don't need to check conformance to docstring style. I should have made that more clear and included this in the OP as part of the sustainability argument.

simonjayhawkins commented 4 years ago

So OK with validating internal docstrings, but I think should play by the same rules as all other docstrings in the code base

If we keep the current standard/validation for docstrings, I wonder whether it is feasible to run validate_docstrings on any methods/functions touched in a PR on CI. I suspect this could be quite disruptive.

simonjayhawkins commented 4 years ago

I think what would be interesting for the discussion then is to do a concrete proposal of which rules you want to relax or which rules ot keep.

In the first instance, I think i'll run validate_docstrings on internal docstrings on a the odd PR from time to time that changes an internal docstring and see what we feel is inappropriate and maybe we can discuss which checks could be relaxed using specific cases, rather than select a subset from the outset. see #32769

WillAyd commented 4 years ago

Did we ever fix our user facing docstrings to adhere to the docstring standards? I know @datapythonista used to run metrics on that; not sure if there is a current list

FWIW I think worth finishing that off before diving into standards for the internal functions, as there is probably lessons to be learned

datapythonista commented 4 years ago

A lot of progress was made, but we're far from done with the public API.

@martinagvilas and @galuhsahid have a dashboard on the current state. Can you share the link please?

simonjayhawkins commented 4 years ago

FWIW I think worth finishing that off before diving into standards for the internal functions, as there is probably lessons to be learned

I think there are plenty of on-line resources and experience shared on how to use pandas.

On the other hand, my concern is from a project sustainability perspective.

Many of our internal functions don't have docstrings and this could be a barrier to new contributions. For the more complex/undocumented internals, I am concerned that this may also affect the sustainability of the project if only a handful of contributors can grok the code.

for me, getting the internal functions documented is just as important as the external api.

jorisvandenbossche commented 4 years ago

for me, getting the internal functions documented is just as important as the external api.

But what is important for those internal functions is that they have a proper, understandable explanation of what they are doing / what they are used for. This is something that needs human review.

On the other hand, most of the things that a style checker does (is there a space after the colon, did it use proper punctuation and capitalization, whitespace, ...), I care about much less for internal docstrings (they also don't need to render properly online). (maybe checking that all parameters are documented is the one thing of an automatic check that would be most useful)

So in that light, I would say: yes, for fully conforming to the docstring validation script, let's first focus on the public docstrings. And at the same time, we reviewers need to pay attention that also internal docstrings have a meaningful explanation.

martinagvilas commented 4 years ago

Here is the link to the dashboard inspecting docstrings errors in the public API that @datapythonista mentioned. The current errors are updated every hour.

simonjayhawkins commented 4 years ago

But what is important for those internal functions is that they have a proper, understandable explanation of what they are doing / what they are used for. This is something that needs human review.

On the other hand, most of the things that a style checker does (is there a space after the colon, did it use proper punctuation and capitalization, whitespace, ...), I care about much less for internal docstrings (they also don't need to render properly online). (maybe checking that all parameters are documented is the one thing of an automatic check that would be most useful)

Agreed, a lightweight set of checks could therefore be that a docstring exists and that if a parameters section is included that all parameters are included and that each have a description.

So in that light, I would say: yes, for fully conforming to the docstring validation script, let's first focus on the public docstrings.

The majority view seems to be that we should have the same standard for internal/external docstrings. https://github.com/pandas-dev/pandas/issues/32773#issuecomment-600235170, https://github.com/pandas-dev/pandas/issues/32773#issuecomment-600367344 and https://github.com/pandas-dev/pandas/issues/32773#issuecomment-600271229.

I guess that we can keep the same standard for internal docstrings even if we don't validate against the complete standard on CI for internal docstrings.

And at the same time, we reviewers need to pay attention that also internal docstrings have a meaningful explanation.

makes sense. we could keep the validation minimal, say for the presence of a docstring and completeness of the parameters section. Conformance to the complete standard is at the discretion of the reviewer.

simonjayhawkins commented 4 years ago

Here is the link to the dashboard inspecting docstrings errors in the public API that @datapythonista mentioned. The current errors are updated every hour.

Thanks for posting @martinagvilas

rhshadrach commented 3 years ago

Thanks @simonjayhawkins for mentioning this on the call today. I'd like to make the following proposal.

jreback commented 3 years ago

+1 on @rhshadrach proposal