Discussion: Usage examples in documentation / docstrings

thatlittleboy commented 2 years ago

Hi all, just wanted to open a discussion on the state of documentation of this package.

With the most recently merged pull request (#957) and #906, I'm inferring that a decision has been made to remove all Minimal Working Examples (MWE) from the docstrings and move them instead into Jupyter notebooks -- with 1 notebook for each function (?).

Qn: If so, what is the recommended way for a user to study these examples and learn how the pyjanitor functions should be used?

A bit more context on where the question is coming from:

I'm looking to incorporate this package more into my daily workflow, and the existing examples within the API reference have been instrumental to my understanding of what the package offers.

As far as I can tell, examples currently live in 2 locations:

- examples/* in this repository
- the pyjanitor-examples repo and its companion page, https://pyjanitor-devs.github.io/pyjanitor-examples/

After removing the MWEs from the function docstrings (and thus from the API reference) as per #957, is there a plan to link the API reference to the notebook examples in some form? That is, how is a user coming from the API reference page to know that examples exist showing the functionality with sample inputs/outputs?

Take this example from PR #957

The new docstring looks like the one on the right:

I would argue that the docstring on the right is in fact less informative (!!) and the remaining "skeleton" example is essentially useless (sorry for the blunt expression), since it just repeats the function parameters back to the user. (And if the point of the skeleton example is purely to inform the user that there are 3 ways pyjanitor functions can be used -- method-chaining, piping, function -- then I think it is redundant, since this is already mentioned on the home page and there's no need to repeat it in every function docstring.)

But I digress. My main point is that the docstrings, as they are currently being modified (examples removed with no link to or mention of the examples), are confusing to a new user who is just looking to understand what each function is meant to do.

On potential solutions

On the note of "linking" each of the function docstrings to their respective notebook examples, I suppose there are a few ways to design it, with considerations of BOTH the organization of the src code AND the eventual user experience of reading the docs:

1. The old way, which is to keep the examples alongside the docstring (but I suppose this is out of the question now...)
   - Docs, examples, and code are kept together as 1 unit in the source code; easier to maintain, easier to enforce the presence of all 3 when someone contributes new functions to the package.
   - No additional development work required on the infrastructure side to ensure at least 1 simple example is shown alongside the docstrings in the API reference (i.e., the status quo remains).
   - Too verbose(?)
2. Keep the examples as separate files, away from the function source code / function docstring. Host the example notebooks somewhere else, but ensure each function docstring has a link to the corresponding example page (preferably hosted on the same Github pages as the existing API reference)
   - This leaves room for the examples to go as in-depth as required.
   - But I would also like to stress: there is also a danger of being too in-depth. Sometimes, all the user requires is a quick tl;dr explanation of the sample input and expected output.
3. Same as 2 in keeping examples as separate files, but append the examples to the API reference docs, similar to how pandas does it: https://pandas.pydata.org/docs/reference/api/pandas.pivot.html#pandas.pivot
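
To make option 2 a bit more concrete, here is a minimal sketch of one way a docstring could be given a link to its example page. Everything in it is an assumption for illustration: the `link_example` decorator, the `EXAMPLES_BASE_URL` value, and the `rename_key` stand-in function are hypothetical and are not part of pyjanitor.

```python
# Hypothetical base URL for per-function example pages (an assumption,
# not a real pyjanitor setting).
EXAMPLES_BASE_URL = "https://example.org/docs/examples"


def link_example(func):
    """Append a link to the function's example page to its docstring."""
    url = f"{EXAMPLES_BASE_URL}/{func.__name__}/"
    doc = (func.__doc__ or "").rstrip()
    func.__doc__ = f"{doc}\n\nExample notebook: {url}"
    return func


@link_example
def rename_key(mapping, old, new):
    """Return a copy of `mapping` with key `old` renamed to `new`.

    Stand-in function for illustration only.
    """
    result = dict(mapping)
    result[new] = result.pop(old)
    return result


# The API reference would now render the link along with the docstring.
print(rename_key.__doc__)
```

A decorator like this would keep the link format in one place, though a plain hand-written link in each docstring would work just as well.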

I'm personally more in favour of 1 myself, but I suppose I'm in the minority. 😆 I genuinely don't think any of the pyjanitor function examples require a notebook to be explained thoroughly -- after all, they are just syntactic sugar for cleaning / manipulating dfs? I often see notebook examples in the context of explaining ML workflows / how to use a certain NN model (think: pytorch/dgl; training & evaluating ModelXXX on the MNIST dataset).

But barring solution 1, solution 3 seems like a nice middleground (huge fan of pandas' docs), but probably more complicated to implement than 2. If we indeed go for 2, I think we also need a tl;dr section for each notebook; but that's a different issue altogether. Thoughts?

ps: Also don't mean to knock on the efforts made in #957 too much, forgive me 😝 Happy new year all 🎉

samukweku commented 2 years ago

sniping @ericmjl. notebooks are useful for some scenarios, take for instance pivot_longer/pivot_wider. that tl-dr is true though, and i'd suspect that the api for each function should be a good enuf tl-dr. Let's get more input from the rest of the team. @pyjanitor-devs/core-devs

ericmjl commented 2 years ago

@thatlittleboy thanks for chiming in! I actually agree with your sentiments, and I also think I haven't been explicit enough about what I was hoping to accomplish with #957, leading to a bit of stagnation and confusion.

You're right in observing that the docstrings became way too verbose. Additionally, maintaining the functions became difficult as the docstrings started interfering with the readability of the original source file functions.py. In our spare time, we did a major refactor of functions.py into a submodule, in which we tried to keep as close as possible to "one function idea per file". That helped a bit, but coverage of examples in the docstrings was non-uniform. Some were very well fleshed out, while others were not. I think the cause of this was that early on, in the interest of building out the library of functions, I was quick to merge PRs and release new versions without having rigorous checks in place to ensure that all functions were documented to the same degree.

I think the coverage of examples in the library is in big need of a redo, and we can probably do a distributed sprint to make it happen.

As I see it right now, the docs examples should fulfill the following criteria:

- They should fit with our existing choice to go with mkdocs, mkdocstrings, and mknotebooks.
- They should be minimal and executable, completing execution within 5 seconds per function.
- They should display in rich HTML on our docs page.
- We should have an automatic way of identifying whether a function has an example, so that we can ensure every function has one.
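
For the last criterion, one possible approach (a sketch, not an existing pyjanitor tool) is to use the standard library's doctest parser to check whether each public function's docstring contains any `>>>` examples. The names `functions_missing_examples`, `double`, and `triple` are hypothetical.

```python
import doctest
import inspect


def functions_missing_examples(namespace):
    """Return sorted names of public functions whose docstrings have no doctest examples.

    `namespace` is a mapping of name -> object, e.g. `vars(some_module)`.
    """
    parser = doctest.DocTestParser()
    missing = []
    for name, obj in sorted(namespace.items()):
        # Skip private names and anything that is not a plain function.
        if name.startswith("_") or not inspect.isfunction(obj):
            continue
        docstring = inspect.getdoc(obj) or ""
        # get_examples returns one Example per `>>>` prompt it finds.
        if not parser.get_examples(docstring):
            missing.append(name)
    return missing


# Two hypothetical functions: one documented with an example, one without.
def double(x):
    """Double a number.

    >>> double(2)
    4
    """
    return 2 * x


def triple(x):
    """Triple a number (no example yet)."""
    return 3 * x


print(functions_missing_examples({"double": double, "triple": triple}))
# prints ['triple']
```

Run over `vars(some_module)` in CI, a check like this could fail the build whenever the returned list is non-empty. Note that `vars(some_module)` also includes functions imported into that module, so a real check would probably need to filter on `obj.__module__`.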

I did a bit of digging, and I'm still a bit unsure how to satisfy all of the conditions above simultaneously. That said, option number 2 that you mentioned above, namely:

Keep the examples as separate files, away from the function source code / function docstring. Host the example notebooks somewhere else, but ensure each function docstring has a link to the corresponding example page (preferably hosted on the same Github pages as the existing API reference)

seems to be the option that makes the most sense in the short term, and we probably could build up towards option 3 later on using that as a base.

@thatlittleboy would you be open to helping out with executing on option 2? I think we'd need to start first by having one minimal example per notebook.

thatlittleboy commented 2 years ago

Sure @ericmjl , I think I should be able to help with the minimal examples / tl;dr part of the sprint.

So to be clear, the "example notebooks" that we are talking about here are the ones in here, yeah? And it is 1 notebook per function? e.g. bin_numeric.ipynb would be one, add_column.ipynb would be one, and add_columns.ipynb would be another?

ericmjl commented 2 years ago

And it is 1 notebook per function? e.g. bin_numeric.ipynb would be one, add_column.ipynb would be one, and add_columns.ipynb would be another?

Yes, that is right!

If you could give me a day or two to template out the workflow, that'd be awesome. It'll give me a chance to work out potential kinks before we go all-in on this way of handling minimal working examples in docstrings.

ericmjl commented 2 years ago

@thatlittleboy I did a few tests and ultimately found that putting minimal working examples in the docstrings is the best thing to do. We get free integration with doctests & pytest, for example! The examples render well, too.
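
As a rough illustration of what this looks like (a sketch only: `clean_name` and `run_examples` are hypothetical and not taken from pyjanitor), a docstring example written in doctest format doubles as a test that the standard library can execute:

```python
import doctest


def clean_name(name):
    """Convert a column label to snake_case.

    Hypothetical helper, used here only to show the docstring layout.

    Examples:
        >>> clean_name("First Name")
        'first_name'
        >>> clean_name("  Total Cost ")
        'total_cost'
    """
    return "_".join(name.strip().lower().split())


def run_examples(func):
    """Run the doctest examples in `func`'s docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # Build a DocTest from the docstring, exposing the function under its own name.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


print(run_examples(clean_name))
# prints (0, 2)
```

With pytest, the same examples can be collected by running it with the `--doctest-modules` option, so no separate test file is needed for them.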

In my latest PR #971, I made a few infrastructural changes as well to clear up the CI. Once that one gets merged, the other PRs that you've got should merge in the latest dev, and the CI issues should go away!

thatlittleboy commented 2 years ago

Looks great @ericmjl , thank you. I think this is a good direction forward, especially for offering clear, short examples to new users of pyjanitor. 👍🏻

ericmjl commented 2 years ago

@thatlittleboy I'd like to invite you onto the dev team. Can you ping me on Shortwhale so I can send you a link to join the Discord server? http://www.shortwhale.com/ericmjl

thatlittleboy commented 2 years ago

Yep, pinged!

pyjanitor-devs / pyjanitor

Discussion: Usage examples in documentation / docstrings #968
