pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.27k stars 17.8k forks source link

DOC: Remove docs code header and use explicit imports in the user guide #28038

Open datapythonista opened 5 years ago

datapythonista commented 5 years ago

I've been discussing with the Binder team, and looks like it'll be possible soon to make the pandas examples in the docs runnable directly from the docs.

I think this will be amazing, but I think it's one more reason to start being explicit and self-contained with the docs. One option is of course that the trickiness we have now on importing things translate to Binder, and we automatically inject code for imports (and may be seeds...).

But my opinion is that the right thing is to start being explicit on what we run in the examples.

TomAugspurger commented 5 years ago

we automatically inject code for imports

+1 for removing the {{ header }} stuff in favor of explicit imports.

And also avoid aliasing our own module

I disagree here. Let's keep that as a separate issue.

jorisvandenbossche commented 5 years ago

Also +1 on explicitly doing and showing the imports. We could also see if they can be simplified. Eg randn = np.random.randn could be fixed for all occurences to use the full version. And maybe pd.options.display.max_rows = 15 can be removed with the shorter default nowadays (although if the examples use a length of eg 50, that would still be shown).

And also agree with keeping the alias discussion for a separate issue. That will be easier for this one ;)

datapythonista commented 5 years ago

Sorry for mixing things. I think there is agreement on this one, for the user guide. I was also considering the API, but I'll also open a separate issue for that, so we keep the discussions focused. Thanks for the feedback.

jreback commented 5 years ago

why the move to remove header macro? IIRC this was put in place to make sure that all of the imports are exactly the same everywhere on each doc section

this seems like churn for no gain

datapythonista commented 5 years ago

The code in the header is hidden, and we implemented the union of all the blocks with imports, seeds... to avoid duplication, and because having a huge header file in every file was transparent to the user.

The rendered header added a problem that is that Sphinx doesn't report the right line for the errors anymore (it reports the line on the rendered file, but we look for the error in the template). It was already under discussion whether we wanted to keep it or not.

But the reason for proposing this now, is that the header adds extra complexity to make the code runnable in Binder. In two different ways:

The header surely provides value in avoiding repetition, but IMO it's not worth. It's better to repeat the needed parts, and also be explicit about all what we're doing, since with Binder we'd give the users the change to play with the code, and any magic there will be confusing and unappreciated.

jreback commented 5 years ago

we’ll i still would have a fixed header of imports; having consistency in the docs overrides all other issues ; it easily becomes a mess without enforcement via CI

datapythonista commented 5 years ago

Now we validate in the CI that Sphinx doesn't generate warnings, so if an import is missing the CI should fail.

jorisvandenbossche commented 5 years ago

having consistency in the docs overrides all other issues

Note that this is only for the doc build, as it is about a hidden code block right now. And then @datapythonista is correct that this is now validated in CI that the doc build runs correctly.

Part of the proposal is also to make it no longer a hidden code block. So there we also need to ensure that what we show to the users is somewhat consistent throughout the user guide. But I think we can certainly achieve that without inserting a common header as well.

jreback commented 5 years ago

having consistency in the docs overrides all other issues

Note that this is only for the doc build, as it is about a hidden code block right now. And then @datapythonista is correct that this is now validated in CI that the doc build runs correctly.

Part of the proposal is also to make it no longer a hidden code block. So there we also need to ensure that what we show to the users is somewhat consistent throughout the user guide. But I think we can certainly achieve that without inserting a common header as well.

well this was just changed to make it consistent for each file the concern here is the same as the original

it will get inconsistent thru time across files

so if possible to have a single file with consistent imports that is then objected to the build would be preferable

jorisvandenbossche commented 5 years ago

Your last sentence is not very clear to me.

jreback commented 5 years ago

Your last sentence is not very clear to me.

right was a suggestion if we can have still a single place to import and set options for docs (and still have binder work); as this enforces consistency the best imho

jorisvandenbossche commented 5 years ago

Well, that's more or less the whole point of the issue that Marc opened: not having it in the actual rst file (but injecting it as we do now) is probably possible to have it work with binder, but it is added complexity (need to ensure the header is injected in the binder pipeline as well, as that will probably be independent from sphinx), and hence his preference to just get rid of the injected header (as we did until very recently)

datapythonista commented 5 years ago

This is what we render automatically with {{ header }}:

.. currentmodule:: pandas

.. ipython:: python
   :suppress:

   import numpy as np
   import pandas as pd

   randn = np.random.randn
   np.random.seed(123456)
   np.set_printoptions(precision=4, suppress=True)
   pd.options.display.max_rows = 15

   import os
   os.chdir(r'{}')

randn = np.random.randn is a bad idea, and just used in the release notes of 0.10.0.

I guess the rest is used in almost every page of the user guide.

I think there are two different questions:

  1. Do we want to have a variable for it (like we do now), or do we want to have it in every file
  2. Do we want to hide it , or should it be shown to users

I can see the advantages and disadvantages in both options of 1. But for 2, I think it's vert wrong to hide code, create magic to the users, not let them reproduce exactly our examples, and Binder will make that even much worse.

jreback commented 5 years ago

agree that assigning to randn is just incorrect - that should just be fixed

ideally we would expose this as user code in the docs but how do you propose to keep it consistent? that was an extreme problem before

ideally code review can do this but it has over time dropped the ball it is VERY hard to keep consistently in this large a code base (just look at some of the duplication in tests for example);

anything style related needs to be automatic and in the CI

what if you inline the code and add a style check for this?

jorisvandenbossche commented 5 years ago

Personally, I think you overestimate the "extreme problem" a bit.

Before, the header was much longer because we imported lots of things with non-standard imports, this initial hidden ipython block was a mess. But that was cleaned up in the docs over time, and now (apart from the single randn thing), all example code uses standard imports. The initial code block with imports is thus much smaller.

I personally don't see a problem with repeating this in every file.

(but in any case, it seems there is agreement over Marc's point 2) -> let's show this code block with imports to the users)

datapythonista commented 5 years ago

For now I'm happy to just remove the :suppress: and keep the header. Hiding code from the user is what I think it's a problem, not that much the header itself (sorry I mixed both in the issues)

For the header I'm more open, but I don't think it's that important to be consistent (e.g. having different seeds or different display.max_rows). May be we could have variables for them (e.g. np.random.seed({{ random_seed }})), we would still repeat some code, but we would get correct line numbers for Sphinx errors.

For being inconsistent with the imports, if an import is missing the CI will fail because code blocks will fail and we are now failing the CI for that.

jreback commented 5 years ago

i am fine with changing this but i have seen over time that it is very easy to get inconsistency from well intentioned changes.

WillAyd commented 5 years ago

This PR is well done but I'm not sure I am on board with the concept. Associating each import with its first reference in the rst feels a little arbitrary and I think is just confusing. Seems like it could be somewhat of a nuisance when moving things around; nothing major but in any case a step backwards from just knowing standard imports are done in the header

Are we dead set on going this route?

WillAyd commented 5 years ago

Sorry last comment was in reference to PR #28089 but posted here in issue

ghost commented 4 years ago

Dear @WillAyd , since the matter was discussed on #28089 , can we close this issue ?

WillAyd commented 4 years ago

I think so but @datapythonista I think can close or clarify

datapythonista commented 4 years ago

Thanks @lukelima for the triagging. #28089 removed the {{ header }} from one of the pages, but most of them still have it. So, this issue is still relevant, and can't be closed.