python-sprints / pandas-mentoring

Mentoring new pandas contributors.
BSD 3-Clause "New" or "Revised" License

Create notebook with histogram of number of wrong docstrings #139

Closed datapythonista closed 5 years ago

datapythonista commented 5 years ago

In pandas, many docstrings have known errors, such as undocumented parameters, examples that do not run, and formatting issues.

We have a script that can write all of them to a JSON file (you need a pandas development environment to run it, and it should be run on an up-to-date master branch):

./scripts/validate_docstrings.py --format=json > pandas_docstring_errors.json

After generating the JSON file, we need a Jupyter notebook that opens the file with pandas and shows how many errors of each type need to be fixed. The resulting notebook can be added to a notebooks/ directory in this repo.
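
A minimal sketch of what the core notebook cell might look like. It assumes the JSON maps each function name to a record whose "errors" entry is a list of [code, message] pairs; that layout is an assumption and should be checked against the real file. The entries in `sample_report` are made up for illustration, and `load_report` and `count_errors` are hypothetical helper names.

```python
import json
from collections import Counter

import pandas as pd


def load_report(path):
    """Load the JSON file written by validate_docstrings.py."""
    with open(path) as handle:
        return json.load(handle)


def count_errors(report):
    """Count how many docstring errors carry each error code.

    Assumes each record has an "errors" list of [code, message] pairs.
    """
    counts = Counter(
        code
        for info in report.values()
        for code, _message in info.get("errors", [])
    )
    series = pd.Series(counts, name="docstrings", dtype="int64")
    return series.sort_values(ascending=False)


# Made-up sample standing in for pandas_docstring_errors.json.
sample_report = {
    "pandas.Series.foo": {
        "errors": [
            ["PR01", "Parameters not documented"],
            ["EX01", "No examples section found"],
        ]
    },
    "pandas.Series.bar": {"errors": [["PR01", "Parameters not documented"]]},
    "pandas.Series.baz": {"errors": []},
}

error_counts = count_errors(sample_report)
print(error_counts)
```

With the real file, `count_errors(load_report("pandas_docstring_errors.json"))` would replace the sample, and `error_counts.plot.bar()` would draw the chart in the notebook (this needs matplotlib installed).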

TanyaaCJain commented 5 years ago

I would like to try to work on this issue.

TanyaaCJain commented 5 years ago

@datapythonista Do you want the master branch of pandas to be added to the conda environment, or as a submodule in this repo?

datapythonista commented 5 years ago

No, I'm happy to have just the notebook. If you want, you can add a comment at the beginning saying that running the notebook requires an environment with a recent version of pandas. But I don't think even that is necessary; just the notebook is enough for me.

TanyaaCJain commented 5 years ago

Oh, I assumed you wanted the notebook to update with the changes in the master repository using the CI.

datapythonista commented 5 years ago

We can do that in the future, sounds like a good idea. But I'd start simple, just adding the JSON file (maybe zipped) to this repo, and a notebook that opens it and checks how many errors we have pending.
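
If the JSON file is committed zipped, the notebook could read it straight from the archive. The following is only a sketch using the standard library: the archive name `pandas_docstring_errors.zip`, the helper `load_zipped_report`, and the sample data are all hypothetical.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def load_zipped_report(zip_path, member="pandas_docstring_errors.json"):
    """Parse one JSON member of a zip archive without extracting it to disk."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as handle:
            return json.load(handle)


# Made-up data standing in for the real script output.
sample_report = {"pandas.Series.foo": {"errors": [["PR01", "Parameters not documented"]]}}

with tempfile.TemporaryDirectory() as tmp:
    # Build a small archive so the example is self-contained.
    zip_path = Path(tmp) / "pandas_docstring_errors.zip"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("pandas_docstring_errors.json", json.dumps(sample_report))

    loaded = load_zipped_report(zip_path)

print(loaded == sample_report)
```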

Autogenerating the file sounds good, but I'd recommend never doing anything that complex directly. Always go step by step and build things iteratively. There are a lot of talented people in this group; if you open small PRs and gather feedback at every step, the final result will surely be much better than if you work on something big by yourself. And using the divide-and-conquer approach will also make your life much easier.

TanyaaCJain commented 5 years ago

Maybe, if it is ever planned, the autogeneration of the file could be done directly in the pandas repo. And yes, I agree that doing it step by step with everyone's help would make this task much easier and more effective!

datapythonista commented 5 years ago

Yes, probably cloning pandas master, compiling it, and then running the script would be best.