pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Fix capitalization among headings in documentation files #32550

Open tonywu1999 opened 4 years ago

tonywu1999 commented 4 years ago

In #26933, we made the capitalization of titles consistent. For example, a title used to be capitalized like "This is the Section Title", and many of the titles in the pandas documentation were changed to the correct format, like "This is the section title".

In #31114, we made a script called scripts/validate_rst_title_capitalization.py that extracts all titles in the documentation and checks that only the first letter of each title is uppercase, apart from words defined in a short list of exceptions, like Series, DataFrame, etc. The script also outputs how to fix each title.
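To make the rule concrete, here is a minimal sketch of the sentence-case check. It is not the code from the script: the function name `suggest_title` and the three-word exception set are assumptions made for illustration only.

```python
# Hypothetical, shortened exception list; the real script keeps its own.
CAPITALIZATION_EXCEPTIONS = {"Series", "DataFrame", "Python"}


def suggest_title(title: str) -> str:
    """Return the title in sentence case, leaving exception words untouched."""
    fixed = []
    for position, word in enumerate(title.split(" ")):
        if word in CAPITALIZATION_EXCEPTIONS:
            # Exception words keep their original capitalization.
            fixed.append(word)
        elif position == 0:
            # Only the first word of the title starts with an uppercase letter.
            fixed.append(word[:1].upper() + word[1:].lower())
        else:
            fixed.append(word.lower())
    return " ".join(fixed)
```

A title is considered correct when `suggest_title(title) == title`; otherwise the returned string is the suggested fix.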

We want to validate that capitalization is correct by integrating this script into CI (continuous integration). The idea is that we should run this script through ci/code_checks.sh, and when title capitalization errors show up on CI, the user should fix those errors in the specified files.

To verify the code is working on your side, the command below instructs the program to validate the doc/source/development/contributing.rst file. There should be no output from this command, as this file has no capitalization errors:

./scripts/validate_rst_title_capitalization.py doc/source/development/contributing.rst

The command below instructs the program to validate both the doc/source/index.rst and doc/source/development/policies.rst files.

./scripts/validate_rst_title_capitalization.py doc/source/index.rst doc/source/development/policies.rst

This command produces the output below:

doc/source/development/policies.rst:9:Heading capitalization formatted incorrectly. Please correctly capitalize "Version Policy" to "Version policy" 
doc/source/development/policies.rst:51:Heading capitalization formatted incorrectly. Please correctly capitalize "Python Support" to "Python support"

The goal of this issue is to correct the title capitalization of all files in the pandas documentation. To see all titles that need to be fixed in the documentation folder, run the following command on the command line.

./scripts/validate_rst_title_capitalization.py doc/source

This program validates all RST files in the doc/source folder. Once all titles are correctly capitalized, we would like to add the above command to the ci/code_checks.sh file.

Here's a checklist of all the files that had at least one incorrectly capitalized heading:

- [ ] doc/source/user_guide/timedeltas.rst
- [ ] doc/source/whatsnew/v0.7.0.rst
- [ ] doc/source/whatsnew/v0.23.4.rst
- [ ] doc/source/whatsnew/v0.6.0.rst
- [ ] doc/source/whatsnew/v1.0.2.rst
- [ ] doc/source/whatsnew/v0.18.0.rst
- [ ] doc/source/whatsnew/v0.16.2.rst
- [ ] doc/source/whatsnew/v0.7.1.rst
- [ ] doc/source/whatsnew/v0.8.0.rst
- [ ] doc/source/user_guide/integer_na.rst
- [ ] doc/source/reference/io.rst
- [ ] doc/source/user_guide/computation.rst
- [ ] doc/source/whatsnew/v0.16.0.rst
- [ ] doc/source/whatsnew/v0.23.2.rst
- [ ] doc/source/whatsnew/v0.12.0.rst
- [ ] doc/source/getting_started/10min.rst
- [ ] doc/source/user_guide/advanced.rst
- [ ] doc/source/reference/arrays.rst
- [ ] doc/source/development/maintaining.rst
- [ ] doc/source/user_guide/groupby.rst
- [ ] doc/source/user_guide/cookbook.rst
- [ ] doc/source/development/developer.rst
- [ ] doc/source/development/meeting.rst
- [ ] doc/source/getting_started/intro_tutorials/03_subset_data.rst
- [ ] doc/source/whatsnew/v0.4.x.rst
- [ ] doc/source/whatsnew/v0.16.1.rst
- [ ] doc/source/whatsnew/v1.0.0.rst
- [ ] doc/source/whatsnew/v0.23.1.rst
- [ ] doc/source/getting_started/tutorials.rst
- [ ] doc/source/reference/series.rst
- [ ] doc/source/getting_started/intro_tutorials/02_read_write.rst
- [ ] doc/source/whatsnew/v0.6.1.rst
- [ ] doc/source/whatsnew/v0.13.1.rst
- [ ] doc/source/whatsnew/v0.21.0.rst
- [ ] doc/source/reference/frame.rst
- [ ] doc/source/whatsnew/v0.20.0.rst
- [ ] doc/source/getting_started/intro_tutorials/09_timeseries.rst
- [ ] doc/source/whatsnew/index.rst
- [ ] doc/source/user_guide/merging.rst
- [ ] doc/source/whatsnew/v0.18.1.rst
- [ ] doc/source/user_guide/enhancingperf.rst
- [ ] doc/source/development/contributing_docstring.rst
- [ ] doc/source/whatsnew/v0.9.0.rst
- [ ] doc/source/whatsnew/v0.25.2.rst
- [ ] doc/source/development/extending.rst
- [ ] doc/source/reference/window.rst
- [ ] doc/source/whatsnew/v0.7.3.rst
- [ ] doc/source/user_guide/options.rst
- [ ] doc/source/ecosystem.rst
- [ ] doc/source/getting_started/intro_tutorials/01_table_oriented.rst
- [ ] doc/source/user_guide/categorical.rst
- [ ] doc/source/whatsnew/v0.14.1.rst
- [ ] doc/source/whatsnew/v0.19.0.rst
- [ ] doc/source/whatsnew/v0.20.2.rst
- [ ] doc/source/whatsnew/v0.24.0.rst
- [ ] doc/source/development/roadmap.rst
- [ ] doc/source/whatsnew/v0.17.0.rst
- [ ] doc/source/user_guide/boolean.rst
- [ ] doc/source/getting_started/comparison/comparison_with_r.rst
- [ ] doc/source/whatsnew/v0.17.1.rst
- [ ] doc/source/whatsnew/v0.22.0.rst
- [ ] doc/source/reference/indexing.rst
- [ ] doc/source/user_guide/missing_data.rst
- [ ] doc/source/getting_started/install.rst
- [ ] doc/source/user_guide/index.rst
- [ ] doc/source/user_guide/visualization.rst
- [ ] doc/source/getting_started/comparison/comparison_with_stata.rst
- [ ] doc/source/whatsnew/v0.19.1.rst
- [ ] doc/source/whatsnew/v0.15.1.rst
- [ ] doc/source/whatsnew/v0.10.0.rst
- [ ] doc/source/whatsnew/v0.19.2.rst
- [ ] doc/source/whatsnew/v0.25.3.rst
- [ ] doc/source/user_guide/gotchas.rst
- [ ] doc/source/whatsnew/v0.14.0.rst
- [ ] doc/source/user_guide/reshaping.rst
- [ ] doc/source/reference/groupby.rst
- [ ] doc/source/whatsnew/v0.23.3.rst
- [ ] doc/source/user_guide/timeseries.rst
- [ ] doc/source/whatsnew/v0.9.1.rst
- [ ] doc/source/getting_started/comparison/comparison_with_sql.rst
- [ ] doc/source/whatsnew/v0.24.1.rst
- [ ] doc/source/reference/index.rst
- [ ] doc/source/development/policies.rst
- [ ] doc/source/whatsnew/v0.21.1.rst
- [ ] doc/source/whatsnew/v0.20.3.rst
- [ ] doc/source/development/code_style.rst
- [ ] doc/source/user_guide/sparse.rst
- [ ] doc/source/whatsnew/v0.24.2.rst
- [ ] doc/source/whatsnew/v0.15.2.rst
- [ ] doc/source/whatsnew/v1.1.0.rst
- [ ] doc/source/reference/offset_frequency.rst
- [ ] doc/source/whatsnew/v1.0.1.rst
- [ ] doc/source/getting_started/basics.rst
- [ ] doc/source/whatsnew/v0.5.0.rst
- [ ] doc/source/user_guide/text.rst
- [ ] doc/source/user_guide/indexing.rst
- [ ] doc/source/whatsnew/v0.11.0.rst
- [ ] doc/source/whatsnew/v0.8.1.rst
- [ ] doc/source/getting_started/comparison/comparison_with_sas.rst
- [ ] doc/source/whatsnew/v0.23.0.rst
- [ ] doc/source/user_guide/io.rst
- [ ] doc/source/whatsnew/v0.25.1.rst
- [ ] doc/source/whatsnew/v0.13.0.rst
- [ ] doc/source/whatsnew/v0.25.0.rst
- [ ] doc/source/whatsnew/v0.15.0.rst
- [ ] doc/source/whatsnew/v0.10.1.rst

datapythonista commented 4 years ago

@tonywu1999 do you mind editing the description and providing more context? Imagine a random user wanting to contribute to pandas lands here. We would like to explain what the problem is, why it's useful to fix it, and give step by step information on what to do (e.g. we want to add fixed files to ci/code_checks.sh).

Also, it would be great if you could get the list of files to check and add it to the description (you can use - [ ] docs/source/whatever.rst, so we can easily check the ones that are fixed).

Thanks!

themien commented 4 years ago

take

themien commented 4 years ago

@tonywu1999 working on the issue, I am getting some outputs that I am not sure are valid. If I run the script on /doc/source/whatsnew/v0.25.0.rst, for example, I get this as part of the output:

/doc/source/whatsnew/v0.25.0.rst:561:Heading capitalization formatted incorrectly. Please correctly capitalize "Indexing an IntervalIndex with Interval objects" to "Indexing an intervalindex with Interval objects"

or:

/doc/source/whatsnew/v0.25.0.rst:1087:Heading capitalization formatted incorrectly. Please correctly capitalize "-" to ""

Can you confirm that this is an expected output?

datapythonista commented 4 years ago

We just developed this validation script, so it's expected that we find some false positives. Can you find where this error is being generated, so we can see what's the problem?

tonywu1999 commented 4 years ago

It looks like those lines in the .rst files are used as bullet points rather than headings. However, those bullet points appear to be empty (i.e. they may have been inserted into the .rst file by accident). You can refer to the following website to see what I mean by empty bullet points:

https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html

control-f for

when passing a dict of columns and types the

to find the empty bullet points and to give context on what's going on.

Hope this helps.

datapythonista commented 4 years ago

There is a condition in the script where we check that a line contains only dashes (or other specific characters), and that the analysed line and the previous line have the same length. I guess we want to add another condition that the length should be greater than one and the previous line shouldn't have one of these specific characters.

Maybe that can be implemented in a separate PR, or together with fixing a single line where this happens.

themien commented 4 years ago

@tonywu1999 @datapythonista I believe there are a few capitalization exceptions missing like IntervalIndex or RangeIndex.

Also in find_titles() this statement is selecting the empty bullet points.

line_chars = set(line)
if (
    len(line_chars) == 1
    and line_chars.pop() in symbols
    and len(line) == len(previous_line)
):

datapythonista commented 4 years ago

No problem on changing whatever is needed in the script.
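A minimal sketch of what the stricter check discussed above could look like. The helper name `is_title_underline` and the contents of `SYMBOLS` are assumptions for illustration (the script defines its own `symbols`), and "the previous line shouldn't have one of these specific characters" is read here as "the previous line is not itself a run of one underline character".

```python
# Assumed set of RST underline characters; the script defines its own `symbols`.
SYMBOLS = ("=", "-", "~", "^", '"', "#", "*", "+", "`", "_")


def _is_symbol_run(text: str) -> bool:
    """True if text consists of a single repeated underline character."""
    chars = set(text)
    return len(chars) == 1 and next(iter(chars)) in SYMBOLS


def is_title_underline(line: str, previous_line: str) -> bool:
    """Decide whether `line` underlines `previous_line` as a heading.

    Both arguments are expected without their trailing newline.
    """
    return (
        len(line) > 1  # a lone "-" is an empty bullet, not an underline
        and _is_symbol_run(line)
        and len(line) == len(previous_line)
        and not _is_symbol_run(previous_line)  # previous line must be real text
    )
```

With this check, two consecutive empty bullets ("-" followed by "-") are no longer reported as a heading, while a normal title followed by an underline of the same length still is.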

cleconte987 commented 4 years ago

Hello, I'd like to work on it.

cleconte987 commented 4 years ago

Though I don't understand exactly what the issue or its goal is here. The script does well at finding all occurrences of titles that need to be decapitalized. Is the goal to actually make the changes to the documentation?

tonywu1999 commented 4 years ago

The goal of this issue is to actually make the changes to the documentation.

cleconte987 commented 4 years ago

Yes, but there are exceptions that you don't want to lowercase and that are not in CAPITALIZATION_EXCEPTIONS. What do you do with them? Should you extend the list?

datapythonista commented 4 years ago

Yes, the script will validate most cases all right, but if there is anything that needs to be changed there, like adding new keywords, you can do it.

datapythonista commented 4 years ago

Better not to open a huge PR; take a few documents (e.g. five) and just fix those. If you want to fix more (surely appreciated), then keep opening PRs, no problem in opening many.

Thanks!

cleconte987 commented 4 years ago

I am not very used to git yet. How do I push to the remote repository? I have pulled the repository to my local machine. I have modified some files in doc and committed, but it doesn't work when I push to the GitHub url. What is the url I should push to?

datapythonista commented 4 years ago

@cleconte987 you need to open a pull request. It's a bit tricky the first time, but there are resources out there to help you know how it works. If you don't find anything better, you can see these slides https://docs.google.com/presentation/d/1rOSYXZPyMe9KXnbVK_xbJzw_-ijxd6bIxndmvPU6L2o/edit?usp=sharing and this video (sorry the audio is awful): https://www.youtube.com/watch?v=LCTk0leNH1g

tonywu1999 commented 4 years ago

https://dev.pandas.io/docs/development/contributing.html

I started contributing 2 months ago, and I found that this link helped me a lot.

cleconte987 commented 4 years ago

Ok thank you

themien commented 4 years ago

@cleconte987 I am already on the issue. Will do a pull request with all the updated documentation soon

cleconte987 commented 4 years ago

Well, what should I do now? @tonywu1999 @datapythonista. I started to commit to the documentation. I guess you are the assignee. I'm here if I can help

datapythonista commented 4 years ago

As said earlier, you should be working in small batches, so keep opening small pull requests with the fixes, and we'll be merging them. There are many titles to fix; try to coordinate if possible, but more than one person can work on this, no problem.

cleconte987 commented 4 years ago

And I think it's not correct to lower words like DataFrame to Dataframe, shouldn't it be kept with capitalization?

datapythonista commented 4 years ago

That can be tricky, but I think DataFrame is the best option.

iahsanujunda commented 4 years ago

Hi, I tried working on this one and found something I would like to confirm. When running:

./scripts/validate_rst_title_capitalization.py doc/source/getting_started/comparison/comparison_with_r.rst

it produced:

doc/source/getting_started/comparison/comparison_with_r.rst:205:Heading capitalization formatted incorrectly. Please correctly capitalize "|Tapply|" to "Tapply|"
doc/source/getting_started/comparison/comparison_with_r.rst:311:Heading capitalization formatted incorrectly. Please correctly capitalize "|Ddply|" to "Ddply|"
doc/source/getting_started/comparison/comparison_with_r.rst:424:Heading capitalization formatted incorrectly. Please correctly capitalize "|Cast|" to "Cast|"
doc/source/getting_started/comparison/comparison_with_r.rst:497:Heading capitalization formatted incorrectly. Please correctly capitalize "|Factor|" to "Factor|"

Should I go ahead and follow this? Cheers

data-RanDan commented 3 years ago

Hi, can I fix some of these?

datapythonista commented 3 years ago

> Can I fix some of these?

Sure. Not sure what's the status, I guess @cleconte987 can help you understand what's missing.

rajalakshmi139 commented 3 years ago

Hi, I would like to contribute and fix some of these.

cleconte987 commented 3 years ago

Hello, I think the folders doc/source/development and doc/source/reference are done, and doc/source/whatsnew is about to be finished. The rest is still to be done.

ShyamDesai commented 3 years ago

take

Ayushihelloworld commented 3 years ago

take

bsun94 commented 3 years ago

Hi everyone, can I join in and help out on this issue in any way?

vrushalit commented 3 years ago

Hey! I'm a new contributor and this is a beginner issue that excites me. Can someone help me get started with resolving this issue, and tell me what I should learn?

datapythonista commented 3 years ago

Feel free to help. Just run the validation script, pick a file, and open a pull request with the fixes to that file. There is documentation on how to contribute to pandas here: https://pandas.pydata.org/docs/development/index.html

kswarj commented 2 years ago

Hey! I am a new contributor and I would like to work on this issue

kswarj commented 2 years ago

take

SomtochiUmeh commented 1 year ago

New contributor here 👋🏾.

SomtochiUmeh commented 1 year ago

take

SomtochiUmeh commented 1 year ago

Hey, RadViz should be kept as is, right?

SomtochiUmeh commented 1 year ago

Also SparseArray and SparseDtype?

INDIG0N commented 1 year ago

Hey @datapythonista, I was starting to work on this issue and came across a weird scenario, specifically with the stumpy package mentioned in ecosystem.rst.

So, ecosystem.rst refers to the package in all caps, "STUMPY". The script catches this of course and says to correct it to "Stumpy". It looks like the authors of the package refer to it in all caps in their documentation, which matches the current capitalization we use, but when importing the package and using it, it's all lowercase.

In situations like this, should I use the capitalization the script suggests, correct the capitalization to all lowercase to match with how it's imported, add the package name to the list of exceptions in the script itself, or the last 2 combined?

datapythonista commented 1 year ago

You can add it to the list of exceptions. Or, if you think it's reasonable and not too complicated, just skip that level of header (probably h3) of the ecosystem page, as everything in it should be a package name if I'm not wrong.

INDIG0N commented 1 year ago

@datapythonista Thanks, I had another question though. It looks like the script is asking me to change the capitalization in one of the urls.

For reference this is the original url: https://github.com/TDAmeritrade/stumpy

It wants me to make the link all lowercase. The link works fine as it is, but weirdly enough putting the link in all lowercase also seems to work fine, and I have no idea why. Is there some kind of weird behavior that means I shouldn't change the capitalization in links or am I good to go?

datapythonista commented 1 year ago

URLs are not case sensitive afaik. So, making the url all lowercase shouldn't be a problem when clicking on it. I guess the capitalization is more for branding, and it'd probably be nice to keep it and not validate links in the titles. If it doesn't introduce much extra complexity to the validation, and you want to give it a try, that would be great.
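One way this could be sketched is to drop any URL-bearing token before the words of a title are checked. Note that while scheme and host names are case insensitive, the path part of a URL can be case sensitive on some servers, so leaving links untouched is the safer choice anyway. The helper name `words_to_validate` and the example title are hypothetical, not taken from the script or from ecosystem.rst.

```python
import re

# Matches an http(s) URL anywhere inside a whitespace-separated token.
URL_PATTERN = re.compile(r"https?://\S+")


def words_to_validate(title: str) -> list[str]:
    """Return the words of a title, dropping any token that contains a URL."""
    return [word for word in title.split() if not URL_PATTERN.search(word)]


# Hypothetical RST title with an inline link; only the link text is kept.
example = "`STUMPY <https://github.com/TDAmeritrade/stumpy>`__"
remaining = words_to_validate(example)
```

The capitalization check would then only look at `remaining`, so the link target keeps whatever case the project uses.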

harsimran44 commented 9 months ago

Can I work on this issue?

suresh33661 commented 9 months ago

take

harsimran44 commented 9 months ago

Take

skregas commented 8 months ago

Hi, can an admin take a look at #55685? Not sure how to make the tests pass. I didn't make any changes to anything that's being tested in the checks. Thanks

kajor3k commented 1 week ago

Hey everyone - I've created my PR. I've seen that this story has been considered too wide to squeeze into one PR, hence I covered only the 3 most recent files from the whatsnew directory for now (2.2.1 didn't require any changes).

In the original comment in this issue, I saw that the proposed way of running the script was: ./scripts/validate_rst_title_capitalization.py doc/source, but unfortunately that won't work anymore as the script requires a list of strings. In other words, one needs to provide particular files, e.g. scripts/validate_rst_title_capitalization.py doc/source/whatsnew/v2.2.1.rst doc/source/whatsnew/v2.2.2.rst
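Until the script accepts a directory again, the file list can be built outside of it. The following is a minimal sketch, assuming the script takes one or more file paths as positional arguments (as in the command above); the helper names `collect_rst_files` and `build_command` are made up for illustration.

```python
import sys
from pathlib import Path

SCRIPT = "scripts/validate_rst_title_capitalization.py"


def collect_rst_files(root: str) -> list[str]:
    """Return every .rst file under root (recursively), sorted for stable output."""
    return sorted(str(path) for path in Path(root).rglob("*.rst"))


def build_command(root: str) -> list[str]:
    """Build the argument list that runs the validation script on all files under root."""
    return [sys.executable, SCRIPT, *collect_rst_files(root)]
```

The resulting list can be passed to `subprocess.run(build_command("doc/source"))` from the repository root.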

I also tried to reuse exclusions wherever it was possible, i.e. instead of adding "I/O" to the list I've edited rst to use "IO" as the second one was already on the list.

I also think that there's a need for suppressing some of the validations, and exclusions may not be enough. E.g. "pandas" is added to exclusions with underscore, however it can also be used at the beginning of the title and then this particular entry in the exclusions doesn't work as expected.

I'll be happy to pick up other files as well and trigger some discussions, but before I do so, I just wanted to confirm with you if that's an expected way of working.

Potential future stories:

1. From what I see, this script has never been turned on in code_checks.sh. That's something I could tackle as well. In order to achieve that, I think a good predecessor story would be to allow this script to run for all files in docs and subdirectories. I see that the validator is being run on the PR, however it is not configured in code_checks.sh.

2. I think some more sophisticated logic for exclusions should be introduced as well. Maybe a "rule" approach would be a good choice here? The first example to tackle could be the "pandas" word example I've described above.
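A rough sketch of how such a "rule" approach might look, purely as a starting point for discussion. Every name here (`RULES`, `apply_rules`, the individual rule functions, the small `EXCEPTIONS` set) is hypothetical and not part of the current script. Each rule receives the position of a word and the word itself, and either returns the form to use or `None` to pass the word on to the next rule.

```python
from typing import Callable, Optional

# A rule maps (position in title, word) to the corrected word, or None to defer.
Rule = Callable[[int, str], Optional[str]]

# Hypothetical, shortened exception list.
EXCEPTIONS = {"Series", "DataFrame", "IO"}


def keep_lowercase_pandas(position: int, word: str) -> Optional[str]:
    """'pandas' stays lowercase everywhere, including at the start of a title."""
    return word if word == "pandas" else None


def keep_exceptions(position: int, word: str) -> Optional[str]:
    """Words in the exception list keep their capitalization."""
    return word if word in EXCEPTIONS else None


def sentence_case(position: int, word: str) -> Optional[str]:
    """Fallback: capitalize the first word, lowercase the rest."""
    return word.capitalize() if position == 0 else word.lower()


# Rules are tried in order; the first one that returns a value wins.
RULES: list[Rule] = [keep_lowercase_pandas, keep_exceptions, sentence_case]


def apply_rules(title: str) -> str:
    """Return the suggested capitalization of a title."""
    fixed = []
    for position, word in enumerate(title.split()):
        for rule in RULES:
            result = rule(position, word)
            if result is not None:
                fixed.append(result)
                break
    return " ".join(fixed)
```

New special cases would then be added as one more function in `RULES` instead of growing a flat list of words.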