pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Add a "Best Practices" document #27483

Open TomAugspurger opened 5 years ago

TomAugspurger commented 5 years ago

I'd like to have a document that describes how we think people should write pandas code.

This introduces a bit of friction when documenting something, since you'll need to decide "does it go in best practices or the user guide?" But I think the idea of a "best practices" document with opinionated, short examples and prose, linking back to the user guide and API docs, is valuable.

I've started a notebook at https://mybinder.org/v2/gh/TomAugspurger/pandas-best-practices/master?filepath=Best%20Practices.ipynb

Are there any sections you would add / remove?

Would you structure it differently?

(Tangentially, I'd like to explore how we can incorporate Binder into our documentation.)

TomAugspurger commented 5 years ago

There's probably a lot of overlap between this and https://github.com/pandas-dev/pandas/issues/26831.

datajanko commented 5 years ago

What about:

If you search Stack Overflow, lots of questions still use chained indexing.
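For readers unfamiliar with the term, here is a minimal sketch of what chained indexing looks like and the single `.loc` call that usually replaces it. The frame and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Chained indexing: two separate indexing steps. The first step may return
# a copy, so the assignment might never reach `df` (pandas may also warn).
# df[df["a"] > 2]["b"] = 0

# Single .loc call: the row mask and the column label go in one operation,
# so the assignment is applied to `df` itself.
df.loc[df["a"] > 2, "b"] = 0
```

The `.loc` form also reads as one statement of intent: "rows where `a > 2`, column `b`".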

Additionally, in lots of questions people want to iterate, which most of the time can be avoided by using vectorisation, boolean masks, etc. I would put this under tidy data: since people often start with awfully formatted data, we could emphasize how easy tasks become when data are well formatted. (Think of lists of strings or tuples in a column.)
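A rough sketch of both points, using invented data: first a row-by-row loop next to the equivalent boolean mask, then a column holding lists reshaped into one value per row with `DataFrame.explode`.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"city": ["A", "B", "C"], "temp_c": [12.0, 31.5, 25.0]})

# Row-by-row iteration: works, but is verbose and slow on large frames.
hot_loop = []
for _, row in df.iterrows():
    if row["temp_c"] > 20:
        hot_loop.append(row["city"])

# Boolean mask: the comparison is evaluated for the whole column at once.
hot_mask = df["temp_c"] > 20
hot_vec = df.loc[hot_mask, "city"].tolist()

# A column of lists is awkward to query. Exploding it yields one item per
# row, after which ordinary methods such as value_counts just work.
orders = pd.DataFrame({"customer": ["x", "y"], "items": [["pen", "ink"], ["pad"]]})
tidy = orders.explode("items").reset_index(drop=True)
item_counts = tidy["items"].value_counts()
```

Once the data are in the long, one-observation-per-row shape, most of the questions that tempt people into loops reduce to a mask, a `groupby`, or a `value_counts`.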

JMBurley commented 4 years ago

+1 for avoiding iteration and for using boolean masks. From interviews, I can confirm that a majority of newbies are bad at both.

On a broader point, I think "how you should write pandas code" falls into two buckets:

I think both are valuable, and a good best-practices document would be helpful for the community at large. Syntactic sugar can be addressed by an opinionated doc with short examples like the airport ones in @TomAugspurger's notebook; efficiency is best addressed (IMO) with plots showing the runtime and memory footprint of different methods (see https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas for a good example on iterrows).
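As a sketch of how the numbers behind such a plot could be gathered, the snippet below times an `iterrows` loop against the vectorised equivalent with the standard-library `timeit` module. The data, function names, and frame size are arbitrary choices for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data; the size is chosen arbitrarily.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(2_000), "y": rng.random(2_000)})


def with_iterrows(frame):
    """Sum of x * y computed one row at a time."""
    total = 0.0
    for _, row in frame.iterrows():
        total += row["x"] * row["y"]
    return total


def vectorised(frame):
    """Same sum computed on whole columns."""
    return float((frame["x"] * frame["y"]).sum())


# Time each approach over a few repetitions.
t_loop = timeit.timeit(lambda: with_iterrows(df), number=3)
t_vec = timeit.timeit(lambda: vectorised(df), number=3)
print(f"iterrows: {t_loop:.4f}s, vectorised: {t_vec:.4f}s")
```

Repeating the measurement over a range of frame sizes would give the points for a runtime plot; `df.memory_usage(deep=True)` is one way to collect the memory side.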

I am not sure how to give this document the necessary visibility to make it useful, although that is a problem to be solved after there is a defined document that the community thinks is great.

TomAugspurger commented 4 years ago

Probably not happening for 1.0.