DOC: section on caveats of storing lists inside DataFrame/Series

pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

https://pandas.pydata.org

BSD 3-Clause "New" or "Revised" License


DOC: section on caveats of storing lists inside DataFrame/Series #17027

Open chris-b1 opened 7 years ago

chris-b1 commented 7 years ago

xref to a lot of issues, for example #16864

I think we could use a doc section stating that storing nested lists/arrays inside a pandas object is best avoided, showing the downsides (performance, memory use) and a worked example of an alternative. This seems to be hard-earned knowledge that many have, but I'm not sure we do a good job of stating it clearly.
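To make the downside concrete, here is a minimal sketch (not taken from the pandas docs; the `player`/`scores` data is made up for illustration). It stores the same values once as lists in cells and once in long form. The list column ends up with `object` dtype, so any computation on it has to go through a Python-level `apply`, while the long-form column is a native integer column that works with `groupby`.

```python
import pandas as pd

# Hypothetical data: each player's scores stored as a list in a single cell.
nested = pd.DataFrame({
    "player": ["a", "b"],
    "scores": [[1, 2, 3], [4, 5]],
})

# The list column is object dtype, so totals need a Python-level apply.
nested_totals = nested.set_index("player")["scores"].apply(sum)

# Long form: one row per (player, score) pair, stored as a native integer column.
long = pd.DataFrame({
    "player": ["a", "a", "a", "b", "b"],
    "score": [1, 2, 3, 4, 5],
})

# The same totals, computed with a regular groupby aggregation.
long_totals = long.groupby("player")["score"].sum()

print(nested["scores"].dtype, long["score"].dtype)
print(long_totals)
```

Both approaches give the same totals; the difference is in how the values are stored and which pandas operations apply to them directly.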

Closely related: the docs might also benefit from a short section encouraging use of Python core data structures when appropriate.

probably goes here - http://pandas.pydata.org/pandas-docs/stable/gotchas.html

pdpark commented 6 years ago

I'd be happy to take this, but I'm not sure what "a worked out example of an alternative" would look like. I've found a few discussions about storing lists in DataFrame cells and none of them discouraged it. This discussion on Stack Overflow is the only one I've found with alternatives: https://stackoverflow.com/questions/39661198/optimal-way-to-add-small-lists-to-pandas-dataframe. Which is the best option? Or is there another, better option? Thanks.

jreback commented 6 years ago

https://stackoverflow.com/questions/45587778/python-explode-rows-from-panda-dataframe
https://stackoverflow.com/questions/44361160/explode-a-csv-in-python
https://stackoverflow.com/questions/38428796/how-to-do-lateral-view-explode-in-pandas

FYI, the timings in those answers are suspect, of course; the examples don't use a large enough frame for the differences to actually matter.

https://github.com/pandas-dev/pandas/issues/16538

We should make a small section on this. We should probably also just write .explode. (Note that for strings we already have this: it's the expand=True option in .str.split().)
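For reference, a small sketch of the string case mentioned above, using a made-up Series of comma-separated values. `expand=True` spreads the pieces across columns. The last two lines assume pandas 0.25 or newer, where `Series.explode` exists; it did not exist when this comment was written.

```python
import pandas as pd

# Hypothetical input: delimited strings of uneven length.
s = pd.Series(["a,b,c", "d,e"])

# expand=True returns a DataFrame with one column per piece;
# shorter rows are padded with missing values.
wide = s.str.split(",", expand=True)

# Assumes pandas >= 0.25: split into lists, then emit one row per element.
# The original index label is repeated for each element.
exploded = s.str.split(",").explode()

print(wide)
print(exploded)
```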

jreback commented 6 years ago

more refs

https://github.com/pandas-dev/pandas/issues/8517

jreback commented 6 years ago

This is pretty idiomatic / efficient.

(pd.melt(df.nearest_neighbors.apply(pd.Series).reset_index(),
         id_vars=['name', 'opponent'],
         value_name='nearest_neighbors')
 .set_index(['name', 'opponent'])
 .drop('variable', axis=1)
 .dropna()
 .sort_index()
 )

pdpark commented 6 years ago

I read through the examples in the links, very informative, thanks. I'll put something together and submit a PR.

pdpark commented 6 years ago

Just want to clarify something: this issue was opened with the intent, as I understand it, to document the fact that storing lists in dataframes is not ideal. However, the examples above are all about how to explode lists stored in data frames. Is the recommended approach to create a temporary data frame with lists in order to create the preferred dataframe without lists?

jreback commented 6 years ago

No, a long-form DataFrame is ideal from both a performance and an idiomatic perspective. Those examples illustrate what to do if you already have lists.

The point is that you shouldn't have them in the first place; if you do, then you invariably need to convert them anyway.
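As a rough illustration of "not having them in the first place", here is a sketch that goes straight from plain Python data to a long-form frame, with no list-valued cells at any point. The `neighbors` dict and the column name `nearest_neighbor` are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical source data held in core Python structures:
# (name, opponent) -> list of nearest neighbors.
neighbors = {
    ("A.J. Price", "76ers"): ["Zach LaVine", "Jeremy Lin"],
    ("A.J. Price", "blazers"): ["Nate Robinson"],
}

# Flatten to one record per (name, opponent, neighbor) before building the frame.
records = [
    (name, opponent, neighbor)
    for (name, opponent), nns in neighbors.items()
    for neighbor in nns
]

long_df = pd.DataFrame(records, columns=["name", "opponent", "nearest_neighbor"])
print(long_df)
```

The flattening happens in a list comprehension, so pandas only ever sees scalar values.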

pdpark commented 6 years ago

This example, also from here: https://stackoverflow.com/a/46161733, seems simpler/easier to understand?

(df.nearest_neighbors.apply(pd.Series)
   .stack()
   .reset_index(level=2, drop=True)
   .to_frame('nearest_neighbors'))

Any reason not to prefer it as the canonical example?
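For anyone who wants to run the stack-based version, here is a self-contained sketch. The sample frame is made up, with a (`name`, `opponent`) MultiIndex as the snippet assumes. The extra `.dropna()` is my addition, not part of the Stack Overflow answer: it keeps the result the same for uneven lists regardless of whether the installed pandas version drops missing values in `stack()` by default.

```python
import pandas as pd

# Hypothetical frame with lists in cells, indexed by (name, opponent).
df = pd.DataFrame(
    {"nearest_neighbors": [["Zach LaVine", "Jeremy Lin"], ["Nate Robinson"]]},
    index=pd.MultiIndex.from_tuples(
        [("A.J. Price", "76ers"), ("A.J. Price", "blazers")],
        names=["name", "opponent"],
    ),
)

exploded = (
    df.nearest_neighbors.apply(pd.Series)  # one column per list position
    .stack()                               # move positions into a third index level
    .dropna()                              # assumption: drop padding explicitly
    .reset_index(level=2, drop=True)       # discard the list-position level
    .to_frame("nearest_neighbors")
)
print(exploded)
```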

jreback commented 6 years ago

yep that prob would be a nice example

pdpark commented 6 years ago

Cool, thanks.

pdpark commented 6 years ago

I want to include an example of doing an "explosion" without creating an intermediary df with lists in cells. Here's my example - what do you think?

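from collections import OrderedDict

import pandas as pd

# Base frame; no cell holds a list
df = pd.DataFrame(OrderedDict([
    ('name', ['A.J. Price'] * 3),
    ('opponent', ['76ers', 'blazers', 'bobcats']),
    ('attribute x', ['A', 'B', 'C']),
]))

# One list of neighbors per row, kept outside the frame
nn = [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3

# Spread the lists across columns, then stack them into one long column
df2 = pd.concat([df[['name', 'opponent']], pd.DataFrame(nn)], axis=1)

df3 = (df2.set_index(['name', 'opponent'])
          .stack()
          .reset_index(level=2, drop=True)
          .to_frame('nearest_neighbors'))
df3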

pdpark commented 6 years ago

Added this change to existing pull request.