pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Having problem with 'pandas.DataFrame.loc' for multi type and size dataframes. #32372

Open navidzol opened 4 years ago

navidzol commented 4 years ago

The problem: When I create a pandas data frame whose columns hold values of different types and sizes (one column holds NumPy arrays, the other plain integers), I cannot reassign an element of the array column using loc. An example explains it better:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [np.array([1,2,3]), np.array([4,5,6]), np.array([7,8,9]), np.array([10,11,12]), np.array([13,14,15])], 'b':[5,5,12,123,5]})

df.loc[2,'a']= np.array([53,23,4])

The error that I receive is:

Traceback (most recent call last):
  File "...", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-67-4741bddaf261>", line 1, in <module>
    df.loc[0,'a']= np.array([53,23,4])
  File "...", line 205, in __setitem__
    self._setitem_with_indexer(indexer, value)
  File "...", line 547, in _setitem_with_indexer
    "Must have equal len keys and value "

ValueError: Must have equal len keys and value when setting with an iterable

Now, if I remove the second column ('b') and build my data frame as

df = pd.DataFrame({'a': [np.array([1,2,3]), np.array([4,5,6]), np.array([7,8,9]), np.array([10,11,12]), np.array([13,14,15])]})

df.loc[2,'a']= np.array([53,23,4])

I do not get any error. I also noticed that doing the assignment with

df['a'][2]= np.array([53,23,4])

works but raises a "SettingWithCopyWarning". However, I can work around the whole thing with the following code, which produces neither the error nor the warning (thanks to ALS777 from Stack Overflow):

df.at[2,'a'] = np.array([52,23,34])

Problem description

I would expect this to work with loc, since it works with direct indexing like df['a'][2]. Also, why does the presence of another column affect the assignment to this column?
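The contrast described above can be checked with a small script. This is only a sketch: whether loc raises, and with which exception, may depend on the installed pandas version, so the loc call is wrapped in try/except rather than assumed to fail. The helper names make_frame and try_loc_assignment are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd


def make_frame():
    # Same two-column frame as in the report: 'a' holds arrays, 'b' holds ints.
    arrays = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9]),
              np.array([10, 11, 12]), np.array([13, 14, 15])]
    return pd.DataFrame({'a': arrays, 'b': [5, 5, 12, 123, 5]})


def try_loc_assignment(df, row, col, value):
    # Hypothetical helper: attempt the loc assignment and report the outcome.
    # The broad except is deliberate, since the behavior may vary by version.
    try:
        df.loc[row, col] = value
    except Exception as exc:
        return f"loc raised {type(exc).__name__}: {exc}"
    return "loc assignment did not raise"


new_value = np.array([53, 23, 4])

# Run the loc attempt on a throwaway frame so it cannot affect the one below.
print(try_loc_assignment(make_frame(), 2, 'a', new_value))

# The at accessor addresses exactly one cell, so the array is stored whole.
df = make_frame()
df.at[2, 'a'] = new_value
print(df)
```

Running both assignments on separate frames keeps the comparison clean: a failed or partial loc assignment cannot influence the result of the at assignment.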

My specs are:

Output of pd.show_versions()

INSTALLED VERSIONS

commit            : None
python            : 3.7.5.final.0
python-bits       : 64
OS                : Linux
OS-release        : 5.3.0-40-generic
machine           : x86_64
processor         : x86_64
byteorder         : little
LC_ALL            : None
LANG              : en_US.UTF-8
LOCALE            : en_US.UTF-8
pandas            : 0.25.3
numpy             : 1.17.4
pytz              : 2019.3
dateutil          : 2.8.1
pip               : 19.3.1
setuptools        : 42.0.2.post20191203
Cython            : None
pytest            : None
hypothesis        : None
sphinx            : None
blosc             : None
feather           : None
xlsxwriter        : None
lxml.etree        : None
html5lib          : None
pymysql           : None
psycopg2          : None
jinja2            : None
IPython           : None
pandas_datareader : None
bs4               : None
bottleneck        : None
fastparquet       : None
gcsfs             : None
lxml.etree        : None
matplotlib        : 3.1.1
numexpr           : 2.7.0
odfpy             : None
openpyxl          : None
pandas_gbq        : None
pyarrow           : None
pytables          : None
s3fs              : None
scipy             : 1.3.2
sqlalchemy        : None
tables            : 3.6.1
xarray            : None
xlrd              : None
xlwt              : None
xlsxwriter        : None

TomAugspurger commented 4 years ago

In general, pandas doesn't work with "nested" data like this.

As a workaround, using at works:

In [15]: df.at[2,'a']= np.array([53,23,4])

In [16]: df
Out[16]:
              a    b
0     [1, 2, 3]    5
1     [4, 5, 6]    5
2   [53, 23, 4]   12
3  [10, 11, 12]  123
4  [13, 14, 15]    5

I don't think this is likely to be fixed anytime soon.
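Since nested cells are not well supported, one way to sidestep them is to give each array slot its own scalar column. This layout is not proposed anywhere in the thread; it is a sketch that assumes every array has the same length (three here), and the column names a_0, a_1, a_2 are hypothetical.

```python
import numpy as np
import pandas as pd

# Stack the five length-3 arrays from the report into a 5x3 block.
block = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9],
                  [10, 11, 12], [13, 14, 15]])

# Hypothetical column names: one scalar column per array slot.
a_cols = ['a_0', 'a_1', 'a_2']
flat = pd.DataFrame(block, columns=a_cols)
flat['b'] = [5, 5, 12, 123, 5]

# With scalar cells, loc receives three values for three columns.
flat.loc[2, a_cols] = [53, 23, 4]

# Rebuild the array for a row when it is needed again.
row_2 = flat.loc[2, a_cols].to_numpy()
print(row_2)
```

The trade-off is that this only fits arrays of a fixed, known length; ragged arrays would still need object cells and the at workaround.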