pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.68k stars 17.91k forks source link

question: filling missing values in one column with values from another column #4420

Closed bluefir closed 11 years ago

bluefir commented 11 years ago

Is there a more concise way of accomplishing the following?

portfolio_data[FIELD_NAME_COUNTRY] = portfolio_data[FIELD_NAME_ISSUER_COUNTRY]
countries = portfolio_data[FIELD_NAME_COUNTRY]
flag_no_country = countries.isnull()
countries[flag_no_country] = portfolio_data.ix[flag_no_country, FIELD_NAME_MSCI_COUNTRY]
portfolio_data[FIELD_NAME_COUNTRY] = countries
jreback commented 11 years ago

You don't need to mask the rhs of the expression, it will align directly

In [30]: df = DataFrame(dict(country = ['A','B',np.nan,'D'],issuer=['Ai','Bi','Ci','D']))

In [31]: df
Out[31]: 
  country issuer
0       A     Ai
1       B     Bi
2     NaN     Ci
3       D      D

In [32]: df.loc[df['country'].isnull(),'country'] = df['issuer']

In [33]: df
Out[33]: 
  country issuer
0       A     Ai
1       B     Bi
2      Ci     Ci
3       D      D
bluefir commented 11 years ago

Thanks!

bluefir commented 11 years ago

Look at what I get

In [12]:

column_country = TopContributors.FIELD_NAME_COUNTRY
column_issuer_country = TopContributors.FIELD_NAME_ISSUER_COUNTRY
column_msci_country = TopContributors.FIELD_NAME_MSCI_COUNTRY
In [13]:

country_columns = [column_issuer_country, column_msci_country, column_country]
In [14]:

portfolio_data.loc[portfolio_data['issuer_country'].isnull(), country_columns]
Out[14]:
         issuer_country msci_country country
25386810            NaN           US     NaN
31374720            NaN           US     NaN
42217K10            NaN           US     NaN
49446R10            NaN           US     NaN
53117210            NaN           US     NaN
620088              NaN           AU     NaN
620884              NaN           ZA     NaN
624730              NaN           AU     NaN
72925110            NaN           US     NaN
74340W10            NaN           US     NaN
74460D10            NaN           US     NaN
75884910            NaN           US     NaN
90265310            NaN           US     NaN
92904210            NaN           US     NaN
B0RF60              NaN           AU     NaN
B1G8KX              NaN           BR     NaN
B27WYK              NaN           BR     NaN
B4TXDZ              NaN           HK     NaN
B70DWB              NaN           AU     NaN
In [15]:

portfolio_data.loc[portfolio_data[column_country].isnull(), column_country] = portfolio_data[column_msci_country]
In [16]:

portfolio_data.loc[portfolio_data[column_issuer_country].isnull(), country_columns]
Out[16]:
         issuer_country msci_country country
25386810            NaN           US      US
31374720            NaN           US      US
42217K10            NaN           US      US
49446R10            NaN           US      US
53117210            NaN           US      US
620088              NaN           AU      US
620884              NaN           ZA      US
624730              NaN           AU      US
72925110            NaN           US      US
74340W10            NaN           US      US
74460D10            NaN           US      US
75884910            NaN           US      US
90265310            NaN           US      US
92904210            NaN           US      US
B0RF60              NaN           AU      US
B1G8KX              NaN           BR      US
B27WYK              NaN           BR      US
B4TXDZ              NaN           HK      US
B70DWB              NaN           AU      US
bluefir commented 11 years ago

This works

In [15]:

#portfolio_data.loc[portfolio_data[column_country].isnull(), column_country] = portfolio_data[column_msci_country]
countries = portfolio_data[column_country]
flag_no_country = countries.isnull()
countries[flag_no_country] = portfolio_data.loc[flag_no_country, column_msci_country]
portfolio_data[column_country] = countries
In [16]:

portfolio_data.loc[portfolio_data[column_issuer_country].isnull(), country_columns]
Out[16]:
         issuer_country msci_country country
25386810            NaN           US      US
31374720            NaN           US      US
42217K10            NaN           US      US
49446R10            NaN           US      US
53117210            NaN           US      US
620088              NaN           AU      AU
620884              NaN           ZA      ZA
624730              NaN           AU      AU
72925110            NaN           US      US
74340W10            NaN           US      US
74460D10            NaN           US      US
75884910            NaN           US      US
90265310            NaN           US      US
92904210            NaN           US      US
B0RF60              NaN           AU      AU
B1G8KX              NaN           BR      BR
B27WYK              NaN           BR      BR
B4TXDZ              NaN           HK      HK
B70DWB              NaN           AU      AU

I have HDF5 file with the data if you are interested. Can I attach it to this thread?

jreback commented 11 years ago

put it on a public site (like dropbox), and put the shared link here

bluefir commented 11 years ago

https://www.dropbox.com/s/49smfvkfsshk1kh/top_contributors%20-%20Copy.h5

jreback commented 11 years ago

so you are trying to fill in 'msci_country' with country if its NaN, right?

jreback commented 11 years ago

Does this not do what you want?

In [13]: df
Out[13]: 
<class 'pandas.core.frame.DataFrame'>
Index: 15504 entries, 000312 to Y8565N10
Data columns (total 11 columns):
MarketCap           15503  non-null values
alpha               15482  non-null values
gics_code           15503  non-null values
investable          15504  non-null values
issuer_country      15485  non-null values
msci_country        11019  non-null values
universe            15504  non-null values
weight_benchmark    15504  non-null values
weight_portfolio    15504  non-null values
weight_active       15504  non-null values
country             15485  non-null values
dtypes: bool(2), float64(6), object(3)

In [14]: df.loc[df['msci_country'].isnull(),'msci_country'] = df['country']

In [15]: df
Out[15]: 
<class 'pandas.core.frame.DataFrame'>
Index: 15504 entries, 000312 to Y8565N10
Data columns (total 11 columns):
MarketCap           15503  non-null values
alpha               15482  non-null values
gics_code           15503  non-null values
investable          15504  non-null values
issuer_country      15485  non-null values
msci_country        15504  non-null values
universe            15504  non-null values
weight_benchmark    15504  non-null values
weight_portfolio    15504  non-null values
weight_active       15504  non-null values
country             15485  non-null values
dtypes: bool(2), float64(6), object(3)

In [16]: df.loc['Y8565N10']
Out[16]: 
MarketCap                 234.05
alpha                  -3.799255
gics_code           1.010204e+07
investable                  True
issuer_country                US
msci_country                  US
universe                    True
weight_benchmark               0
weight_portfolio               0
weight_active                  0
country                       US
Name: Y8565N10, dtype: object
bluefir commented 11 years ago

I am trying to do the opposite:

df.loc[df['country'].isnull(),'country'] = df['msci_country']
jreback commented 11 years ago

seems ok to me are you on 0.11? (there was a bug related to this), it prob assigns to all the same value

try using 0.12

In [6]: df.loc[df['country'].isnull(),'country'] = df['msci_country']

In [7]: df
Out[7]: 
<class 'pandas.core.frame.DataFrame'>
Index: 15504 entries, 000312 to Y8565N10
Data columns (total 11 columns):
MarketCap           15503  non-null values
alpha               15482  non-null values
gics_code           15503  non-null values
investable          15504  non-null values
issuer_country      15485  non-null values
msci_country        11019  non-null values
universe            15504  non-null values
weight_benchmark    15504  non-null values
weight_portfolio    15504  non-null values
weight_active       15504  non-null values
country             15504  non-null values
dtypes: bool(2), float64(6), object(3)
bluefir commented 11 years ago

Yes, I was on 0.11. It is indeed fixed with 0.12. I switched but went back to 0.11 because of a regression in 0.12 for DataFrame().hist() display. Switching to 0.12 now! Thanks!

jreback commented 11 years ago

fyi....I believe the .hist is just fixed in latest master IIRC, (and windows builds should be on dev site soon)

bluefir commented 11 years ago

yep, it was fixed. waiting for the new build...