isdsucph / isds2022

Introduction to Social Data Science 2022 - a summer school course https://isdsucph.github.io/isds2022/
MIT License

0.4.2 Why does my code not work? #24

Closed (simull closed this issue 2 years ago)

simull commented 2 years ago

Dear all,

can someone explain to me why the following code does not work? When I print the DataFrame's head, the columns have not been renamed. I found another way to do it that works, but I wonder how to do it with df.columns = cols.

Best, Simon

cols = ['station', 'datetime', 'obs_type', 'obs_value']
df_weather.iloc[:, :4].columns = cols
df_weather.head()

I also tried an alternative selection of the columns, the problem is the same, as expected:

cols = ['station', 'datetime', 'obs_type', 'obs_value']
df_weather[[0,1,2,3]].columns = cols
df_weather.head()

The alternative that works:

col_dict = {0:'station', 1:'datetime', 2:'obs_type', 3:'obs_value'}
df_weather.rename(columns = col_dict, inplace = True)

isdsucph commented 2 years ago

Hi @simull

Great question. It all comes down to whether you are working with a view or a copy of the dataframe. In this case, df_weather.iloc[:, :4] returns a new dataframe object (a copy of the selected columns), not df_weather itself. In the line

df_weather.iloc[:, :4].columns = cols

you assign the new column names to that copy. However, since the copy is never assigned to a variable, it is discarded right away. In particular, it is not stored in the variable df_weather. Therefore, when you print the head of df_weather in the next line, its columns are unchanged.

Your working solution does not make a copy. With inplace = True, the rename method changes the column names of the original dataframe object directly.
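A minimal, self-contained sketch of that difference, assuming a hypothetical one-row frame `df` with integer column labels standing in for df_weather (the frame and the variable names `labels_before` and `renamed` are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for df_weather: one row, integer column labels 0..4
df = pd.DataFrame([["AGE00135039", 18630101, "PRCP", 0, None]])
col_dict = {0: "station", 1: "datetime", 2: "obs_type", 3: "obs_value"}

# Without inplace=True, rename returns a new dataframe and leaves df untouched
renamed = df.rename(columns=col_dict)
labels_before = list(df.columns)
print(labels_before)          # still the integer labels
print(list(renamed.columns))  # first four labels replaced

# With inplace=True, the labels of df itself are replaced
df.rename(columns=col_dict, inplace=True)
print(list(df.columns))
```

Labels that are not keys of the dict (here the label 4) are kept as they are, which is why rename does not lose the remaining columns.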

For the first two solutions to work, I would store the copy returned by df_weather.iloc[:, :4] in a new variable before changing the copy's columns.

Small illustration (read if interested)

To see that df_weather.iloc[:, :4] returns a new copy each time it is called, I created a little illustration below using the built-in id function, which returns the "identity" of a given Python object.

# Without assigning the copy of the dataframe 
print(f"is_view?: {df_weather.iloc[:, [0, 1, 2, 3]]._is_view}")  # Not a view but a copy 
print(df_weather.iloc[:, :4].head())
print(f"\nid of object: {(id_before := id(df_weather.iloc[:, :4]))}")
df_weather.iloc[:, :4].columns = cols  # Also a new copy with a new `id` 
print(f"id of object: {(id_after := id(df_weather.iloc[:, :4]))}")
print(f"Are the objects the same?: {id_before == id_after}")
print("\n", df_weather.iloc[:, :4].head())  # Columns haven't changed

## Output:
# is_view?: False
#              0         1     2   3
# 0  AGE00135039  18630101  PRCP   0
# 1  ASN00021014  18630101  PRCP   0
# 2  ASN00023000  18630101  PRCP   0
# 3  ASN00026020  18630101  PRCP  20
# 4  ASN00026026  18630101  PRCP   0

# id of object: 140270804373408
# id of object: 140270804372064
# Are the objects the same?: False

#              0         1     2   3
# 0  AGE00135039  18630101  PRCP   0
# 1  ASN00021014  18630101  PRCP   0
# 2  ASN00023000  18630101  PRCP   0
# 3  ASN00026020  18630101  PRCP  20
# 4  ASN00026026  18630101  PRCP   0

# Assigning the copy of the dataframe to a variable
df_weather_subset = df_weather.iloc[:, :4]  # Assign copy to the variable `df_weather_subset`
print(df_weather_subset.head())
print(f"\nid of object: {(id_before := id(df_weather_subset))}")
df_weather_subset.columns = cols  # Changing the columns of the new variable with the copy assigned 
print(f"id of object: {(id_after := id(df_weather_subset))}")
print(f"Are the objects the same?: {id_before == id_after}")
print("\n", df_weather_subset.head())  # Columns changed 

## Output:
#              0         1     2   3
# 0  AGE00135039  18630101  PRCP   0
# 1  ASN00021014  18630101  PRCP   0
# 2  ASN00023000  18630101  PRCP   0
# 3  ASN00026020  18630101  PRCP  20
# 4  ASN00026026  18630101  PRCP   0

# id of object: 140270804370048
# id of object: 140270804370048
# Are the objects the same?: True

#         station  datetime obs_type  obs_value
# 0  AGE00135039  18630101     PRCP          0
# 1  ASN00021014  18630101     PRCP          0
# 2  ASN00023000  18630101     PRCP          0
# 3  ASN00026020  18630101     PRCP         20
# 4  ASN00026026  18630101     PRCP          0
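One caveat on the id comparison: Python may reuse the id of an object that has already been discarded, so two different temporary objects can happen to report the same id. A slightly more robust check, sketched here on a hypothetical toy frame `df` (not the real weather data), keeps both selections alive in variables and compares them with `is`:

```python
import pandas as pd

# Hypothetical stand-in for df_weather: one row, integer column labels 0..4
df = pd.DataFrame([["AGE00135039", 18630101, "PRCP", 0, None]])
cols = ["station", "datetime", "obs_type", "obs_value"]

# Keep both selections alive so their identities can be compared safely
first = df.iloc[:, :4]
second = df.iloc[:, :4]
print(first is second)  # False: each call builds a new dataframe object

# Renaming the columns of one selection affects only that object
first.columns = cols
print(list(first.columns))   # the new names
print(list(second.columns))  # still the integer labels
print(list(df.columns))      # original frame is unchanged as well
```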

Cheers Jonas

simull commented 2 years ago

Dear Jonas,

thank you for the detailed explanation. I did not expect my code to create a copy. I understand now that it is iloc that creates a copy of my dataframe, because without it, df.columns = [list of new columns] changes the original df.

The problem I am facing here is that I do not want to create a subset copy of the dataset, because then I lose the remaining columns whose names I do not want to change. They should, however, remain in the dataset.

I tried several workarounds, such as merging the subsetted df back into the original and concatenating the two, but I ran into other issues with those attempts that I could not solve either.

Best, Simon

isdsucph commented 2 years ago

Dear @simull

I think I understand your question now. If you only want to change the first 4 columns and keep the others, without using the rename method, I would do it as

cols = ['station', 'datetime', 'obs_type', 'obs_value']
df_weather.columns = cols + df_weather.columns[len(cols):].tolist()
print(df_weather.head())

## Output:
#        station  datetime obs_type  obs_value    4    5  6   7
# 0  AGE00135039  18630101     PRCP          0  NaN  NaN  E NaN
# 1  ASN00021014  18630101     PRCP          0  NaN  NaN  a NaN
# 2  ASN00023000  18630101     PRCP          0  NaN  NaN  a NaN
# 3  ASN00026020  18630101     PRCP         20  NaN  NaN  a NaN
# 4  ASN00026026  18630101     PRCP          0  NaN  NaN  a NaN
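For anyone who wants to try this without the weather data, here is a small sketch of the same idea on a hypothetical toy frame `df`, with the intermediate pieces spelled out (`remaining` and `new_labels` are names made up for this example):

```python
import pandas as pd

# Hypothetical stand-in for df_weather: one row, integer column labels 0..5
df = pd.DataFrame([["AGE00135039", 18630101, "PRCP", 0, None, "E"]])
cols = ["station", "datetime", "obs_type", "obs_value"]

# Labels of the columns that should keep their current names
remaining = df.columns[len(cols):].tolist()

# Full label list: new names first, untouched labels after them
new_labels = cols + remaining
assert len(new_labels) == df.shape[1]  # must match the number of columns

# Assigning to df.columns directly changes the original dataframe
df.columns = new_labels
print(list(df.columns))
```

The length check is there because assigning a list of the wrong length to df.columns raises an error.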

Cheers Jonas

simull commented 2 years ago

Hi Jonas,

thanks a lot. That was exactly what I meant.

Best, Simon