Hi @simull
Great question.
It all comes down to whether you are working with a view or a copy of the dataframe.
In this case, df_weather.iloc[:, :4]
returns a copy of the original dataframe.
In the line
df_weather.iloc[:, :4].columns = cols
you assign new columns to the copy of the dataframe.
However, you do not assign the copy to any variable, thus it gets lost in the ether. In particular, it is not stored in the variable df_weather.
Therefore, when you print the head of the dataframe df_weather
in the next line, the columns of this dataframe object are unchanged.
Your solution that works does not go through a temporary copy; it changes the column names on the original dataframe object using its rename
method.
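For readers without the original post at hand, here is a minimal sketch of what a rename-based approach could look like. The toy dataframe and the name `mapping` are my own assumptions for illustration, not the code from the question.

```python
import pandas as pd

# Toy stand-in for df_weather (assumption): integer column labels,
# as you get when reading a CSV with header=None
df = pd.DataFrame(
    [
        ["AGE00135039", 18630101, "PRCP", 0, "E"],
        ["ASN00021014", 18630101, "PRCP", 0, "a"],
    ]
)

cols = ["station", "datetime", "obs_type", "obs_value"]

# Map the first four existing labels to the new names.
# Columns that are not in the mapping keep their labels.
mapping = dict(zip(df.columns[: len(cols)], cols))

# rename returns a new dataframe, so the result is assigned back to df
df = df.rename(columns=mapping)

print(df.columns.tolist())
# ['station', 'datetime', 'obs_type', 'obs_value', 4]
```

The important part is that the result ends up bound to a name you keep using, either by assigning it back as above or by passing inplace=True.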
For the first two solutions to work, I would store the copy returned by df_weather.iloc[:, :4]
in a new variable before changing the copy's columns.
To see that df_weather.iloc[:, :4]
returns a new copy each time it is called, I created a little illustration below using the built-in id
function, which returns the "identity" of a given Python object.
# Without assigning the copy of the dataframe
print(f"is_view?: {df_weather.iloc[:, [0, 1, 2, 3]]._is_view}") # Not a view but a copy
print(df_weather.iloc[:, :4].head())
print(f"\nid of object: {(id_before := id(df_weather.iloc[:, :4]))}")
df_weather.iloc[:, :4].columns = cols # Also a new copy with a new `id`
print(f"id of object: {(id_after := id(df_weather.iloc[:, :4]))}")
print(f"Are the objects the same?: {id_before == id_after}")
print("\n", df_weather.iloc[:, :4].head()) # Columns haven't changed
## Output:
# is_view?: False
# 0 1 2 3
# 0 AGE00135039 18630101 PRCP 0
# 1 ASN00021014 18630101 PRCP 0
# 2 ASN00023000 18630101 PRCP 0
# 3 ASN00026020 18630101 PRCP 20
# 4 ASN00026026 18630101 PRCP 0
# id of object: 140270804373408
# id of object: 140270804372064
# Are the objects the same?: False
# 0 1 2 3
# 0 AGE00135039 18630101 PRCP 0
# 1 ASN00021014 18630101 PRCP 0
# 2 ASN00023000 18630101 PRCP 0
# 3 ASN00026020 18630101 PRCP 20
# 4 ASN00026026 18630101 PRCP 0
# Assigning the copy of the dataframe to a variable
df_weather_subset = df_weather.iloc[:, :4] # Assign copy to the variable `df_weather_subset`
print(df_weather_subset.head())
print(f"\nid of object: {(id_before := id(df_weather_subset))}")
df_weather_subset.columns = cols # Changing the columns of the new variable with the copy assigned
print(f"id of object: {(id_after := id(df_weather_subset))}")
print(f"Are the objects the same?: {id_before == id_after}")
print("\n", df_weather_subset.head()) # Columns changed
## Output:
# 0 1 2 3
# 0 AGE00135039 18630101 PRCP 0
# 1 ASN00021014 18630101 PRCP 0
# 2 ASN00023000 18630101 PRCP 0
# 3 ASN00026020 18630101 PRCP 20
# 4 ASN00026026 18630101 PRCP 0
# id of object: 140270804370048
# id of object: 140270804370048
# Are the objects the same?: True
# station datetime obs_type obs_value
# 0 AGE00135039 18630101 PRCP 0
# 1 ASN00021014 18630101 PRCP 0
# 2 ASN00023000 18630101 PRCP 0
# 3 ASN00026020 18630101 PRCP 20
# 4 ASN00026026 18630101 PRCP 0
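If you want to try this without the weather data, the same behaviour can be reproduced on a small made-up dataframe. This is only a sketch: the toy data and the names `first`, `second`, and `subset` are assumptions. It compares objects with `is` while both are still referenced, which avoids the case where a discarded temporary's id gets reused.

```python
import pandas as pd

# Small made-up frame standing in for df_weather (assumption)
df = pd.DataFrame(
    {0: ["AGE00135039", "ASN00021014"], 1: [18630101, 18630101],
     2: ["PRCP", "PRCP"], 3: [0, 0], 4: ["E", "a"]}
)
cols = ["station", "datetime", "obs_type", "obs_value"]

# Each .iloc call builds a fresh dataframe object
first = df.iloc[:, :4]
second = df.iloc[:, :4]
print(first is second)  # False

# The new labels land on a temporary object that is discarded immediately
df.iloc[:, :4].columns = cols
print(df.columns.tolist())  # [0, 1, 2, 3, 4]

# Keeping a reference to the subset makes the change visible on that subset
subset = df.iloc[:, :4]
subset.columns = cols
print(subset.columns.tolist())  # ['station', 'datetime', 'obs_type', 'obs_value']
print(df.columns.tolist())      # [0, 1, 2, 3, 4], the original is still untouched
```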
Cheers Jonas
Dear Jonas,
thank you for the detailed explanation. I did not expect to create a copy with my code. I understand now that it is the iloc
that creates a copy of my dataframe, because without it, df.columns = [list of new columns] changes the original df.
The problem I am facing here is that I do not want to create a subset copy of the dataset, because I would lose the remaining columns whose names I do not want to change. They should, however, remain in the dataset.
I tried several workarounds, such as merging the subsetted df back into the original and concatenating the two, but ran into other issues with those attempts that I could not solve either.
Best, Simon
Dear @simull
I think I understand your question now.
If you only want to change the first 4 columns and keep the others, without using the rename
method, I would do it as follows:
cols = ['station', 'datetime', 'obs_type', 'obs_value']
df_weather.columns = cols + df_weather.columns[len(cols):].tolist()
print(df_weather.head())
## Output:
# station datetime obs_type obs_value 4 5 6 7
# 0 AGE00135039 18630101 PRCP 0 NaN NaN E NaN
# 1 ASN00021014 18630101 PRCP 0 NaN NaN a NaN
# 2 ASN00023000 18630101 PRCP 0 NaN NaN a NaN
# 3 ASN00026020 18630101 PRCP 20 NaN NaN a NaN
# 4 ASN00026026 18630101 PRCP 0 NaN NaN a NaN
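If this comes up more than once, the same idea could be wrapped in a small helper. This is just a sketch under my own assumptions: `rename_leading_columns` is a hypothetical name, not a pandas function, and the one-row dataframe is made up.

```python
import pandas as pd


def rename_leading_columns(df, new_names):
    """Return a copy of df whose first len(new_names) columns are relabeled.

    Hypothetical helper for illustration; the remaining columns keep their labels.
    """
    new_names = list(new_names)
    if len(new_names) > df.shape[1]:
        raise ValueError("more new names than columns")
    out = df.copy()
    # New names for the leading columns, existing labels for the rest
    out.columns = new_names + df.columns[len(new_names):].tolist()
    return out


# Made-up one-row frame with integer column labels
df = pd.DataFrame([["AGE00135039", 18630101, "PRCP", 0, None, "E"]])
renamed = rename_leading_columns(df, ["station", "datetime", "obs_type", "obs_value"])

print(renamed.columns.tolist())
# ['station', 'datetime', 'obs_type', 'obs_value', 4, 5]
print(df.columns.tolist())
# [0, 1, 2, 3, 4, 5]
```

Returning a copy keeps the original untouched; if you prefer changing df_weather itself, the one-liner above does that directly.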
Cheers Jonas
Hi Jonas,
thanks a lot. That was exactly what I meant.
Best, Simon
Dear all,
can someone explain to me why the following code does not work? When I print the DataFrame's head, the columns have not been renamed. I found another way to do it that works, but I wonder how to do it with df.columns = cols.
Best, Simon
I also tried an alternative selection of the columns, the problem is the same, as expected:
The alternative that works: