isdsucph / isds2021

Introduction to Social Data Science 2021 - a summer school course https://isdsucph.github.io/isds2021/
MIT License

Dataframes and "view" #19

Closed johankll closed 3 years ago

johankll commented 3 years ago

I am confused about changes made in a view. In the assignment, it says: "If you work on a subset of data from another dataframe, then this dataframe is what is known as a view! Therefore, all changes made in the view will also be made in the original version.

In the example below, we try to change the dataframe df2 which is a view of df3, and we get a warning. Thus, changes to df3 also happen in df2. Notice that we can also use loc for changing the data."

I think this is slightly confusing... And perhaps the text contains some errors?

Anyway... When I execute an example to illustrate this, it does not seem to be the case that changes made to the subset (df3) affect the original version (df2). Below is a simple example that I made up, where you can see that the change affects df3 but not df2. Am I misunderstanding something? [image] [image]

johankll commented 3 years ago

Update: It seems that the change does happen in df2 if the > in cell "In [173]" is changed to a -.

I find this very confusing.

Also: I think the example in the assignment is broken, because we change the E column in df2 to a boolean variable, which causes a discrepancy between df2 and df3. As you can see from the screenshot below, this causes df2 to be unaffected by the change to df3. Unless I am doing something wrong... [image]

joachimkrasmussen commented 3 years ago

Hi Johan,

I'll start with the additional comment in the second post: It seems that you have changed df2['F'] = df2['A'] > df2['D'] to df2['E'] = df2['A'] > df2['D'], right? If you change this back, you will hopefully see that the assignment is not broken.

Regarding the rest of the question, the connection between the two dataframes has to do with the dtype. Consider the following example where I change the float values in column C to some other float values:

df3.loc[:,'C'] = 1.0
print(df3.head(4), '\n')
print(df2.head(4), '\n')

If you run this, you will see that both dataframes are changed. However, when you run:

df3.loc[:,'C'] = 1
print(df3.head(4), '\n')
print(df2.head(4), '\n')

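only df3 is changed. This is because the assignment changes the dtype of the column from float to integer, so the new values can no longer be written into the float data that df3 shares with df2 (which also answers your question about using - relative to >). It is a technicality that you can read more about in this post.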
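To make the dtype point a bit more concrete, here is a minimal sketch using plain NumPy arrays rather than the course dataframes. It is only an analogy for what happens underneath pandas, and the names base, view and as_int are made up for the illustration. A slice shares memory with the array it came from, so an in-place write of the same dtype shows up in both, whereas converting to another dtype has to allocate a new array that is no longer connected to the original.

```python
import numpy as np

# Hypothetical float array standing in for the data behind a dataframe.
base = np.array([[0.5, 1.5], [2.5, 3.5], [4.5, 5.5]])

# Basic slicing returns a view: no data is copied.
view = base[:2]
shares_view = np.shares_memory(base, view)

# Writing floats into the view is done in place, so base changes as well.
view[:, 1] = 1.0
base_changed = bool((base[:2, 1] == 1.0).all())

# Converting to integers cannot reuse the float buffer, so a new array is made.
as_int = view.astype(np.int64)
shares_int = np.shares_memory(base, as_int)

# Changes to the integer array therefore never reach base.
as_int[:, 1] = 7
base_untouched = bool((base[:2, 1] == 1.0).all())

print(shares_view, base_changed, shares_int, base_untouched)
```

Whether pandas itself writes in place or swaps in a new column depends on the pandas version, so treat this as an illustration of the mechanism and not as a guarantee about dataframe behaviour.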

Fundamentally, the take-away from this exercise is: Be careful when working with views.
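One way to be careful is to say explicitly which behaviour you want. The sketch below uses a small made-up dataframe (not the one from the assignment): calling .copy() on the subset gives an independent dataframe, and when the original is supposed to change, the assignment is done on the original with a single .loc call.

```python
import pandas as pd

# Hypothetical data, only for illustration.
df2 = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                    "C": [0.1, 0.2, 0.3, 0.4]})

# Explicit copy: df3 is independent of df2 from here on.
df3 = df2.loc[df2["A"] > 1.0].copy()
df3.loc[:, "C"] = 1.0

# The original is untouched by the change to the copy.
original_untouched = df2["C"].tolist() == [0.1, 0.2, 0.3, 0.4]

# If the original should change, assign on it directly in one .loc call.
df2.loc[df2["A"] > 3.0, "C"] = 9.0

print(df3)
print(df2)
```

With the explicit copy there is no ambiguity about whether df3 is a view, regardless of the dtype being assigned.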

Did this answer your question?

Best, Joachim

johankll commented 3 years ago

Hi Joachim,

Thank you for the explanation and example!