rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.28k stars 884 forks source link

[BUG] Writing `NA`s in a view results in zeros in the parent dataframe #8754

Open Nyrio opened 3 years ago

Nyrio commented 3 years ago

I don't know exactly if it's a bug or a feature, but I have noticed that we don't have parity with pandas in the following example because NAs are interpreted as zeros when writing in the parent dataframe:

import cudf
import pandas as pd

cudf0 = cudf.DataFrame([[1.], [2.]], columns=["A"])
cudf1 = cudf0[:1]
cudf1["A"] = None

print("cuDF (parent):")
print(cudf0, end='\n\n')
print("cuDF (view):")
print(cudf1, end='\n\n')

pd0 = pd.DataFrame([[1.], [2.]], columns=["A"])
pd1 = pd0[:1]
pd1["A"][0] = None

print("pandas (parent):")
print(pd0, end='\n\n')
print("pandas (view):")
print(pd1, end='\n\n')

Output:

cuDF (parent):
     A
0  0.0
1  2.0

cuDF (view):
      A
0  <NA>

pandas (parent):
     A
0  NaN
1  2.0

pandas (view):
    A
0 NaN

Note that if we use the value NaN instead of None (aka missing value / <NA>), it works like pandas.

shwina commented 3 years ago

Thanks for reporting. Definitely a bug and not a feature :-)

brandon-b-miller commented 2 years ago

Here is what is happening after some debugging. A column may or may not have nulls, and therefore may or may not have an associated mask. When we take a slice of a column, that view may or may not have an associated mask, depending on if the parent column did or did not have one. If the parent column did have a mask, it can be accessed through base_mask.

In this example, we're starting with a column that has no mask and assigning a null. The code detects that the slice does not possess a mask, and so creates a new one, which is then associated with the slice but not the parent mask. The data is written to the data buffer successfully, but the parent column still has no mask. This is why the problem goes away if the original column contains even one null - because then the child column can mutate base_mask and the result will propagate to the parent.

I am looking into a few solutions here but I am worried the result won't be pretty. More to come here.

shwina commented 2 years ago

Moving this issue to the next release, as it likely requires some non-trivial design changes that we won't be able to do well in time for this release.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr commented 4 months ago

The original snippet no longer works exactly as expected because None cannot be assigned directly to the value of a column and needs to be wrapped in a list (or as in the pandas assignment, assigned to a specific element). However, with that fix the underlying issue no longer persists, but it's because copy/view semantics are different and we don't modify in place at all (note that the parent is unchanged in the cudf case)

In [1]: import cudf                                                                                                                                                                                         
   ...: import pandas as pd                                                                                                                                                                                 
   ...:                                                                                                                                                                                                     
   ...: cudf0 = cudf.DataFrame([[1.], [2.]], columns=["A"])                       
   ...: cudf1 = cudf0[:1]                          
   ...: cudf1["A"][0] = None  
   ...:                                            
   ...: print("cuDF (parent):")                                                                                                                                                                             
   ...: print(cudf0, end='\n\n')                                                                      
   ...: print("cuDF (view):")                                                                                                                                                                               
   ...: print(cudf1, end='\n\n')                                                                      
   ...:             
   ...: pd0 = pd.DataFrame([[1.], [2.]], columns=["A"])    
   ...: pd1 = pd0[:1]                                                                                 
   ...: pd1["A"][0] = None                                                                            
   ...:                                                                                                                                                                                                     
   ...: print("pandas (parent):")
   ...: print(pd0, end='\n\n')                                                                        
   ...: print("pandas (view):")
   ...: print(pd1, end='\n\n')
cuDF (parent):
     A                                             
0  1.0        
1  2.0                                        

cuDF (view):                                                                                                                                                                                                
      A                                            
0  <NA>                                       

<ipython-input-1-6df2d8c7d8ef>:15: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to updat
e the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  pd1["A"][0] = None
<ipython-input-1-6df2d8c7d8ef>:15: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pd1["A"][0] = None
pandas (parent):
     A
0  NaN
1  2.0

pandas (view):
    A
0 NaN

With copy-on-write on, I see a different issue in cudf because it seems like we are modifying something that we shouldn't:

In [1]: import cudf
   ...: cudf.set_option("copy_on_write", True)
   ...: import pandas as pd
   ...: pd.options.mode.copy_on_write = True
   ...: 
   ...: cudf0 = cudf.DataFrame([[1.], [2.]], columns=["A"])
   ...: cudf1 = cudf0[:1]
   ...: cudf1["A"][0] = None
   ...: 
   ...: print("cuDF (parent):")
   ...: print(cudf0, end='\n\n')
   ...: print("cuDF (view):")
   ...: print(cudf1, end='\n\n')
   ...: 
   ...: pd0 = pd.DataFrame([[1.], [2.]], columns=["A"])
   ...: pd1 = pd0[:1]
   ...: pd1["A"][0] = None
   ...: 
   ...: print("pandas (parent):")
   ...: print(pd0, end='\n\n')
   ...: print("pandas (view):")
   ...: print(pd1, end='\n\n')
   ...: 
cuDF (parent):
     A
0  1.0
1  2.0

cuDF (view):
      A
0  <NA>

<ipython-input-1-6f655a450cfc>:17: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment.
When using the Copy-on-Write mode, such chained assignment never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.

Try using '.loc[row_indexer, col_indexer] = value' instead, to perform the assignment in a single step.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pd1["A"][0] = None
pandas (parent):
     A
0  1.0
1  2.0

pandas (view):
     A
0  1.0

Note that the so-called "cuDF view" contains NA, which I believe is unexpected here. With copy-on-write on, cudf1["A"][0] = None should have meant that cudf1["A"] is copied the moment element 0 is assigned to. If I'm reading this right, it indicates a bug in cudf's copy-on-write implementation that we should address.

CC @galipremsagar

galipremsagar commented 4 months ago

Yes, I think it is a bug. I can take a look at it for next release (24.08). Can you create an issue for it and assign it to me, Vyas?

vyasr commented 4 months ago

Any reason that we can't just use this issue? I've assigned it to you now.