rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] `to_gpu_array()` is used all over where a copy of data is not needed #1606

Closed kkraus14 closed 4 years ago

kkraus14 commented 5 years ago

Instead of accessing the underlying memory by reference, there are a ton of places where to_gpu_array() is called, which always returns a copy of the data. We should go through and update these call sites to use this function only when a copy is actually needed, and use a reference to the memory otherwise.

99% sure this is the reason for #1602

shwina commented 5 years ago

It looks like to_gpu_array() now returns a view:

In [50]: a = cudf.Series([1, 2, 3])

In [51]: copy = a.data.to_gpu_array()

In [52]: copy[0] = 5

In [53]: print(a)
0    5
1    2
2    3
dtype: int64

dantegd commented 5 years ago

@shwina it seems that the behavior of to_gpu_array is actually slightly obscure (I had to follow the full layers of code a couple of times) and inconsistent. The issue is exemplified in the following code snippet, taken from to_dense_buffer (which to_gpu_array in column.py calls):

    if self.null_count > 0:
        if fillna == 'pandas':
            na_value = self.default_na_value()
            # fill nan
            return self.fillna(na_value)
        else:
            return self._copy_to_dense_buffer()
    else:
        # return a reference for performance reasons, should refactor code
        # to explicitly use mem in the future
        return self.data

As you can see, if there are missing values it returns a copy; otherwise it returns a reference. This inconsistent behavior can of course be problematic (getting unwanted copies as Keith mentioned, or wanting a reference and silently getting a copy).
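To make the inconsistency concrete, here is a rough illustration of the two code paths (a sketch based on the snippet above; it mirrors the null/no-null demonstration shwina posts further down):

    import cudf

    s = cudf.Series([1, 2, 3])            # no nulls: the reference branch
    arr = s._column.to_gpu_array()
    arr[0] = 99                           # silently mutates s as well

    t = cudf.Series([1, None, 3])         # has nulls: the copy branch
    arr2 = t._column.to_gpu_array()
    arr2[0] = 99                          # t is unchanged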

It seems to me that a good solution would be adding a parameter (copy?) to to_gpu_array, to get a copy when True and a reference when False. Alternatively, make the function always return a copy and access data.mem when a reference is needed. @kkraus14 what do you think? Knowing which solution cuDF is going to take would help my work on the cuML cython cleanup in https://github.com/rapidsai/cuml/pull/612
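A minimal sketch of what that parameter could look like (hypothetical signature, not the actual cudf API; _copy_to_dense_buffer and self.data come from the snippet above):

    def to_gpu_array(self, copy=True):
        # hypothetical sketch of the proposal, not real cudf code
        if copy:
            # always materialize a fresh dense buffer
            return self._copy_to_dense_buffer()
        # caller explicitly asked for a zero-copy reference
        return self.data.mem[:self.size]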

shwina commented 5 years ago

I'd be in favour of this function always returning a copy of the underlying data, and using mem when a view is needed. Happy to put in a PR addressing this.
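In code, that convention would look roughly like this (illustrative only):

    arr = s.to_gpu_array()         # always a fresh copy under this proposal
    view = s._column.data.mem      # explicit zero-copy reference when needed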

dantegd commented 5 years ago

@shwina I agree, though with the caveat that being able to access mem (or an equivalent if the column redesign moves away from numba too much) from external libraries (like cuML) should always be possible. We hate unnecessary copies :)

shwina commented 5 years ago

What about exposing the data underlying the Series as a cuda array interface?

The advantage of this approach is that, should we change the type backing cudf columns from device NDArrays to something else, the API can remain the same.
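For context, the CUDA Array Interface (defined by Numba) is just a dict of metadata describing device memory, so any conforming library can wrap the data without a copy. Roughly (the pointer value here is made up):

    s.__cuda_array_interface__
    # {'shape': (3,),
    #  'strides': None,
    #  'typestr': '<i8',
    #  'data': (139738368641024, False),   # (device pointer, read-only flag)
    #  'version': 1}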

@kkraus14 @thomcom Thoughts?

millerhooks commented 4 years ago

I'm getting back into the pythons and am trying to look at this issue critically. It touches enough code that I'm taking a TDD approach. The last comment from @shwina, which @kkraus14 liked, suggests that we should maybe expose the data underlying the Series as a CUDA array interface. This sounds like the right way to go, but I'm hesitant to take a single thumbs-up as consensus. Are we all cool with that strategy?

It seems like we have a few return behaviors that other libs depend on. Does returning a CUDA array interface cover all the test cases? Can anyone give me function references in the other libs that I need to verify via test case?

Sorry if I'm being obtuse and missing the obvious. I'm still trying to get my sea legs on the project, and my favorite way to do that is by asking questions that are probably stupid and wrong. I have a direct example above, but I would love to know of a few specific places in the other libs that depend on this where the behavior is inconsistent or failing, especially if there are try/excepts detecting that failure or massaging it into the desired result.

I have been trying to figure out the priority and dependency graph of the other libs. As far as I understand it, custrings has at this point been merged into cudf. I'm going to search for this function in the other libs and start from there, but any feedback or favorite examples of this inconsistency in those libs would be appreciated.

It would help me very much if anyone can show me two comparative uses elsewhere where this inconsistency affects us, and point me at why. I can work directly on the conversion to the CUDA array interface, but since this affects downstream libraries, I'd love to set up a test covering a couple of things that were a problem in the past outside of cudf, or even just a few example lines.

Also, if I'm missing any other external deps that would be useful for the solution's test suite, like BlazingSQL or similar, or even internal libs (I always feel like I don't know where all of them are), that would be super helpful.

shwina commented 4 years ago

Just to clarify my comment above, the suggestion was not for mem to return a __cuda_array_interface__, but rather that we should expose a __cuda_array_interface__ attribute so that other libraries can work with the data underlying cudf objects in a copy-free manner.

Since then, we have done just that:

https://github.com/rapidsai/cudf/pull/2464

Now libraries (like cuML) can work with cudf objects without copies, and without having to use either .mem or .to_gpu_array(). __cuda_array_interface__ should be the preferred way for GPU libraries to interoperate.
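For example, Numba can wrap a Series through that attribute with no copy (a sketch assuming a null-free Series, since the interface exposes the raw data buffer):

    import cudf
    from numba import cuda

    s = cudf.Series([1, 2, 3])
    dview = cuda.as_cuda_array(s)   # consumes s.__cuda_array_interface__
    dview[0] = 99                   # writes through to the Series' memory
    print(s)                        # element 0 is now 99: no copy was made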

From my perspective, this brings back the scope of this issue to just replacing the use of .to_gpu_array() with .mem internally within cudf, wherever possible.

thomcom commented 4 years ago

It looks like .to_gpu_array() is just an alias for buffer.mem[: self.size] in all cases - Series, Index, Column, and Buffer.

shwina commented 4 years ago

The behaviour is different when there are nulls:

In [16]: a = cudf.Series([1, 2, 3, None, 5])

In [17]: a._column.data.mem.copy_to_host()
Out[17]: array([1, 2, 3, 0, 5])

In [18]: a._column.to_gpu_array().copy_to_host()
Out[18]: array([1, 2, 3, 5])

kkraus14 commented 4 years ago

Fixed by #3012