[FEA] Make `cudf.pandas` not perform redundant CPU<->GPU transfers if there is no in-place write operations

galipremsagar commented 4 months ago

Is your feature request related to a problem? Please describe.

In cudf.pandas, we currently move entire dataframes from CPU to GPU, or vice versa, whenever a step needs them on the other device. We can avoid most of these transfers by keeping a copy of the dataframe in both memories and paying for a CPU<->GPU transfer only when an in-place operation has modified one of the copies.


In [1]: %load_ext cudf.pandas

In [2]: import pandas as pd

In [3]: df = pd.read_parquet(
   ...:     "nyc_parking_violations_2022.parquet",
   ...:     columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
   ...: )

In [4]: %time df.count(axis=0)
CPU times: user 1.41 ms, sys: 4.35 ms, total: 5.75 ms
Wall time: 5.15 ms
Out[4]: 
Registration State       15435607
Violation Description    15435607
Vehicle Body Type        15435607
Issue Date               15435607
Summons Number           15435607
dtype: int64

In [5]: %time df.count(axis=1)
CPU times: user 15.7 s, sys: 1.85 s, total: 17.5 s
Wall time: 16.8 s
Out[5]: 
0           5
1           5
2           5
3           5
4           5
           ..
15435602    5
15435603    5
15435604    5
15435605    5
15435606    5
Length: 15435607, dtype: int64

In [6]: %time df.count(axis=0)
CPU times: user 24 s, sys: 2.43 s, total: 26.4 s
Wall time: 25.3 s
Out[6]: 
Registration State       15435607
Violation Description    15435607
Vehicle Body Type        15435607
Issue Date               15435607
Summons Number           15435607
dtype: int64

In [7]: %time df.count(axis=0)
CPU times: user 0 ns, sys: 3.08 ms, total: 3.08 ms
Wall time: 2.75 ms
Out[7]: 
Registration State       15435607
Violation Description    15435607
Vehicle Body Type        15435607
Issue Date               15435607
Summons Number           15435607
dtype: int64

Notice that df.count(axis=0) in cell 6 takes 25.3 s of wall time (versus 5.15 ms in cell 4 and 2.75 ms in cell 7), most of which is spent moving the dataframe from CPU to GPU. We can avoid this.

Describe the solution you'd like

Maintain two identical copies of the dataframe: one on the GPU and one on the CPU.
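To make the request concrete, here is a minimal sketch of the bookkeeping it implies. This is not how cudf.pandas is implemented: `MirroredFrame`, `read`, and `write_inplace` are hypothetical names, and plain Python lists stand in for the pandas and cudf objects. The idea is that a read-only operation never invalidates either copy, so a transfer is only needed after an in-place write has made one side stale.

```python
import copy


class MirroredFrame:
    """Toy stand-in for a proxy that keeps a CPU and a GPU copy of the same data.

    Both "copies" are plain Python lists; a real proxy would hold a pandas
    object and a cudf object. All names here are hypothetical.
    """

    def __init__(self, data):
        # Assume the data starts out on the GPU only.
        self._copies = {"gpu": list(data), "cpu": None}
        self.transfers = 0  # counts simulated CPU<->GPU copies

    @staticmethod
    def _other(device):
        return "cpu" if device == "gpu" else "gpu"

    def _get(self, device):
        # Transfer only if the requested side has no valid copy.
        if self._copies[device] is None:
            self._copies[device] = copy.deepcopy(self._copies[self._other(device)])
            self.transfers += 1
        return self._copies[device]

    def read(self, device, func):
        """Run a read-only operation; both copies stay valid."""
        return func(self._get(device))

    def write_inplace(self, device, func):
        """Run an in-place write on one copy and mark the other one stale."""
        func(self._get(device))
        self._copies[self._other(device)] = None


frame = MirroredFrame([1, 2, 3])
frame.read("gpu", len)   # no transfer: the GPU copy is already valid
frame.read("cpu", sum)   # one transfer, GPU -> CPU (think: CPU fallback)
frame.read("gpu", len)   # no transfer: the GPU copy was never invalidated
transfers_after_reads = frame.transfers

frame.write_inplace("cpu", lambda d: d.append(4))  # the GPU copy is now stale
total_after_write = frame.read("gpu", sum)         # second transfer, CPU -> GPU
```

In this toy model, the pattern from cells 4 to 6 above (GPU read, CPU fallback, GPU read) costs a single transfer instead of two, at the price of holding the data in both memories.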

Matt711 commented 1 month ago

We could hide this behind a new mode, option, or env var ("synchronized-memory-mode", say) for the user. There would be new methods for the proxy object:

    def _sync_gpu(self)  # in-place write op on cpu
    def _sync_cpu(self)  # in-place write op on gpu

And a job queue. job_queue = [op1, op2, ...]

Call _sync_cpu in a separate process. Do operations on the GPU and insert them into job_queue. When there's a fallback, wait until _sync_cpu finishes (i.e., until job_queue is empty), and then do the operation on the CPU. Then call _sync_gpu to do the CPU-->GPU transfer.

cc. @galipremsagar

Matt711 commented 1 month ago

The job queue is filled only with in-place write operations, and the steps I described above apply only to those. For non-in-place operations, there are no CPU<-->GPU memory transfers; if there's a fallback, the operation is simply tried on the CPU copy.
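One possible reading of the scheme in the two comments above, as a rough sketch under stated assumptions: `SyncedProxy`, `gpu_inplace`, and `cpu_fallback` are hypothetical names, plain lists stand in for the two copies, and a background thread stands in for the separate process. Here `_sync_cpu` is assumed to replay queued in-place operations on the CPU copy, and `_sync_gpu` is modelled as a plain copy. Neither detail is specified in the thread.

```python
import queue
import threading


class SyncedProxy:
    """Toy proxy that keeps CPU and GPU copies in sync via a job queue."""

    def __init__(self, data):
        self._gpu = list(data)  # stand-in for the cudf object
        self._cpu = list(data)  # stand-in for the pandas object
        self._job_queue = queue.Queue()  # holds in-place write ops only
        # Assumption: a daemon thread replaces the "separate process".
        threading.Thread(target=self._sync_cpu, daemon=True).start()

    def _sync_cpu(self):
        # Background worker: replay queued in-place writes on the CPU copy.
        while True:
            op = self._job_queue.get()
            op(self._cpu)
            self._job_queue.task_done()

    def _sync_gpu(self):
        # Stand-in for the CPU-->GPU transfer after an in-place write on the CPU.
        self._gpu = list(self._cpu)

    def gpu_inplace(self, op):
        """Apply an in-place write on the GPU copy and queue it for the CPU."""
        op(self._gpu)
        self._job_queue.put(op)

    def cpu_fallback(self, op, inplace=False):
        """Run an operation on the CPU copy once it has caught up."""
        self._job_queue.join()  # wait until every queued write was replayed
        result = op(self._cpu)
        if inplace:
            self._sync_gpu()  # only in-place writes need a transfer back
        return result


proxy = SyncedProxy([3, 1, 2])
proxy.gpu_inplace(lambda d: d.sort())
proxy.gpu_inplace(lambda d: d.append(4))

cpu_total = proxy.cpu_fallback(sum)  # read-only fallback: no transfer back
proxy.cpu_fallback(lambda d: d.reverse(), inplace=True)
gpu_state = list(proxy._gpu)
```

Replaying operations doubles the compute work but avoids moving the whole frame. A real implementation would also have to decide what to do when an operation cannot be replayed on the other device.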

rapidsai / cudf

[FEA] Make `cudf.pandas` not perform redundant CPU<->GPU transfers if there is no in-place write operations #15670