rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Groupby on NaNs takes a long time to complete #1329

Closed: trxcllnt closed this issue 4 years ago

trxcllnt commented 5 years ago

Describe the bug
Groupby on a column of all NaNs takes a long time to complete.

Steps/Code to reproduce bug

import numpy as np
from cudf.utils.cudautils import arange
from cudf.dataframe import DataFrame, Series

ids = arange(0, 1000000, dtype=np.int32)
nans = np.full(1000000, np.nan, dtype=np.float64)  # a column of all NaNs
nans = Series(nans, nan_as_null=False)  # keep NaN as a value, not null

# assumed construction of df (the original snippet timed a groupby on df
# without defining it; 'value' holds the all-NaN column)
df = DataFrame()
df['id'] = ids
df['value'] = nans

%time df.groupby(by='value', as_index=False).count()
###
# CPU times: user 57.8 s, sys: 39.2 s, total: 1min 36s
# Wall time: 1min 36s
###

Expected behavior
Groupby completes in a reasonable amount of time.


lmeyerov commented 5 years ago

Guessing this is pretty pervasive -- real data has NaNs, whether as a small share of values (say 5%) or a mostly sparse column (95% missing) -- so an unusually high cost here hurts.

Is this all value types, or just floats?

kkraus14 commented 5 years ago

@lmeyerov this is specific to floats when using NaN as a discrete value as opposed to null.
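For illustration, a minimal sketch of that distinction using cudf's nan_as_null flag (assuming a current cudf build; the behavior shown is the documented default):

import numpy as np
import cudf

# By default cudf converts NaN to null on construction...
s_null = cudf.Series([1.0, np.nan])                    # [1.0, <NA>]
# ...while nan_as_null=False keeps NaN as a discrete float value.
s_nan = cudf.Series([1.0, np.nan], nan_as_null=False)  # [1.0, NaN]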

kkraus14 commented 5 years ago

@jrhemstad I imagine this has to do with hash table building and/or probing with regard to NaN values. What are your thoughts?

jrhemstad commented 5 years ago

@trxcllnt does your above example produce a correct result according to Pandas semantics? i.e., does the count equal the size of the column?

Off the top of my head, I wouldn't be surprised if this didn't work at all, since NaN != NaN.
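For reference, pandas drops NaN group keys by default, so the count over an all-NaN key column comes back empty (a quick check; later pandas versions added a dropna=False option to keep NaN keys):

import numpy as np
import pandas as pd

pdf = pd.DataFrame({'value': np.full(4, np.nan), 'id': range(4)})
print(pdf.groupby(by='value', as_index=False).count())
# Empty DataFrame -- NaN keys are excluded from the groups.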

What I believe is happening here is the worst possible case for collisions and chaining w/ an open addressing hash table.

When we attempt to insert NaNs into a hash table, all NaNs will hash to the same location, but because NaN != NaN, it will be treated as a hash collision. With open addressing, when a collision occurs, it will attempt to insert into the next hash table location.

In this way, if you attempt to insert one million NaNs, threads will be stepping all over each other to insert into a long chain of hash table entries. The result is O(n^2) complexity. This will only occur for the pathological case of a column comprised entirely of NaNs. A column comprised entirely of any other single value will still be slow (due to contention updating the same hash table location), but will be O(n) rather than O(n^2).
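A minimal pure-Python sketch of that pathology (a toy open-addressing table with linear probing, not the libcudf implementation): because NaN != NaN, an occupied slot holding NaN never compares equal to an incoming NaN key, so the i-th insert probes past i occupied slots and total work grows quadratically.

import math

def toy_insert(table, key):
    # all NaNs hash to the same bucket (forced to 0 here for the demo)
    i = 0 if key != key else hash(key) % len(table)
    probes = 0
    while table[i] is not None:
        if table[i] == key:          # NaN == NaN is False, so this
            return i, probes         # never matches for NaN keys
        i = (i + 1) % len(table)     # linear probing: try the next slot
        probes += 1
    table[i] = key
    return i, probes

table = [None] * 4096
total = sum(toy_insert(table, math.nan)[1] for _ in range(2000))
print(total)   # 1999000 == 2000*1999/2 probes: O(n^2) in the key count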

What I can think of right now is that there are two options:

  1. Add a check for NaN keys by performing the check if(key != key), which is only true for NaN values. If a key is NaN, treat NaN == NaN (see the sketch after this list).
  2. Require that NaNs be treated as null values. (which presupposes that groupby has null value support :smile: )
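Option 1 boils down to a NaN-aware key comparison; a rough sketch of the idea in Python (a hypothetical helper, not libcudf code):

def keys_equal(lhs, rhs):
    # key != key is true only for NaN, so this treats NaN == NaN
    # while leaving every other comparison unchanged
    if lhs != lhs and rhs != rhs:
        return True
    return lhs == rhs

With that comparator, the first NaN inserted occupies one slot and every later NaN matches it immediately, so probing stays O(1) per key.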
trxcllnt commented 5 years ago

@jrhemstad not sure what the pandas behavior is here, but I'd imagine this is common enough that it does the right thing. I just tried running our Malware Hypergraph notebook, and it seems NaNs with some valid values interspersed have actually locked up my process: [screenshot]

lmeyerov commented 5 years ago

This problem may be broader & worse than initially reported -- it's impacting non-NaNs too. A boring DF w/ categoricals took 60s+ on just 500K rows.

See reproduction #3 in https://colab.research.google.com/drive/1S5UhplicI4gpEIJz0jQdhGd3gyBAYbee#scrollTo=vpXDxGpAeysg .

jrhemstad commented 5 years ago

> This problem may be broader & worse than initially reported -- it's impacting non-NaNs too. A boring DF w/ categoricals took 60s+ on just 500K rows.
>
> See reproduction #3 in https://colab.research.google.com/drive/1S5UhplicI4gpEIJz0jQdhGd3gyBAYbee#scrollTo=vpXDxGpAeysg .

Pretty confident the categorical issue is orthogonal. As far as the C++ implementation is concerned, a categorical column is no different from an int32 column. Therefore, I suspect something wonky is going on in the Python layer.

thomcom commented 5 years ago

This problem and the problem in #1685 should be separate issues. MultiIndexing, added in 0.7 on May 2, 2019, introduced a major performance regression in groupby. I have an improvement for that in https://github.com/rapidsai/cudf/pull/1689.

jrhemstad commented 4 years ago

This was resolved by specializing the row equality comparator in libcudf to treat NaN == NaN.
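In terms of the toy sketch above, the fix corresponds to swapping the raw == comparison for a NaN-aware one, after which every NaN lands in a single group and each insert is O(1) (a sketch of the effect, not the actual libcudf change):

import math

def keys_equal(lhs, rhs):
    # treat NaN == NaN; everything else compares normally
    return (lhs != lhs and rhs != rhs) or lhs == rhs

def toy_insert(table, key):
    i = 0 if key != key else hash(key) % len(table)
    while table[i] is not None and not keys_equal(table[i], key):
        i = (i + 1) % len(table)
    table[i] = key
    return i

table = [None] * 4096
slots = {toy_insert(table, math.nan) for _ in range(2000)}
print(slots)   # {0}: all 2000 NaNs land in the same slot, i.e. one group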