rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] String based columns typed as category get hashed with no consistency when using dask_cudf and RMM, leading to later failures #4566

Closed taureandyernv closed 4 years ago

taureandyernv commented 4 years ago

Describe the bug When merging two dataframes on a string column typed as category while using dask_cudf and RMM, the column is hashed as int32 and the hashes are not consistent, so the merge, which succeeds, results in an empty dataframe.

This happens on the 0.13 nightlies from 3/17. Plain cudf works, and dask_cudf without RMM works. I am having trouble replicating this with a smaller dataset.

Steps/Code to reproduce bug This code sets up the environment, starts a client with RMM, reads the data using dask, and prints the results, where you will see that the strings have been hashed. It then attempts a merge on the hashed column, which produces a 0-row dataframe, although it should produce a 115492-row dataframe.

Please follow the directions here: https://docs.rapids.ai/datasets/mortgage-data and download the data from: http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz

%env NCCL_P2P_DISABLE=1 # Necessary for NCCL < 2.4
import dask_xgboost as dxgb_gpu
import dask
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.delayed import delayed
from dask.distributed import Client, wait

import cudf

import pynvml
import numpy as np
import xgboost as xgb

from collections import OrderedDict
import gc
from glob import glob
import os

cluster = LocalCUDACluster(n_workers=2, threads_per_worker=1)  # Please change your n_workers amount to the number of GPUs you have
print(cluster)
client = Client(cluster)
client

def initialize_rmm_pool():
    import rmm

    rmm.reinitialize(pool_allocator=True)

def initialize_rmm_no_pool():
    import rmm

    rmm.reinitialize(pool_allocator=False)

client.run(initialize_rmm_pool)

n_partitions = 10

### Bring in Names data

cols = [
        'seller_name', 'new'
]

dtypes = OrderedDict([
        ("seller_name", "category"),
        ("new", "category"),
])
path = "/path/to/data/"  # placeholder: set this to wherever you downloaded your data
col_names_path = path + "names.csv"
ncdf = dask_cudf.read_csv(col_names_path, names=cols, delimiter='|', dtype=list(dtypes.values()), header= True, npartitions = n_partitions)

### Bring in Acquisitions
cols = [
        'loan_id', 'orig_channel', 'seller_name', 'orig_interest_rate', 'orig_upb', 'orig_loan_term', 
        'orig_date', 'first_pay_date', 'orig_ltv', 'orig_cltv', 'num_borrowers', 'dti', 'borrower_credit_score', 
        'first_home_buyer', 'loan_purpose', 'property_type', 'num_units', 'occupancy_status', 'property_state',
        'zip', 'mortgage_insurance_percent', 'product_type', 'coborrow_credit_score', 'mortgage_insurance_type', 
        'relocation_mortgage_indicator'
]

dtypes = OrderedDict([
        ("loan_id", "int64"),
        ("orig_channel", "category"),
        ("seller_name", "category"),
        ("orig_interest_rate", "float64"),
        ("orig_upb", "int64"),
        ("orig_loan_term", "int64"),
        ("orig_date", "date"),
        ("first_pay_date", "date"),
        ("orig_ltv", "float64"),
        ("orig_cltv", "float64"),
        ("num_borrowers", "float64"),
        ("dti", "float64"),
        ("borrower_credit_score", "float64"),
        ("first_home_buyer", "category"),
        ("loan_purpose", "category"),
        ("property_type", "category"),
        ("num_units", "int64"),
        ("occupancy_status", "category"),
        ("property_state", "category"),
        ("zip", "int64"),
        ("mortgage_insurance_percent", "float64"),
        ("product_type", "category"),
        ("coborrow_credit_score", "float64"),
        ("mortgage_insurance_type", "float64"),
        ("relocation_mortgage_indicator", "category")
])
path = "/path/to/data/"  # placeholder: set this to wherever you downloaded your data
acquisition_path= path +"acq/Acquisition_2000Q1.txt"
acdf = dask_cudf.read_csv(acquisition_path, names=cols, delimiter='|', dtype=list(dtypes.values()), header= True, npartitions = n_partitions)

print(acdf['seller_name'].head())
print(ncdf['seller_name'].head())

gcdf = acdf.merge(ncdf, on=['seller_name']) 
gcdf.compute() ### You will get 0 rows

Output: 0 rows × 26 columns. You'll notice that the data is hashed, but there is no consistency in the hashing, so there is nothing to merge on.

env: NCCL_P2P_DISABLE=1 # Necessary for NCCL < 2.4
LocalCUDACluster('...) # redacted 

0     976527792
1    1240500859
2      10796916
3     976527792
4    1240500859
Name: seller_name, dtype: int32

0      727548351
1      257623342
2    -1883912064
3    -1452148248
4     -536875386
Name: seller_name, dtype: int32

  | loan_id | orig_channel | seller_name | orig_interest_rate | orig_upb | orig_loan_term | orig_date | first_pay_date | orig_ltv | orig_cltv | ... | num_units | occupancy_status | property_state | zip | mortgage_insurance_percent | product_type | coborrow_credit_score | mortgage_insurance_type | relocation_mortgage_indicator | new
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --

0 rows × 26 columns
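To make the symptom concrete, here is a toy model in plain Python. It is not cuDF or dask_cudf code and makes no claim about the actual root cause. It assumes only that each table turns the same strings into int32 values independently. The `encode` and `inner_join` helpers and the per-table `salt` are hypothetical names made up for this illustration.

```python
import hashlib

def encode(values, salt):
    """Map each string to a signed 32-bit integer that depends on a per-table salt."""
    codes = []
    for v in values:
        digest = hashlib.sha256((salt + v).encode("utf-8")).digest()
        codes.append(int.from_bytes(digest[:4], "big", signed=True))
    return codes

def inner_join(left_keys, right_keys):
    """Return the left keys that also appear on the right (a key-only inner join)."""
    right_set = set(right_keys)
    return [k for k in left_keys if k in right_set]

left_names = ["OTHER", "AMTRUST BANK", "OTHER", "CITIMORTGAGE, INC."]
right_names = ["AMTRUST BANK", "CITIMORTGAGE, INC.", "OTHER"]

# Each table is encoded on its own, so equal strings get unrelated integers.
left_codes = encode(left_names, salt="left-table:")
right_codes = encode(right_names, salt="right-table:")

rows_on_codes = inner_join(left_codes, right_codes)
rows_on_strings = inner_join(left_names, right_names)

print(len(rows_on_codes), len(rows_on_strings))
```

Joining on the integer values finds no common keys, while joining on the original strings matches every left row. That is the same shape as the 0-row result above.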

Expected behavior The output should look like what we get if we change the seller_name column dtype to str, or if we use cudf to read the dataset.

0                                  AMTRUST BANK
1                         BANK OF AMERICA, N.A.
2       BISHOPS GATE RESIDENTIAL MORTGAGE TRUST
3                            CITIMORTGAGE, INC.
4     FIRST TENNESSEE BANK NATIONAL ASSOCIATION
Name: seller_name, dtype: category

0                          ACADEMY MORTGAGE CORPORATION
1                                             ALLY BANK
2                        AMERIHOME MORTGAGE COMPANY|LLC
3                        AMERISAVE MORTGAGE CORPORATION
4                                          AMTRUST BANK
Name: seller_name, Length: 79, dtype: category

Then, when you merge, you will get 115492 rows × 26 columns:


loan_id | orig_channel | seller_name | orig_interest_rate | orig_upb | orig_loan_term | orig_date | first_pay_date | orig_ltv | orig_cltv | ... | num_units | occupancy_status | property_state | zip | mortgage_insurance_percent | product_type | coborrow_credit_score | mortgage_insurance_type | relocation_mortgage_indicator | new
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
161380012512 | 1783578239 | OTHER | 7.750 | 162000 | 360 | 1999-12-01 | 2000-02-01 | 80.0 | null | ... | 1 | 1461515023 | -996096470 | 553 | null | -102570089 | 667.0 | null | 2313200 | 1165870684
161385671653 | 1783578239 | OTHER | 7.875 | 112000 | 360 | 2000-01-01 | 2000-03-01 | 80.0 | null | ... | 1 | 1461515023 | 581943325 | 390 | null | -102570089 | 622.0 | null | 2313200 | 1165870684
161397125429 | -1986976006 | OTHER | 8.500 | 125000 | 360 | 2000-02-01 | 2000-04-01 | 95.0 | null | ... | 1 | 1461515023 | 805934690 | 890 | 30.0 | -102570089 | null | 1.0 | 2313200 | 1165870684
161416054213 | 1485901853 | OTHER | 8.250 | 109000 | 360 | 1999-12-01 | 2000-02-01 | 83.0 | null | ... | 1 | 1461515023 | -1513175559 | 748 | 12.0 | -102570089 | 611.0 | 1.0 | 2313200 | 1165870684
161418538714 | 1485901853 | OTHER | 8.000 | 101000 | 360 | 1999-12-01 | 2000-03-01 | 90.0 | null | ... | 1 | 1461515023 | 396261891 | 301 | 25.0 | -102570089 | null | 1.0 | 2313200 | 1165870684
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
993373637564 | 1485901853 | OTHER | 8.625 | 106000 | 360 | 2000-02-01 | 2000-04-01 | 90.0 | null | ... | 1 | 1461515023 | 1370193142 | 335 | null | -102570089 | null | null | 2313200 | 1165870684
993376917661 | 1485901853 | SUNTRUST MORTGAGE INC. | 7.875 | 95000 | 360 | 1999-10-01 | 1999-12-01 | 90.0 | null | ... | 1 | 1461515023 | 1370193142 | 321 | 25.0 | -102570089 | null | 1.0 | 2313200 | -118088424
993382388361 | -1986976006 | OTHER | 7.875 | 140000 | 360 | 1999-12-01 | 2000-02-01 | 65.0 | null | ... | 1 | 1461515023 | 396261891 | 302 | null | -102570089 | 777.0 | null | 2313200 | 1165870684
993403451980 | 1485901853 | FIRST TENNESSEE BANK NATIONAL ASSOCIATION | 8.500 | 252000 | 360 | 2000-02-01 | 2000-04-01 | 75.0 | null | ... | 1 | 1461515023 | 278806778 | 222 | null | -102570089 | 698.0 | null | 2313200 | 1165870684
993414780181 | 1783578239 | OTHER | 7.625 | 56000 | 180 | 2000-01-01 | 2000-03-01 | 31.0 | null | ... | 1 | 1461515023 | -773461799 | 403 | null | -102570089 | 762.0 | null | 2313200 | 1165870684

115492 rows × 26 columns
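The report says that reading seller_name as str gives the expected output. A minimal sketch of that dtype change, assuming the same `dtypes` mapping used for names.csv in the reproduction, could look like the following. The `with_string_keys` helper is a hypothetical name introduced here, not part of cudf or dask_cudf.

```python
from collections import OrderedDict

# Column dtypes as declared for names.csv in the reproduction.
dtypes = OrderedDict([
    ("seller_name", "category"),
    ("new", "category"),
])

def with_string_keys(dtypes, key_columns):
    """Return a copy of `dtypes` in which the given merge-key columns are read as str."""
    out = OrderedDict(dtypes)
    for col in key_columns:
        if col not in out:
            raise KeyError(f"unknown column: {col}")
        out[col] = "str"
    return out

names_dtypes = with_string_keys(dtypes, ["seller_name"])

# This list would take the place of list(dtypes.values()) in the read_csv call.
print(list(names_dtypes.values()))
```

The same override would have to be applied to the acquisition dtypes so that both sides of the merge carry seller_name as a string.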

Environment overview Environment location: Bare-metal. Method of cuDF install: conda.

Environment details

# packages in environment at /home/taurean/miniconda3/envs/rapids013-317:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                      1_llvm    conda-forge
arrow-cpp                 0.15.0           py36h090bef1_2    conda-forge
attrs                     19.3.0                     py_0    conda-forge
backcall                  0.1.0                      py_0    conda-forge
bleach                    3.1.3              pyh8c360ce_0    conda-forge
bokeh                     2.0.0            py36h9f0ad1d_0    conda-forge
boost-cpp                 1.70.0               h8e57a91_2    conda-forge
brotli                    1.0.7             he1b5a44_1000    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
c-ares                    1.15.0            h516909a_1001    conda-forge
ca-certificates           2019.11.28           hecc5488_0    conda-forge
cairo                     1.16.0            hfb77d84_1002    conda-forge
certifi                   2019.11.28       py36h9f0ad1d_1    conda-forge
cfitsio                   3.470                hb60a0a2_2    conda-forge
click                     7.1.1              pyh8c360ce_0    conda-forge
cloudpickle               1.3.0                      py_0    conda-forge
cudatoolkit               10.1.243             h6bb024c_0    nvidia
cudf                      0.13.0a200317         py36_4200    rapidsai-nightly
cudnn                     7.6.0                cuda10.1_0    nvidia
cugraph                   0.13.0a200317          py36_386    rapidsai-nightly
cuml                      0.13.0a200317   cuda10.1_py36_1494    rapidsai-nightly
cupy                      7.2.0            py36h0c141eb_1    conda-forge
curl                      7.68.0               hf8cf82a_0    conda-forge
cuspatial                 0.13.0a200207            py36_7    rapidsai-nightly
cytoolz                   0.10.1           py36h516909a_0    conda-forge
dask                      2.12.0                     py_0    conda-forge
dask-core                 2.12.0                     py_0    conda-forge
dask-cuda                 0.13.0b200317           py36_69    rapidsai-nightly
dask-cudf                 0.13.0a200317         py36_4200    rapidsai-nightly
dask-xgboost              0.2.0.dev28      cuda10.1py36_0    rapidsai-nightly
decorator                 4.4.2                      py_0    conda-forge
defusedxml                0.6.0                      py_0    conda-forge
distributed               2.12.0                   py36_0    conda-forge
dlpack                    0.2                  he1b5a44_1    conda-forge
double-conversion         3.1.5                he1b5a44_2    conda-forge
entrypoints               0.3             py36h9f0ad1d_1001    conda-forge
expat                     2.2.9                he1b5a44_2    conda-forge
fastavro                  0.22.13          py36h8c4c3a4_1    conda-forge
fastrlock                 0.4             py36h831f99a_1001    conda-forge
fontconfig                2.13.1            h86ecdb6_1001    conda-forge
freetype                  2.10.0               he983fc9_1    conda-forge
freexl                    1.0.5             h14c3975_1002    conda-forge
fsspec                    0.6.2                      py_0    conda-forge
gdal                      2.4.4            py36h5f563d9_0    conda-forge
geos                      3.8.0                he1b5a44_1    conda-forge
geotiff                   1.5.1                h38872f0_8    conda-forge
gettext                   0.19.8.1          hc5be6a0_1002    conda-forge
gflags                    2.2.2             he1b5a44_1002    conda-forge
giflib                    5.1.7                h516909a_1    conda-forge
glib                      2.58.3          py36hd3ed26a_1003    conda-forge
glog                      0.4.0                he1b5a44_1    conda-forge
grpc-cpp                  1.23.0               h18db393_0    conda-forge
hdf4                      4.2.13            hf30be14_1003    conda-forge
hdf5                      1.10.5          nompi_h3c11f04_1104    conda-forge
heapdict                  1.0.1                      py_0    conda-forge
icu                       64.2                 he1b5a44_1    conda-forge
importlib-metadata        1.5.0            py36h9f0ad1d_1    conda-forge
importlib_metadata        1.5.0                         1    conda-forge
ipykernel                 5.1.4            py36h5ca1d4c_0    conda-forge
ipython                   7.13.0           py36h5ca1d4c_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.16.0           py36h9f0ad1d_1    conda-forge
jinja2                    2.11.1                     py_0    conda-forge
joblib                    0.14.1                     py_0    conda-forge
jpeg                      9c                h14c3975_1001    conda-forge
json-c                    0.13.1            h14c3975_1001    conda-forge
json5                     0.9.0                      py_0    conda-forge
jsonschema                3.2.0            py36h9f0ad1d_1    conda-forge
jupyter_client            6.0.0                      py_0    conda-forge
jupyter_core              4.6.3            py36h9f0ad1d_1    conda-forge
jupyterlab                2.0.1                      py_0    conda-forge
jupyterlab_server         1.0.7                      py_0    conda-forge
kealib                    1.4.12               hec59c27_0    conda-forge
krb5                      1.16.4               h2fd8d38_0    conda-forge
ld_impl_linux-64          2.34                 h53a641e_0    conda-forge
libblas                   3.8.0               16_openblas    conda-forge
libcblas                  3.8.0               16_openblas    conda-forge
libcudf                   0.13.0a200317     cuda10.1_4200    rapidsai-nightly
libcugraph                0.13.0a200317      cuda10.1_386    rapidsai-nightly
libcuml                   0.13.0a200317     cuda10.1_1494    rapidsai-nightly
libcumlprims              0.13.0a200313       cuda10.1_11    rapidsai-nightly
libcurl                   7.68.0               hda55be3_0    conda-forge
libcuspatial              0.13.0a200316       cuda10.1_19    rapidsai-nightly
libdap4                   3.20.4               hd3bb157_0    conda-forge
libedit                   3.1.20170329      hf8c457e_1001    conda-forge
libevent                  2.1.10               h72c5cf5_0    conda-forge
libffi                    3.2.1             he1b5a44_1006    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgdal                   2.4.4                h2b6fda6_0    conda-forge
libgfortran-ng            7.3.0                hdf63c60_5    conda-forge
libhwloc                  2.1.0                h3c4fd83_0    conda-forge
libiconv                  1.15              h516909a_1005    conda-forge
libkml                    1.3.0             h4fcabce_1010    conda-forge
liblapack                 3.8.0               16_openblas    conda-forge
libllvm8                  8.0.1                hc9558a2_0    conda-forge
libnetcdf                 4.7.3           nompi_h9f9fd6a_101    conda-forge
libnvstrings              0.13.0a200317     cuda10.1_4200    rapidsai-nightly
libopenblas               0.3.9                h5ec1e0e_0    conda-forge
libpng                    1.6.37               hed695b0_0    conda-forge
libpq                     12.2                 hae5116b_0    conda-forge
libprotobuf               3.8.0                h8b12597_0    conda-forge
librmm                    0.13.0a200318      cuda10.1_567    rapidsai-nightly
libsodium                 1.0.17               h516909a_0    conda-forge
libspatialite             4.3.0a            ha48a99a_1034    conda-forge
libssh2                   1.8.2                h22169c7_2    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
libtiff                   4.1.0                hfc65ed5_0    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxgboost                1.0.2dev.rapidsai0.13      cuda10.1_5    rapidsai-nightly
libxml2                   2.9.10               hee79883_0    conda-forge
llvm-openmp               9.0.1                hc9558a2_2    conda-forge
llvmlite                  0.30.0           py36h8b12597_1    conda-forge
locket                    0.2.0                      py_2    conda-forge
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
markupsafe                1.1.1            py36h8c4c3a4_1    conda-forge
mistune                   0.8.4           py36h516909a_1000    conda-forge
msgpack-python            1.0.0            py36hdb11119_1    conda-forge
nbconvert                 5.6.1                    py36_0    conda-forge
nbformat                  5.0.4                      py_0    conda-forge
nccl                      2.5.7.1              h51cf6c1_0    conda-forge
ncurses                   6.1               hf484d3e_1002    conda-forge
notebook                  6.0.3                    py36_0    conda-forge
numba                     0.46.0           py36hb3f55d8_1    conda-forge
numpy                     1.18.1           py36h95a1406_0    conda-forge
nvstrings                 0.13.0a200317         py36_4200    rapidsai-nightly
olefile                   0.46                       py_0    conda-forge
openjpeg                  2.3.1                h981e76c_3    conda-forge
openssl                   1.1.1d               h516909a_0    conda-forge
packaging                 20.1                       py_0    conda-forge
pandas                    0.25.3           py36hb3f55d8_0    conda-forge
pandoc                    2.9.2                         0    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
parso                     0.6.2                      py_0    conda-forge
partd                     1.1.0                      py_0    conda-forge
pcre                      8.44                 he1b5a44_0    conda-forge
pexpect                   4.8.0            py36h9f0ad1d_1    conda-forge
pickleshare               0.7.5           py36h9f0ad1d_1001    conda-forge
pillow                    7.0.0            py36h8328e55_1    conda-forge
pip                       20.0.2                     py_2    conda-forge
pixman                    0.38.0            h516909a_1003    conda-forge
poppler                   0.67.0               h14e79db_8    conda-forge
poppler-data              0.4.9                         1    conda-forge
postgresql                12.2                 hf1211e9_0    conda-forge
proj                      6.3.0                hc80f0dc_0    conda-forge
prometheus_client         0.7.1                      py_0    conda-forge
prompt-toolkit            3.0.4                      py_0    conda-forge
prompt_toolkit            3.0.4                         0    conda-forge
psutil                    5.7.0            py36h8c4c3a4_1    conda-forge
pthread-stubs             0.4               h14c3975_1001    conda-forge
ptyprocess                0.6.0                   py_1001    conda-forge
py-xgboost                1.0.2dev.rapidsai0.13  cuda10.1py36_5    rapidsai-nightly
pyarrow                   0.15.0           py36h8b68381_1    conda-forge
pygments                  2.6.1                      py_0    conda-forge
pynvml                    8.0.4                      py_0    conda-forge
pyparsing                 2.4.6                      py_0    conda-forge
pyrsistent                0.15.7           py36h8c4c3a4_1    conda-forge
python                    3.6.10          h9d8adfe_1009_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.6                     1_cp36m    conda-forge
pytz                      2019.3                     py_0    conda-forge
pyyaml                    5.3              py36h8c4c3a4_1    conda-forge
pyzmq                     19.0.0           py36h9947dbf_1    conda-forge
rapids                    0.13.0          cuda10.1_py36_116    rapidsai-nightly
rapids-xgboost            0.13.0          cuda10.1_py36_116    rapidsai-nightly
re2                       2020.03.03           he1b5a44_0    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
rmm                       0.13.0a200318          py36_567    rapidsai-nightly
scikit-learn              0.22.2.post1     py36hcdab131_0    conda-forge
scipy                     1.4.1            py36h921218d_0    conda-forge
send2trash                1.5.0                      py_0    conda-forge
setuptools                46.0.0           py36h9f0ad1d_2    conda-forge
six                       1.14.0                     py_1    conda-forge
snappy                    1.1.8                he1b5a44_1    conda-forge
sortedcontainers          2.1.0                      py_0    conda-forge
sqlite                    3.30.1               hcee41ef_0    conda-forge
tblib                     1.6.0                      py_0    conda-forge
terminado                 0.8.3            py36h9f0ad1d_1    conda-forge
testpath                  0.4.4                      py_0    conda-forge
thrift-cpp                0.12.0            hf3afdfd_1004    conda-forge
tk                        8.6.10               hed695b0_0    conda-forge
toolz                     0.10.0                     py_0    conda-forge
tornado                   6.0.4            py36h8c4c3a4_1    conda-forge
traitlets                 4.3.3            py36h9f0ad1d_1    conda-forge
typing_extensions         3.7.4.1          py36h9f0ad1d_1    conda-forge
tzcode                    2019a             h516909a_1002    conda-forge
ucx                       1.7.0+g9d06c3a       cuda10.1_0    rapidsai-nightly
ucx-py                    0.13.0a200317+g9d06c3a         py36_76    rapidsai-nightly
uriparser                 0.9.3                he1b5a44_1    conda-forge
wcwidth                   0.1.8                      py_0    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wheel                     0.34.2                     py_1    conda-forge
xerces-c                  3.2.2             h8412b87_1004    conda-forge
xgboost                   1.0.2dev.rapidsai0.13  cuda10.1py36_5    rapidsai-nightly
xorg-kbproto              1.0.7             h14c3975_1002    conda-forge
xorg-libice               1.0.10               h516909a_0    conda-forge
xorg-libsm                1.2.3             h84519dc_1000    conda-forge
xorg-libx11               1.6.9                h516909a_0    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.3                h516909a_0    conda-forge
xorg-libxext              1.3.4                h516909a_0    conda-forge
xorg-libxrender           0.9.10            h516909a_1002    conda-forge
xorg-renderproto          0.11.1            h14c3975_1002    conda-forge
xorg-xextproto            7.3.0             h14c3975_1002    conda-forge
xorg-xproto               7.0.31            h14c3975_1007    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
yaml                      0.2.2                h516909a_1    conda-forge
zeromq                    4.3.2                he1b5a44_2    conda-forge
zict                      2.0.0                      py_0    conda-forge
zipp                      3.1.0                      py_0    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge
zstd                      1.4.3                h3b9ef0a_0    conda-forge

Additional context This affects the completion of the mortgage notebook for the 0.13 release. While working through this, I found https://github.com/rapidsai/cudf/issues/4565. It is related, but not the same issue.

@pentschev @randerzander @rnyak

kkraus14 commented 4 years ago

Duplicate of #3960