Open hassanshamji opened 4 years ago
Tagging @Garfounkel as the output should be mirrored to the input. Also, we should be returning the same number of partitions, right?
@hassanshamji while we look into the input/output mismatch, let's try to get the current code working. Before you call out.compute_chunk_sizes(), could you try out = out.rechunk((int(out.shape[0] / 3) + 1, -1))?
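To see what the suggested chunk length int(out.shape[0] / 3) + 1 works out to, here is a minimal pure-Python sketch. The helper split_into_chunks is hypothetical (it is not a dask function); it only assumes the usual rule that an axis of length n cut with chunk length c yields full blocks of c followed by one shorter remainder block.

```python
def split_into_chunks(n_rows: int, chunk_len: int) -> tuple:
    """Return the block lengths for an axis of n_rows cut into chunk_len-sized blocks."""
    if n_rows < 0 or chunk_len <= 0:
        raise ValueError("n_rows must be >= 0 and chunk_len must be > 0")
    full_blocks, remainder = divmod(n_rows, chunk_len)
    # Full blocks first, then one shorter block if anything is left over.
    return (chunk_len,) * full_blocks + ((remainder,) if remainder else ())


n_rows = 12                      # rows in the example from this issue
chunk_len = int(n_rows / 3) + 1  # the expression suggested above
row_chunks = split_into_chunks(n_rows, chunk_len)

print(chunk_len)        # 5
print(row_chunks)       # (5, 5, 2)
print(len(row_chunks))  # 3
```

For the 12-row example this gives three row blocks, the same count as the dataframe's three partitions, although the block lengths themselves need not equal the dataframe's partition lengths.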
The number of partitions is correct, but to match scikit-learn's behavior we transpose the resulting one-hot encoded matrix so that the final shape is (number_of_samples, categories), or in your case (12, 8). Therefore, the indexing to use to retrieve a column is out[:, col_number] instead of out[col_number].
out[:, 0].npartitions # 3
ddf['temp'] = out[:, 0] # no error
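As a CPU-only illustration of the (number_of_samples, categories) layout, the sketch below one-hot encodes the two sample columns used in the reproduction script elsewhere in this thread with plain Python lists. The one_hot helper is hypothetical and is not the cuML implementation; it assumes categories are sorted alphabetically within each input column.

```python
one = ["a", "b", "c", "c", "z", "z", "b", "b", "b", "c", "a", "c"]
two = ["b", "b", "c", "c", "z", "y", "b", "y", "b", "c", "b", "c"]


def one_hot(columns):
    """Encode equal-length categorical columns as a row-major 0/1 matrix."""
    categories = [sorted(set(col)) for col in columns]
    rows = []
    for values in zip(*columns):  # one tuple of raw values per sample
        row = []
        for value, cats in zip(values, categories):
            row.extend(1 if value == cat else 0 for cat in cats)
        rows.append(row)
    return rows, categories


encoded, categories = one_hot([one, two])
shape = (len(encoded), len(encoded[0]))

# Analogue of out[:, 0]: one value per sample, so its length matches the dataframe.
first_column = [row[0] for row in encoded]
# Analogue of out[0]: one value per category, so its length does NOT match.
first_row = encoded[0]

print(shape)              # (12, 8)
print(len(first_column))  # 12
print(len(first_row))     # 8
```

Only the column slice has one entry per sample, which is why it is the slice that can be assigned back as a new dataframe column.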
As for the output type mirroring the input: again, we chose to mimic sklearn's behavior, which is to return an array whatever the input's type. We could change this if needed, but then we'd diverge from sklearn. As you pointed out, if you need the output as a dataframe:
out.compute_chunk_sizes()
out_ddf = out.to_dask_dataframe(columns=cudf.concat(ohe.categories_).tolist())
@Garfounkel thanks for looking into this! It no longer looks like a bug, then. @hassanshamji if you confirm the above works for you, I will go ahead and close this issue.
Thanks for the quick follow-up, @divyegala & @Garfounkel.
Regarding the input/output type, returning an array seems good. I was trying to highlight that it was an array of cupy.ndarray, rather than what I would have expected, an array of cudfs. I could be expecting the wrong thing, but just wanted to clarify that point.
Thanks for the response, @Garfounkel. When I ran ddf['temp'] = out[:, 0], I got the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-c2aec3acf4dc> in <module>
1 out[:, 0].npartitions # 3
----> 2 ddf['temp'] = out[:, 0] # no error
~/.conda/envs/gpu_env/lib/python3.6/site-packages/dask/dataframe/core.py in __setitem__(self, key, value)
3488 df = self.assign(**{k: value for k in key})
3489 else:
-> 3490 df = self.assign(**{key: value})
3491
3492 self.dask = df.dask
~/.conda/envs/gpu_env/lib/python3.6/site-packages/dask/dataframe/core.py in assign(self, **kwargs)
3750 )
3751 )
-> 3752 kwargs[k] = from_dask_array(v, index=self.index)
3753
3754 pairs = list(sum(kwargs.items(), ()))
~/.conda/envs/gpu_env/lib/python3.6/site-packages/dask/dataframe/io/io.py in from_dask_array(x, columns, index)
414 dask.dataframe._Frame.to_records: Reverse conversion
415 """
--> 416 meta = _meta_from_array(x, columns, index)
417
418 if x.ndim == 2 and len(x.chunks[1]) > 1:
~/.conda/envs/gpu_env/lib/python3.6/site-packages/dask/dataframe/io/io.py in _meta_from_array(x, columns, index)
48 elif x.ndim == 1:
49 if np.isscalar(columns) or columns is None:
---> 50 return pd.Series([], name=columns, dtype=x.dtype, index=index)
51 elif len(columns) == 1:
52 return pd.DataFrame(
~/.conda/envs/gpu_env/lib/python3.6/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
213
214 if index is not None:
--> 215 index = ensure_index(index)
216
217 if data is None:
~/.conda/envs/gpu_env/lib/python3.6/site-packages/pandas/core/indexes/base.py in ensure_index(index_like, copy)
5742 return index_like
5743 if hasattr(index_like, "name"):
-> 5744 return Index(index_like, name=index_like.name, copy=copy)
5745
5746 if is_iterator(index_like):
~/.conda/envs/gpu_env/lib/python3.6/site-packages/pandas/core/indexes/base.py in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
515
516 elif hasattr(data, "__array__"):
--> 517 return Index(np.asarray(data), dtype=dtype, copy=copy, name=name, **kwargs)
518 elif data is None or is_scalar(data):
519 cls._scalar_data_error(data)
~/.conda/envs/gpu_env/lib/python3.6/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
~/.conda/envs/gpu_env/lib/python3.6/site-packages/cudf/core/frame.py in __array__(self, dtype)
1054 To explicitly construct a GPU array, consider using \
1055 cupy.asarray(...)\nTo explicitly construct a \
-> 1056 host array, consider using .to_array()"
1057 )
1058
TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU array, consider using cupy.asarray(...)
To explicitly construct a host array, consider using .to_array()
And trying the cupy.asarray(...) returned:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-21-9a8f72758b42> in <module>
3
4 import cupy
----> 5 ddf['temp'] = cupy.asarray(out[:, 0])
~/.conda/envs/gpu_env/lib/python3.6/site-packages/cupy/creation/from_data.py in asarray(a, dtype, order)
66
67 """
---> 68 return core.array(a, dtype, False, order)
69
70
cupy/core/core.pyx in cupy.core.core.array()
cupy/core/core.pyx in cupy.core.core.array()
cupy/core/core.pyx in cupy.core.core._send_object_to_gpu()
~/.conda/envs/gpu_env/lib/python3.6/site-packages/dask/array/core.py in __array__(self, dtype, **kwargs)
1338 x = x.astype(dtype)
1339 if not isinstance(x, np.ndarray):
-> 1340 x = np.array(x)
1341 return x
1342
ValueError: object __array__ method not producing an array
@hassanshamji I am not able to reproduce your issue. When I run the following I don't get any errors:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import pandas as pd
import dask.dataframe as dd
import dask_cudf
import numpy as np
import dask.array as da
import cupy as cp
import cudf
from cuml.dask.preprocessing import OneHotEncoder
orig = pd.DataFrame({"one": np.array(["a", "b", "c", "c", "z", "z", "b", "b", "b", "c", "a", "c"],),
"two": np.array(["b", "b", "c", "c", "z", "y", "b", "y", "b", "c", "b", "c"])})
df = dd.from_pandas(orig, npartitions=3)
ddf = dask_cudf.from_dask_dataframe(df)
cluster = LocalCUDACluster()
client = Client(cluster)
ohe = OneHotEncoder(sparse = False)
out = ohe.fit_transform(ddf)
out.compute_chunk_sizes()
out[:, 0].npartitions # 3
ddf['temp'] = out[:, 0] # no error
cluster.close()
client.close()
Could you try this and tell us if this works for you?
@Garfounkel,
That is strange. I ran those steps exactly and am getting the same error that I reported above (TypeError: Implicit conversion to...).
I'm using a p3.8xlarge | 32 Cores | 244 GB Memory | 4 - NVIDIA V100 GPU
What information can I provide to help diagnose this?
@hassanshamji Could you provide your installed package information (the output of conda list)?
I don't have access to a GPU at the moment, so it's a bit difficult to help with this. @divyegala or @dantegd, could you follow up on this when you get a moment?
TY @Garfounkel, I'm attaching the conda requirements file I've been using. Please let me know if you need more information.
# Usage: conda env create -f ~/environment/gpu_env.yml
name: gpu_env
channels:
  - https://conda.anaconda.org/rapidsai
  - https://conda.anaconda.org/nvidia
  - conda-main
  - conda-forge
  - conda-r
  - conda-nvidia
  - conda-numba
  - conda-rapidsai
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=0_gnu
  - aiohttp=3.6.2=py36h7b6447c_0
  - appdirs=1.4.3=py36h28b3542_0
  - arrow-cpp=0.15.0=py36h090bef1_2
  - async-timeout=3.0.1=py36_0
  - attrs=19.3.0=py_0
  - backcall=0.1.0=py36_0
  - blas=1.0=mkl
  - bleach=3.1.4=py_0
  - bokeh=1.4.0=py36_0
  - boost=1.70.0=py36h9de70de_1
  - boost-cpp=1.70.0=h8e57a91_2
  - brotli=1.0.7=he6710b0_0
  - bzip2=1.0.8=h7b6447c_0
  - c-ares=1.15.0=h7b6447c_1001
  - ca-certificates=2020.1.1=0
  - cairo=1.16.0=hcf35c78_1003
  - certifi=2020.4.5.1=py36_0
  - cffi=1.14.0=py36h2e261b9_0
  - cfitsio=3.470=hb7c8383_2
  - chardet=3.0.4=py36_1003
  - click=7.1.2=py_0
  - click-plugins=1.1.1=py_0
  - cligj=0.5.0=py36_0
  - cloudpickle=1.4.1=py_0
  - cmake=3.14.0=h52cb24c_0
  - colorcet=2.0.2=py_0
  - contextvars=2.4=py_0
  - cryptography=2.9.2=py36h1ba5d50_0
  - cudatoolkit=10.1.243=h6bb024c_0
  - cudf=0.14.0=py36_0
  - cudnn=7.6.0=cuda10.1_0
  - cugraph=0.14.0=py36_0
  - cuml=0.14.0=cuda10.1_py36_0
  - cupy=7.5.0=py36h5c369b2_0
  - curl=7.67.0=hbc83047_0
  - cusignal=0.14.0=py36_0
  - cuspatial=0.14.0=py36_0
  - cuxfilter=0.14.0=py36_0
  - cycler=0.10.0=py36_0
  - cytoolz=0.10.1=py36h7b6447c_0
  - dask=2.17.2=py_0
  - dask-core=2.17.2=py_0
  - dask-cuda=0.14.0=py36_0
  - dask-cudf=0.14.0=py36_0
  - dask-xgboost=0.2.0.dev28=cuda10.1py36_0
  - datashader=0.10.0=py_0
  - datashape=0.5.4=py36_1
  - dbus=1.13.14=hb2f20db_0
  - decorator=4.4.2=py_0
  - defusedxml=0.6.0=py_0
  - distributed=2.17.0=py36_0
  - dlpack=0.2=he1b5a44_1
  - double-conversion=3.1.5=he6710b0_1
  - entrypoints=0.3=py36_0
  - expat=2.2.6=he6710b0_0
  - fastavro=0.23.4=py36h7b6447c_0
  - fastrlock=0.4=py36he6710b0_0
  - fiona=1.8.11=py36h41e4f33_0
  - fontconfig=2.13.1=h86ecdb6_1001
  - freetype=2.9.1=h8a8886c_1
  - freexl=1.0.5=h14c3975_0
  - fsspec=0.7.4=py_0
  - gdal=3.0.2=py36hbb6b9fb_2
  - geopandas=0.6.1=py_0
  - geos=3.7.2=he1b5a44_2
  - geotiff=1.5.1=h21e8280_1
  - gflags=2.2.2=he6710b0_0
  - giflib=5.1.7=h516909a_1
  - glib=2.63.1=h5a9c865_0
  - glog=0.4.0=he6710b0_0
  - gmp=6.1.2=h6c8ec71_1
  - grpc-cpp=1.23.0=h18db393_0
  - gst-plugins-base=1.14.5=h0935bb2_2
  - gstreamer=1.14.5=h36ae1b5_2
  - hdf4=4.2.13=h3ca952b_2
  - hdf5=1.10.5=nompi_h3c11f04_1104
  - heapdict=1.0.1=py_0
  - icu=64.2=he1b5a44_1
  - idna=2.9=py_1
  - idna_ssl=1.1.0=py36_0
  - imageio=2.8.0=py_0
  - immutables=0.11=py36h7b6447c_0
  - importlib-metadata=1.6.0=py36_0
  - importlib_metadata=1.6.0=0
  - intel-openmp=2020.1=217
  - ipykernel=5.1.4=py36h39e3cac_0
  - ipython=7.13.0=py36h5ca1d4c_0
  - ipython_genutils=0.2.0=py36_0
  - jedi=0.17.0=py36_0
  - jinja2=2.11.2=py_0
  - joblib=0.15.1=py_0
  - jpeg=9d=h516909a_0
  - json-c=0.13.1=h1bed415_0
  - jsonschema=3.2.0=py36_0
  - jupyter-server-proxy=1.5.0=py_0
  - jupyter_client=6.1.3=py_0
  - jupyter_core=4.6.3=py36_0
  - kealib=1.4.13=hec59c27_0
  - kiwisolver=1.2.0=py36hfd86e86_0
  - krb5=1.16.4=h173b8e3_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libcudf=0.14.0=cuda10.1_0
  - libcugraph=0.14.0=cuda10.1_0
  - libcuml=0.14.0=cuda10.1_0
  - libcumlprims=0.14.1=cuda10.1_0
  - libcurl=7.67.0=h20c2e04_0
  - libcuspatial=0.14.0=cuda10.1_0
  - libdap4=3.20.4=hd3bb157_0
  - libedit=3.1.20181209=hc058e9b_0
  - libevent=2.1.10=h72c5cf5_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.2.0=h24d8f2e_2
  - libgdal=3.0.2=hc7cfd23_2
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libgomp=9.2.0=h24d8f2e_2
  - libhwloc=2.1.0=h3c4fd83_0
  - libiconv=1.15=h63c8f33_5
  - libkml=1.3.0=h4fcabce_1010
  - libnetcdf=4.7.1=nompi_h94020b1_102
  - libnvstrings=0.14.0=cuda10.1_0
  - libpng=1.6.37=hbc83047_0
  - libpq=11.5=hd9ab2ff_2
  - libprotobuf=3.8.0=hd408876_0
  - librmm=0.14.0=cuda10.1_0
  - libsodium=1.0.16=h1bed415_0
  - libspatialindex=1.9.3=he6710b0_0
  - libspatialite=4.3.0a=h4f6d029_1032
  - libssh2=1.9.0=h1ba5d50_1
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtiff=4.1.0=hfc65ed5_0
  - libuuid=2.32.1=h14c3975_1000
  - libwebp=1.0.1=h8e7db2f_0
  - libxcb=1.13=h1bed415_1
  - libxgboost=1.1.0dev.rapidsai0.14=cuda10.1_0
  - libxml2=2.9.10=hee79883_0
  - lightgbm=2.3.0=py36he6710b0_0
  - llvmlite=0.32.1=py36hd408876_0
  - locket=0.2.0=py36_1
  - lz4-c=1.8.3=he1b5a44_1001
  - markdown=3.1.1=py36_0
  - markupsafe=1.1.1=py36h7b6447c_0
  - matplotlib=3.2.1=0
  - matplotlib-base=3.2.1=py36hb8e4980_0
  - mistune=0.8.4=py36h7b6447c_0
  - mkl=2020.1=217
  - mkl-service=2.3.0=py36he904b0f_0
  - mkl_fft=1.0.15=py36ha843d7b_0
  - mkl_random=1.1.1=py36h0573a6f_0
  - msgpack-python=1.0.0=py36hfd86e86_1
  - multidict=4.7.3=py36h7b6447c_0
  - multipledispatch=0.6.0=py36_0
  - munch=2.5.0=py_0
  - nbconvert=5.6.1=py36_0
  - nbformat=5.0.6=py_0
  - nccl=2.5.7.1=h51cf6c1_0
  - ncurses=6.2=he6710b0_1
  - networkx=2.4=py_0
  - nodejs=10.13.0=he6710b0_0
  - notebook=6.0.3=py36_0
  - numba=0.49.1=py36h0573a6f_0
  - numpy=1.18.1=py36h4f9e942_0
  - numpy-base=1.18.1=py36hde5b4d6_1
  - nvstrings=0.14.0=py36_0
  - olefile=0.46=py36_0
  - openjpeg=2.3.1=h981e76c_3
  - openssl=1.1.1g=h7b6447c_0
  - packaging=20.3=py_0
  - pandas=0.25.3=py36he6710b0_0
  - pandoc=2.2.3.2=0
  - pandocfilters=1.4.2=py36_1
  - panel=0.6.4=0
  - param=1.9.3=py_0
  - parquet-cpp=1.5.1=2
  - parso=0.7.0=py_0
  - partd=1.1.0=py_0
  - pcre=8.43=he6710b0_0
  - pexpect=4.8.0=py36_0
  - pickleshare=0.7.5=py36_0
  - pillow=7.1.2=py36hb39fc2d_0
  - pip=20.0.2=py36_3
  - pixman=0.38.0=h7b6447c_0
  - poppler=0.67.0=h14e79db_8
  - poppler-data=0.4.9=0
  - postgresql=11.5=hc63931a_2
  - proj=6.2.1=haa6030c_0
  - prometheus_client=0.7.1=py_0
  - prompt-toolkit=3.0.5=py_0
  - prompt_toolkit=3.0.5=0
  - psutil=5.7.0=py36h7b6447c_0
  - ptyprocess=0.6.0=py36_0
  - py-xgboost=1.1.0dev.rapidsai0.14=cuda10.1py36_0
  - pyarrow=0.15.0=py36h8b68381_1
  - pycparser=2.20=py_0
  - pyct=0.4.6=py36_0
  - pyee=7.0.2=pyh9f0ad1d_0
  - pygments=2.6.1=py_0
  - pynvml=8.0.4=py_0
  - pyopenssl=19.1.0=py36_0
  - pyparsing=2.4.7=py_0
  - pyppeteer=0.0.25=py_1
  - pyproj=2.6.1.post1=py36hd003209_1
  - pyqt=5.9.2=py36h05f1152_2
  - pyrsistent=0.16.0=py36h7b6447c_0
  - pysocks=1.7.1=py36_0
  - python=3.6.10=hcf32534_1
  - python-dateutil=2.8.1=py_0
  - python_abi=3.6=1_cp36m
  - pytz=2020.1=py_0
  - pyviz_comms=0.7.4=py_0
  - pywavelets=1.1.1=py36h7b6447c_0
  - pyyaml=5.3.1=py36h7b6447c_0
  - pyzmq=18.1.1=py36he6710b0_0
  - qt=5.9.7=h0c104cb_3
  - rapids=0.14.0=cuda10.1_py36_2
  - rapids-xgboost=0.14.0=cuda10.1_py36_2
  - re2=2019.08.01=he6710b0_0
  - readline=8.0=h7b6447c_0
  - requests=2.23.0=py36_0
  - rhash=1.3.8=h1ba5d50_0
  - rmm=0.14.0=py36_0
  - rtree=0.9.4=py36_1
  - scikit-image=0.16.2=py36h0573a6f_0
  - scikit-learn=0.22.1=py36hd81dba3_0
  - scipy=1.4.1=py36h0b6359f_0
  - seaborn=0.10.1=py_0
  - send2trash=1.5.0=py36_0
  - setuptools=47.1.1=py36_0
  - shapely=1.6.4=py36hec07ddf_1006
  - simpervisor=0.3=py_1
  - sip=4.19.8=py36hf484d3e_0
  - six=1.15.0=py_0
  - snappy=1.1.7=hbae5bb6_3
  - sortedcontainers=2.1.0=py36_0
  - spdlog=1.6.1=hc9558a2_0
  - sqlite=3.31.1=h62c20be_1
  - tbb=2018.0.5=h6bb024c_0
  - tblib=1.6.0=py_0
  - terminado=0.8.3=py36_0
  - testpath=0.4.4=py_0
  - thrift-cpp=0.12.0=hf3afdfd_1004
  - tiledb=1.6.2=h7d710e0_2
  - tk=8.6.10=hed695b0_0
  - toolz=0.10.0=py_0
  - tornado=6.0.4=py36h7b6447c_1
  - tqdm=4.46.0=py_0
  - traitlets=4.3.3=py36_0
  - typing_extensions=3.7.4.1=py36_0
  - tzcode=2020a=h516909a_0
  - ucx=1.8.0+gf6ec8d4=cuda10.1_20
  - ucx-py=0.14.0+gf6ec8d4=py36_0
  - uriparser=0.9.3=he6710b0_1
  - urllib3=1.25.8=py36_0
  - wcwidth=0.1.9=py_0
  - webencodings=0.5.1=py36_1
  - websockets=8.1=py36h8c4c3a4_1
  - wheel=0.34.2=py36_0
  - xarray=0.15.1=py_0
  - xerces-c=3.2.2=h8412b87_1004
  - xgboost=1.1.0dev.rapidsai0.14=cuda10.1py36_0
  - xorg-kbproto=1.0.7=h14c3975_1002
  - xorg-libice=1.0.10=h516909a_0
  - xorg-libsm=1.2.3=h84519dc_1000
  - xorg-libx11=1.6.9=h516909a_0
  - xorg-libxext=1.3.4=h516909a_0
  - xorg-libxrender=0.9.10=h516909a_1002
  - xorg-renderproto=0.11.1=h14c3975_1002
  - xorg-xextproto=7.3.0=h14c3975_1002
  - xorg-xproto=7.0.31=h14c3975_1007
  - xz=5.2.5=h7b6447c_0
  - yaml=0.1.7=had09818_2
  - yarl=1.4.2=py36h7b6447c_0
  - zeromq=4.3.1=he6710b0_3
  - zict=2.0.0=py_0
  - zipp=3.1.0=py_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.4.3=h3b9ef0a_0
  - pip:
    - treelite==0.91
I am having trouble joining the output of cuml.dask.preprocessing.OneHotEncoder with the source dask_cudf. What is a correct way to do that?
Create dask_cudf
OHE
Even though I pass in a dask_cudf, I receive a dask array of cupy.ndarrays (which is slightly unexpected). However, when I inspect the column values it all looks correct.
~/.conda/envs/gpu_env/lib/python3.6/site-packages/dask/dataframe/core.py in __setitem__(self, key, value)
3488 df = self.assign(**{k: value for k in key})
3489 else:
-> 3490 df = self.assign(**{key: value})
3491
3492 self.dask = df.dask
~/.conda/envs/gpu_env/lib/python3.6/site-packages/dask/dataframe/core.py in assign(self, **kwargs)
3747 raise ValueError(
3748 "Number of partitions do not match ({0} != {1})".format(
-> 3749 v.npartitions, self.npartitions
3750 )
3751 )
ValueError: Number of partitions do not match (1 != 3)
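The message comes from the partition-count comparison visible in the traceback. Below is a minimal pure-Python sketch of that comparison. Both helpers are hypothetical stand-ins, not dask functions, and the chunk layout is an assumption: a (12, 8) array stored as three row blocks and one column block.

```python
def blocks_after_integer_index(chunks, axis_kept):
    """Count the blocks along the axis that survives integer indexing on the other axis."""
    return len(chunks[axis_kept])


def check_npartitions(value_npartitions, frame_npartitions):
    """Mimic the comparison shown in the traceback: the counts must be equal."""
    if value_npartitions != frame_npartitions:
        raise ValueError(
            "Number of partitions do not match ({0} != {1})".format(
                value_npartitions, frame_npartitions
            )
        )


# Assumed layout: 12 rows in three blocks, 8 columns in a single block.
chunks = ((4, 4, 4), (8,))
frame_npartitions = 3

row_slice = blocks_after_integer_index(chunks, axis_kept=1)     # like out[0]
column_slice = blocks_after_integer_index(chunks, axis_kept=0)  # like out[:, 0]

check_npartitions(column_slice, frame_npartitions)  # 3 == 3, no error

try:
    check_npartitions(row_slice, frame_npartitions)
    message = None
except ValueError as err:
    message = str(err)

print(message)  # Number of partitions do not match (1 != 3)
```

Under these assumptions, a slice that keeps the column axis has a single block and trips the check, while a slice that keeps the row axis has as many blocks as the dataframe has partitions.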