holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

Error using dask dataframe with incompatible column dtypes #1235

Status: Closed (closed by ianthomas23 1 year ago)

ianthomas23 commented 1 year ago

Consider datashading a dask dataframe containing columns of different dtypes that are not actually used in the datashade operation:

import dask.dataframe as dd
import datashader as ds
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=dict(
        x=[0, 1, 2],
        y=[0, 1, 2],
        # Extra datetime column that the datashade operation never reads.
        dates=np.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64'),
    )
)
ddf = dd.from_pandas(df, npartitions=2)

canvas = ds.Canvas(2, 2)
agg = canvas.points(ddf, 'x', 'y', ds.count())

Note the dates column is not used in the canvas.points call. Running this gives the following error:

Traceback (most recent call last):
  File "/Users/iant/github_temp/datashader_temp/dask_dtypes.py", line 16, in <module>
    agg = canvas.points(ddf, 'x', 'y', ds.count())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/core.py", line 220, in points
    return bypixel(source, self, glyph, agg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/core.py", line 1257, in bypixel
    return bypixel.pipeline(source, schema, canvas, glyph, agg, antialias=antialias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/utils.py", line 109, in __call__
    return lk[typ](head, *rest, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/data_libraries/dask.py", line 22, in dask_pipeline
    dsk, name = glyph_dispatch(glyph, df, schema, canvas, summary, antialias=antialias, cuda=cuda)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/utils.py", line 112, in __call__
    return lk[cls](head, *rest, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/iant/github/datashader/datashader/data_libraries/dask.py", line 122, in default
    dtype = np.result_type(*dtypes)
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in result_type
TypeError: The DType <class 'numpy.dtype[datetime64]'> could not be promoted by <class 'numpy.dtype[int64]'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtype[int64]'>, <class 'numpy.dtype[int64]'>, <class 'numpy.dtype[datetime64]'>)

Internally, the code that handles dask dataframes tries to find a single dtype that is compatible with every column of the dataframe (the np.result_type call at the bottom of the traceback). This is unnecessary: only the x and y columns are used here, so the other columns can be ignored.
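
The promotion failure can be reproduced with NumPy alone, which may make the cause easier to see. The sketch below is not datashader's code: `common_dtype` and `column_dtypes` are hypothetical names introduced only for illustration, and the dtypes are assumed to mirror the example dataframe above. It shows that promoting all column dtypes fails, while promoting only the columns the glyph uses succeeds.

```python
import numpy as np

# Assumed column dtypes, mirroring the example dataframe (x, y, dates).
column_dtypes = {
    "x": np.dtype("int64"),
    "y": np.dtype("int64"),
    "dates": np.dtype("datetime64[ns]"),
}


def common_dtype(dtypes_by_column, columns=None):
    """Hypothetical helper: promote the dtypes of the selected columns only.

    With columns=None every column is included, which is the behaviour
    that triggers the error described in this issue.
    """
    if columns is None:
        columns = list(dtypes_by_column)
    return np.result_type(*(dtypes_by_column[c] for c in columns))


# Promoting every column fails because int64 and datetime64 have no common dtype.
try:
    common_dtype(column_dtypes)
    promoted_all = True
except TypeError:
    promoted_all = False

# Promoting only the columns actually used by the glyph succeeds.
xy_dtype = common_dtype(column_dtypes, ["x", "y"])

print(promoted_all, xy_dtype)
```

Restricting the promotion to the columns named in the canvas.points call is the approach the paragraph above suggests; how the actual fix is implemented inside datashader is not shown here.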

First reported by @Hoxbro.