Closed: jbednar closed this issue 1 year ago
I get the right coloring when using engine="fastparquet" and the wrong coloring with engine="pyarrow" in dd.read_parquet.
Thanks, @Hoxbro and @martindurant! Looks like indeed those crazy patterns are to do with pyarrow, presumably using different category values in each partition.
Unfortunately, looks like there is still an issue even with fastparquet, because when I specify the color mapping, the newer environments don't match the color key:
import datashader as ds, dask.dataframe as dd
color_key = {'w':'aqua', 'b':'lime', 'a':'red', 'h':'fuchsia', 'o':'yellow' }
df = dd.read_parquet('./data/census2010.parq', engine='fastparquet')
cvs = ds.Canvas(plot_width=900, plot_height=525, x_range=[-14E6, -7.4E6], y_range=[2.7E6, 6.4E6])
agg = cvs.points(df, 'easting', 'northing', ds.count_cat('race'))
img = ds.tf.shade(agg, how='eq_hist', color_key=color_key)
img
The older versions in censusold do match the color key for the same code, with e.g. Maine being colored aqua:
@martindurant, any idea how that could have happened (given that the user code and the datashader version are the same in both environments, unless I got confused)?
I'm not sure what's involved in converting values in parquet to colours, or what count_cat does. I think I would do a simple .value_counts() to ensure that the gross totals are roughly right for the categories before going further.
Looks right:
(same listing for both censusold and censusnew)
To be concrete, for the same code and the same Datashader version, different output despite the dataframe seemingly having the same counts:
conda list
# packages in environment at /Users/jbednar/miniconda3/envs/censusnew:
#
# Name Version Build Channel
anyio 3.6.2 pyhd8ed1ab_0 conda-forge
appnope 0.1.3 pyhd8ed1ab_0 conda-forge
argon2-cffi 21.3.0 pyhd8ed1ab_0 conda-forge
argon2-cffi-bindings 21.2.0 py39ha30fb19_3 conda-forge
arrow-cpp 11.0.0 h694c41f_15_cpu conda-forge
asttokens 2.2.1 pyhd8ed1ab_0 conda-forge
attrs 22.2.0 pyh71513ae_0 conda-forge
aws-c-auth 0.6.26 hb063c81_3 conda-forge
aws-c-cal 0.5.21 hf54dd2f_2 conda-forge
aws-c-common 0.8.14 hb7f2c08_0 conda-forge
aws-c-compression 0.2.16 h99c63db_5 conda-forge
aws-c-event-stream 0.2.20 hbf6f731_5 conda-forge
aws-c-http 0.7.6 h58d2db5_1 conda-forge
aws-c-io 0.13.21 had634fe_0 conda-forge
aws-c-mqtt 0.8.6 h14bedde_13 conda-forge
aws-c-s3 0.2.8 h7103d8a_1 conda-forge
aws-c-sdkutils 0.1.9 h99c63db_0 conda-forge
aws-checksums 0.1.14 h99c63db_5 conda-forge
aws-crt-cpp 0.19.9 hb02fd3d_2 conda-forge
aws-sdk-cpp 1.10.57 h74c80f7_9 conda-forge
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 pyhd8ed1ab_3 conda-forge
backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge
beautifulsoup4 4.12.2 pyha770c72_0 conda-forge
bleach 6.0.0 pyhd8ed1ab_0 conda-forge
bokeh 3.1.0 pyhd8ed1ab_0 conda-forge
brotlipy 0.7.0 py39ha30fb19_1005 conda-forge
bzip2 1.0.8 h0d85af4_4 conda-forge
c-ares 1.18.1 h0d85af4_0 conda-forge
ca-certificates 2022.12.7 h033912b_0 conda-forge
certifi 2022.12.7 pyhd8ed1ab_0 conda-forge
cffi 1.15.1 py39h131948b_3 conda-forge
charset-normalizer 3.1.0 pyhd8ed1ab_0 conda-forge
click 8.1.3 unix_pyhd8ed1ab_2 conda-forge
cloudpickle 2.2.1 pyhd8ed1ab_0 conda-forge
colorcet 3.0.1 pyhd8ed1ab_0 conda-forge
comm 0.1.3 pyhd8ed1ab_0 conda-forge
contourpy 1.0.7 py39h92daf61_0 conda-forge
cramjam 2.6.2 py39hd4bc93a_0 conda-forge
cryptography 40.0.2 py39hbeae22c_0 conda-forge
cytoolz 0.12.0 py39ha30fb19_1 conda-forge
dask 2023.4.0 pyhd8ed1ab_0 conda-forge
dask-core 2023.4.0 pyhd8ed1ab_0 conda-forge
datashader 0.14.4 pyh1a96a4e_0 conda-forge
datashape 0.5.4 py_1 conda-forge
debugpy 1.6.7 py39h7a8716b_0 conda-forge
decorator 5.1.1 pyhd8ed1ab_0 conda-forge
defusedxml 0.7.1 pyhd8ed1ab_0 conda-forge
distributed 2023.4.0 pyhd8ed1ab_0 conda-forge
entrypoints 0.4 pyhd8ed1ab_0 conda-forge
executing 1.2.0 pyhd8ed1ab_0 conda-forge
fastparquet 2023.2.0 py39h7cc1f47_0 conda-forge
flit-core 3.8.0 pyhd8ed1ab_0 conda-forge
freetype 2.12.1 h3f81eb7_1 conda-forge
fsspec 2023.4.0 pyh1a96a4e_0 conda-forge
gflags 2.2.2 hb1e8313_1004 conda-forge
glog 0.6.0 h8ac2a54_0 conda-forge
idna 3.4 pyhd8ed1ab_0 conda-forge
importlib-metadata 6.6.0 pyha770c72_0 conda-forge
importlib_metadata 6.6.0 hd8ed1ab_0 conda-forge
importlib_resources 5.12.0 pyhd8ed1ab_0 conda-forge
ipykernel 6.22.0 pyh736e0ef_0 conda-forge
ipython 8.12.0 pyhd1c38e8_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.18.2 pyhd8ed1ab_0 conda-forge
jinja2 3.1.2 pyhd8ed1ab_1 conda-forge
jsonschema 4.17.3 pyhd8ed1ab_0 conda-forge
jupyter_client 8.2.0 pyhd8ed1ab_0 conda-forge
jupyter_core 5.3.0 py39h6e9494a_0 conda-forge
jupyter_events 0.6.3 pyhd8ed1ab_0 conda-forge
jupyter_server 2.5.0 pyhd8ed1ab_0 conda-forge
jupyter_server_terminals 0.4.4 pyhd8ed1ab_1 conda-forge
jupyterlab_pygments 0.2.2 pyhd8ed1ab_0 conda-forge
krb5 1.20.1 h049b76e_0 conda-forge
lcms2 2.15 h2dcdeff_1 conda-forge
lerc 4.0.0 hb486fe8_0 conda-forge
libabseil 20230125.0 cxx17_hf0c8a7f_1 conda-forge
libarrow 11.0.0 h53a6c5b_15_cpu conda-forge
libblas 3.9.0 16_osx64_openblas conda-forge
libbrotlicommon 1.0.9 hb7f2c08_8 conda-forge
libbrotlidec 1.0.9 hb7f2c08_8 conda-forge
libbrotlienc 1.0.9 hb7f2c08_8 conda-forge
libcblas 3.9.0 16_osx64_openblas conda-forge
libcrc32c 1.1.2 he49afe7_0 conda-forge
libcurl 8.0.1 h1fead75_0 conda-forge
libcxx 16.0.2 hd57cbcb_0 conda-forge
libdeflate 1.18 hac1461d_0 conda-forge
libedit 3.1.20191231 h0678c8f_2 conda-forge
libev 4.33 haf1e3a3_1 conda-forge
libevent 2.1.10 h7d65743_4 conda-forge
libffi 3.4.2 h0d85af4_5 conda-forge
libgfortran 5.0.0 11_3_0_h97931a8_31 conda-forge
libgfortran5 12.2.0 he409387_31 conda-forge
libgoogle-cloud 2.8.0 h176059f_1 conda-forge
libgrpc 1.52.1 h5bc3d57_1 conda-forge
libjpeg-turbo 2.1.5.1 hb7f2c08_0 conda-forge
liblapack 3.9.0 16_osx64_openblas conda-forge
libllvm11 11.1.0 h8fb7429_5 conda-forge
libnghttp2 1.52.0 he2ab024_0 conda-forge
libopenblas 0.3.21 openmp_h429af6e_3 conda-forge
libpng 1.6.39 ha978bb4_0 conda-forge
libprotobuf 3.21.12 hbc0c0cd_0 conda-forge
libsodium 1.0.18 hbcb3906_1 conda-forge
libsqlite 3.40.0 ha978bb4_1 conda-forge
libssh2 1.10.0 h47af595_3 conda-forge
libthrift 0.18.1 h16802d8_0 conda-forge
libtiff 4.5.0 hedf67fa_6 conda-forge
libutf8proc 2.8.0 hb7f2c08_0 conda-forge
libwebp-base 1.3.0 hb7f2c08_0 conda-forge
libxcb 1.13 h0d85af4_1004 conda-forge
libzlib 1.2.13 hfd90126_4 conda-forge
llvm-openmp 16.0.2 hff08bdf_0 conda-forge
llvmlite 0.39.1 py39had167e2_1 conda-forge
locket 1.0.0 pyhd8ed1ab_0 conda-forge
lz4 4.3.2 py39hd0af75a_0 conda-forge
lz4-c 1.9.4 hf0c8a7f_0 conda-forge
markupsafe 2.1.2 py39ha30fb19_0 conda-forge
matplotlib-inline 0.1.6 pyhd8ed1ab_0 conda-forge
mistune 2.0.5 pyhd8ed1ab_0 conda-forge
msgpack-python 1.0.5 py39h92daf61_0 conda-forge
multipledispatch 0.6.0 py_0 conda-forge
nbclassic 0.5.5 pyh8b2e9e2_0 conda-forge
nbclient 0.7.4 pyhd8ed1ab_0 conda-forge
nbconvert-core 7.3.1 pyhd8ed1ab_0 conda-forge
nbformat 5.8.0 pyhd8ed1ab_0 conda-forge
ncurses 6.3 h96cf925_1 conda-forge
nest-asyncio 1.5.6 pyhd8ed1ab_0 conda-forge
notebook 6.5.4 pyha770c72_0 conda-forge
notebook-shim 0.2.3 pyhd8ed1ab_0 conda-forge
numba 0.56.4 py39h6e2ba77_1 conda-forge
numpy 1.23.5 py39hdfa1d0c_0 conda-forge
openjpeg 2.5.0 h13ac156_2 conda-forge
openssl 3.1.0 h8a1eda9_2 conda-forge
orc 1.8.3 ha9d861c_0 conda-forge
packaging 23.1 pyhd8ed1ab_0 conda-forge
pandas 2.0.1 py39h11b3245_0 conda-forge
pandocfilters 1.5.0 pyhd8ed1ab_0 conda-forge
param 1.13.0 pyh1a96a4e_0 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.8.3 pyhd8ed1ab_0 conda-forge
partd 1.4.0 pyhd8ed1ab_0 conda-forge
pexpect 4.8.0 pyh1a96a4e_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 9.5.0 py39h77c96bc_0 conda-forge
pip 23.1.1 pyhd8ed1ab_0 conda-forge
pkgutil-resolve-name 1.3.10 pyhd8ed1ab_0 conda-forge
platformdirs 3.3.0 pyhd8ed1ab_0 conda-forge
pooch 1.7.0 pyha770c72_3 conda-forge
prometheus_client 0.16.0 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.38 pyha770c72_0 conda-forge
prompt_toolkit 3.0.38 hd8ed1ab_0 conda-forge
psutil 5.9.5 py39ha30fb19_0 conda-forge
pthread-stubs 0.4 hc929b4f_1001 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge
pyarrow 11.0.0 py39h105b94d_15_cpu conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pyct 0.4.6 py_0 conda-forge
pyct-core 0.4.6 py_0 conda-forge
pygments 2.15.1 pyhd8ed1ab_0 conda-forge
pyopenssl 23.1.1 pyhd8ed1ab_0 conda-forge
pyrsistent 0.19.3 py39ha30fb19_0 conda-forge
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.9.16 h709bd14_0_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-fastjsonschema 2.16.3 pyhd8ed1ab_0 conda-forge
python-json-logger 2.0.7 pyhd8ed1ab_0 conda-forge
python-snappy 0.6.1 py39hf74c2c1_0 conda-forge
python-tzdata 2023.3 pyhd8ed1ab_0 conda-forge
python_abi 3.9 3_cp39 conda-forge
pytz 2023.3 pyhd8ed1ab_0 conda-forge
pyyaml 6.0 py39ha30fb19_5 conda-forge
pyzmq 25.0.2 py39hed8f129_0 conda-forge
re2 2023.02.02 hf0c8a7f_0 conda-forge
readline 8.2 h9e318b2_1 conda-forge
requests 2.28.2 pyhd8ed1ab_1 conda-forge
rfc3339-validator 0.1.4 pyhd8ed1ab_0 conda-forge
rfc3986-validator 0.1.1 pyh9f0ad1d_0 conda-forge
scipy 1.10.1 py39h4c5e66d_0 conda-forge
send2trash 1.8.0 pyhd8ed1ab_0 conda-forge
setuptools 67.7.2 pyhd8ed1ab_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
snappy 1.1.10 h225ccf5_0 conda-forge
sniffio 1.3.0 pyhd8ed1ab_0 conda-forge
sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge
soupsieve 2.3.2.post1 pyhd8ed1ab_0 conda-forge
stack_data 0.6.2 pyhd8ed1ab_0 conda-forge
tblib 1.7.0 pyhd8ed1ab_0 conda-forge
terminado 0.17.1 pyhd1c38e8_0 conda-forge
tinycss2 1.2.1 pyhd8ed1ab_0 conda-forge
tk 8.6.12 h5dbffcc_0 conda-forge
toolz 0.12.0 pyhd8ed1ab_0 conda-forge
tornado 6.3 py39ha30fb19_0 conda-forge
traitlets 5.9.0 pyhd8ed1ab_0 conda-forge
typing-extensions 4.5.0 hd8ed1ab_0 conda-forge
typing_extensions 4.5.0 pyha770c72_0 conda-forge
tzdata 2023c h71feb2d_0 conda-forge
urllib3 1.26.15 pyhd8ed1ab_0 conda-forge
wcwidth 0.2.6 pyhd8ed1ab_0 conda-forge
webencodings 0.5.1 py_1 conda-forge
websocket-client 1.5.1 pyhd8ed1ab_0 conda-forge
wheel 0.40.0 pyhd8ed1ab_0 conda-forge
xarray 2023.4.2 pyhd8ed1ab_0 conda-forge
xorg-libxau 1.0.9 h35c211d_0 conda-forge
xorg-libxdmcp 1.1.3 h35c211d_0 conda-forge
xyzservices 2023.2.0 pyhd8ed1ab_0 conda-forge
xz 5.2.6 h775f41a_0 conda-forge
yaml 0.2.5 h0d85af4_2 conda-forge
zeromq 4.3.4 he49afe7_1 conda-forge
zict 3.0.0 pyhd8ed1ab_0 conda-forge
zipp 3.15.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.13 hfd90126_4 conda-forge
zstd 1.5.2 hbc0c0cd_6 conda-forge
Upon investigation, there is an assumption in datashader's handling of categorical columns that each partition has its categories sorted in the same order. A categorical aggregation is 3D, of shape (ny, nx, ncat) where ncat is the number of categories, and internally we don't use a category directly but its index into the sequence of categories. Each partition is internally consistent, but when combining the results from multiple partitions across the categories, the difference in indexes combines them incorrectly, resulting in different colors.
For the 2010 US census data loaded using pyarrow, the category order varies across the partitions (but is repeatable). Using fastparquet, the category orders are the same across all partitions, but this order is different from the order of categories used for the colormapping (which happens at the dask dataframe level, not the individual partition level).
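To make the index mismatch concrete, here is a minimal sketch (not datashader code) of how positionally combining per-partition count vectors goes wrong when two partitions order the same categories differently:

```python
import numpy as np
import pandas as pd

# Two partitions holding the same three categories in different orders,
# as pyarrow apparently produces (the values here are invented).
p0 = pd.Series(['a', 'a', 'b', 'w'], dtype=pd.CategoricalDtype(['a', 'b', 'w']))
p1 = pd.Series(['w', 'w', 'w', 'b', 'a'], dtype=pd.CategoricalDtype(['w', 'b', 'a']))

# Per-partition counts indexed by integer category code, analogous to
# the ncat axis of the (ny, nx, ncat) aggregate.
c0 = np.bincount(p0.cat.codes, minlength=3)  # order a, b, w -> [2, 1, 1]
c1 = np.bincount(p1.cat.codes, minlength=3)  # order w, b, a -> [3, 1, 1]

# A positional combine mixes 'a' counts with 'w' counts: wrong totals.
print(c0 + c1)  # [5, 2, 2], but the true totals are a=3, b=2, w=4
```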
There is a related issue on dask: https://github.com/dask/dask/issues/9467
We need to solve this within datashader, but there is fortunately a workaround. After the dask.dataframe.read_parquet call, add
df = df.categorize('race')
and the output is correct using either fastparquet 2023.2.0 or pyarrow 10.0.1.
I don't think that fastparquet should be recoding the column on load - it must be showing the real encoding in the files, the same across all of them. So what is pyarrow doing? No idea.
df = df.categorize('race')
Is there no cost associated with this?
There is significant cost in doing this.
Damn. I guess we need to move forward with the fix in Datashader, then.
The example works fine for dask <= 2022.7.0 and fails for dask >= 2022.7.1. The explanation is in the dask documentation at the bottom of this page: https://docs.dask.org/en/stable/dataframe-categoricals.html. The important quote is "If you write and read to parquet, Dask will forget known categories. This happens because, due to performance concerns, all the categories are saved in every partition rather than in the parquet metadata", and this is followed by an explanation of how to deal with it, which is something along the lines of
if not ddf.col.cat.known:
    ddf.col = ddf.col.cat.set_categories(ddf.col.head(1).cat.categories)
where col is a categorical column that we want to use.
We can replicate the error using the US census data, but this is too large for a repeatable test. We can also do a cycle of save to parquet followed by read from parquet to replicate it. But here is a simpler reproducer that we'll be able to add to the datashader test suite:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(data=dict(col = ['a', 'b', 'c', 'a', 'b', 'c', 'b', 'b', 'b', 'b', 'b', 'b']))
ddf = dd.from_pandas(df, npartitions=2)
ddf.col = ddf.col.astype('category')
for i in range(ddf.npartitions):
    partition = ddf.get_partition(i)
    print("Partition counts", i, dict(partition.col.value_counts().compute()))
which produces
Partition counts 0 {'a': 2, 'b': 2, 'c': 2}
Partition counts 1 {'b': 6}
If you use this in datashader, all the partition 1 'b' counts are assigned to categorical index 0 so they are combined with the partition 0 'a' counts, which is incorrect. Adding the recommended code from the dask docs:
if not ddf.col.cat.known:
    ddf.col = ddf.col.cat.set_categories(ddf.col.head(1).cat.categories)
gives
Partition counts 0 {'a': 2, 'b': 2, 'c': 2}
Partition counts 1 {'b': 6, 'a': 0, 'c': 0}
which works as expected. (The order of entries is different above, but the underlying Indexes have identical order and give the correct datashader output.)
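The effect of set_categories on the underlying integer codes can be seen in plain pandas (toy series, not datashader internals):

```python
import pandas as pd

# A partition that only ever saw 'b' starts with categories ['b'], so
# its code for 'b' is 0 and would collide with another partition's
# category 0 when combined positionally.
s0 = pd.Series(['a', 'b', 'c'], dtype='category')  # categories: a, b, c
s1 = pd.Series(['b', 'b'], dtype='category')       # categories: b
print(list(s1.cat.codes))                          # [0, 0]

# Re-coding against a shared category Index aligns the codes.
s1 = s1.cat.set_categories(s0.cat.categories)
print(list(s1.cat.codes))                          # [1, 1]: 'b' is index 1
```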
So the difference between fastparquet and pyarrow is that fastparquet saves the pandas categories as-is, using the existing coding, whereas arrow presumably must re-code on save.
If I create an outdated environment with
conda create -n censusold python=3.7 notebook 'dask<2022.6.2' datashader 'fastparquet<2023.2.0' python-snappy 'pandas<2'
categorical colormapping works fine for http://s3.amazonaws.com/datashader-data/census2010.parq.zip unpacked and used with this code. However, for the latest environment from conda-forge (
conda create -n censusnew -c conda-forge python=3.9 notebook dask datashader fastparquet python-snappy pandas
), I instead get colors completely mangled in a way that suggests getting different categories per dask partition. Note that the latest version on defaults (
conda create -n censusnew python=3.9 notebook dask datashader fastparquet python-snappy pandas
) just dies with an error, perhaps due to the very old fastparquet there.