Closed: jbednar closed this issue 1 year ago
I get the right coloring when using engine="fastparquet" and the wrong coloring with engine="pyarrow" in dd.read_parquet.
Thanks, @Hoxbro and @martindurant! Looks like indeed those crazy patterns are to do with pyarrow, presumably using different category values in each partition.
Unfortunately, looks like there is still an issue even with fastparquet, because when I specify the color mapping, the newer environments don't match the color key:
import datashader as ds, dask.dataframe as dd
color_key = {'w':'aqua', 'b':'lime', 'a':'red', 'h':'fuchsia', 'o':'yellow' }
df = dd.read_parquet('./data/census2010.parq', engine='fastparquet')
cvs = ds.Canvas(plot_width=900, plot_height=525, x_range=[-14E6, -7.4E6], y_range=[2.7E6, 6.4E6])
agg = cvs.points(df, 'easting', 'northing', ds.count_cat('race'))
img = ds.tf.shade(agg, how='eq_hist', color_key=color_key)
img
The older versions in censusold do match the color key for the same code, with e.g. Maine being colored aqua:
@martindurant, any idea how that could have happened (given that the user code and the datashader version are the same in both environments, unless I got confused)?
I'm not sure what's involved in converting values in parquet to colours, or what count_cat does. I think I would do a simple .value_counts() to ensure that the gross totals are roughly right for the categories before going further.
Looks right:
(same listing for both censusold and censusnew)
To be concrete, for the same code and the same Datashader version, different output despite the dataframe seemingly having the same counts:
conda list
# packages in environment at /Users/jbednar/miniconda3/envs/censusnew:
#
# Name Version Build Channel
anyio 3.6.2 pyhd8ed1ab_0 conda-forge
appnope 0.1.3 pyhd8ed1ab_0 conda-forge
argon2-cffi 21.3.0 pyhd8ed1ab_0 conda-forge
argon2-cffi-bindings 21.2.0 py39ha30fb19_3 conda-forge
arrow-cpp 11.0.0 h694c41f_15_cpu conda-forge
asttokens 2.2.1 pyhd8ed1ab_0 conda-forge
attrs 22.2.0 pyh71513ae_0 conda-forge
aws-c-auth 0.6.26 hb063c81_3 conda-forge
aws-c-cal 0.5.21 hf54dd2f_2 conda-forge
aws-c-common 0.8.14 hb7f2c08_0 conda-forge
aws-c-compression 0.2.16 h99c63db_5 conda-forge
aws-c-event-stream 0.2.20 hbf6f731_5 conda-forge
aws-c-http 0.7.6 h58d2db5_1 conda-forge
aws-c-io 0.13.21 had634fe_0 conda-forge
aws-c-mqtt 0.8.6 h14bedde_13 conda-forge
aws-c-s3 0.2.8 h7103d8a_1 conda-forge
aws-c-sdkutils 0.1.9 h99c63db_0 conda-forge
aws-checksums 0.1.14 h99c63db_5 conda-forge
aws-crt-cpp 0.19.9 hb02fd3d_2 conda-forge
aws-sdk-cpp 1.10.57 h74c80f7_9 conda-forge
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 pyhd8ed1ab_3 conda-forge
backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge
beautifulsoup4 4.12.2 pyha770c72_0 conda-forge
bleach 6.0.0 pyhd8ed1ab_0 conda-forge
bokeh 3.1.0 pyhd8ed1ab_0 conda-forge
brotlipy 0.7.0 py39ha30fb19_1005 conda-forge
bzip2 1.0.8 h0d85af4_4 conda-forge
c-ares 1.18.1 h0d85af4_0 conda-forge
ca-certificates 2022.12.7 h033912b_0 conda-forge
certifi 2022.12.7 pyhd8ed1ab_0 conda-forge
cffi 1.15.1 py39h131948b_3 conda-forge
charset-normalizer 3.1.0 pyhd8ed1ab_0 conda-forge
click 8.1.3 unix_pyhd8ed1ab_2 conda-forge
cloudpickle 2.2.1 pyhd8ed1ab_0 conda-forge
colorcet 3.0.1 pyhd8ed1ab_0 conda-forge
comm 0.1.3 pyhd8ed1ab_0 conda-forge
contourpy 1.0.7 py39h92daf61_0 conda-forge
cramjam 2.6.2 py39hd4bc93a_0 conda-forge
cryptography 40.0.2 py39hbeae22c_0 conda-forge
cytoolz 0.12.0 py39ha30fb19_1 conda-forge
dask 2023.4.0 pyhd8ed1ab_0 conda-forge
dask-core 2023.4.0 pyhd8ed1ab_0 conda-forge
datashader 0.14.4 pyh1a96a4e_0 conda-forge
datashape 0.5.4 py_1 conda-forge
debugpy 1.6.7 py39h7a8716b_0 conda-forge
decorator 5.1.1 pyhd8ed1ab_0 conda-forge
defusedxml 0.7.1 pyhd8ed1ab_0 conda-forge
distributed 2023.4.0 pyhd8ed1ab_0 conda-forge
entrypoints 0.4 pyhd8ed1ab_0 conda-forge
executing 1.2.0 pyhd8ed1ab_0 conda-forge
fastparquet 2023.2.0 py39h7cc1f47_0 conda-forge
flit-core 3.8.0 pyhd8ed1ab_0 conda-forge
freetype 2.12.1 h3f81eb7_1 conda-forge
fsspec 2023.4.0 pyh1a96a4e_0 conda-forge
gflags 2.2.2 hb1e8313_1004 conda-forge
glog 0.6.0 h8ac2a54_0 conda-forge
idna 3.4 pyhd8ed1ab_0 conda-forge
importlib-metadata 6.6.0 pyha770c72_0 conda-forge
importlib_metadata 6.6.0 hd8ed1ab_0 conda-forge
importlib_resources 5.12.0 pyhd8ed1ab_0 conda-forge
ipykernel 6.22.0 pyh736e0ef_0 conda-forge
ipython 8.12.0 pyhd1c38e8_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.18.2 pyhd8ed1ab_0 conda-forge
jinja2 3.1.2 pyhd8ed1ab_1 conda-forge
jsonschema 4.17.3 pyhd8ed1ab_0 conda-forge
jupyter_client 8.2.0 pyhd8ed1ab_0 conda-forge
jupyter_core 5.3.0 py39h6e9494a_0 conda-forge
jupyter_events 0.6.3 pyhd8ed1ab_0 conda-forge
jupyter_server 2.5.0 pyhd8ed1ab_0 conda-forge
jupyter_server_terminals 0.4.4 pyhd8ed1ab_1 conda-forge
jupyterlab_pygments 0.2.2 pyhd8ed1ab_0 conda-forge
krb5 1.20.1 h049b76e_0 conda-forge
lcms2 2.15 h2dcdeff_1 conda-forge
lerc 4.0.0 hb486fe8_0 conda-forge
libabseil 20230125.0 cxx17_hf0c8a7f_1 conda-forge
libarrow 11.0.0 h53a6c5b_15_cpu conda-forge
libblas 3.9.0 16_osx64_openblas conda-forge
libbrotlicommon 1.0.9 hb7f2c08_8 conda-forge
libbrotlidec 1.0.9 hb7f2c08_8 conda-forge
libbrotlienc 1.0.9 hb7f2c08_8 conda-forge
libcblas 3.9.0 16_osx64_openblas conda-forge
libcrc32c 1.1.2 he49afe7_0 conda-forge
libcurl 8.0.1 h1fead75_0 conda-forge
libcxx 16.0.2 hd57cbcb_0 conda-forge
libdeflate 1.18 hac1461d_0 conda-forge
libedit 3.1.20191231 h0678c8f_2 conda-forge
libev 4.33 haf1e3a3_1 conda-forge
libevent 2.1.10 h7d65743_4 conda-forge
libffi 3.4.2 h0d85af4_5 conda-forge
libgfortran 5.0.0 11_3_0_h97931a8_31 conda-forge
libgfortran5 12.2.0 he409387_31 conda-forge
libgoogle-cloud 2.8.0 h176059f_1 conda-forge
libgrpc 1.52.1 h5bc3d57_1 conda-forge
libjpeg-turbo 2.1.5.1 hb7f2c08_0 conda-forge
liblapack 3.9.0 16_osx64_openblas conda-forge
libllvm11 11.1.0 h8fb7429_5 conda-forge
libnghttp2 1.52.0 he2ab024_0 conda-forge
libopenblas 0.3.21 openmp_h429af6e_3 conda-forge
libpng 1.6.39 ha978bb4_0 conda-forge
libprotobuf 3.21.12 hbc0c0cd_0 conda-forge
libsodium 1.0.18 hbcb3906_1 conda-forge
libsqlite 3.40.0 ha978bb4_1 conda-forge
libssh2 1.10.0 h47af595_3 conda-forge
libthrift 0.18.1 h16802d8_0 conda-forge
libtiff 4.5.0 hedf67fa_6 conda-forge
libutf8proc 2.8.0 hb7f2c08_0 conda-forge
libwebp-base 1.3.0 hb7f2c08_0 conda-forge
libxcb 1.13 h0d85af4_1004 conda-forge
libzlib 1.2.13 hfd90126_4 conda-forge
llvm-openmp 16.0.2 hff08bdf_0 conda-forge
llvmlite 0.39.1 py39had167e2_1 conda-forge
locket 1.0.0 pyhd8ed1ab_0 conda-forge
lz4 4.3.2 py39hd0af75a_0 conda-forge
lz4-c 1.9.4 hf0c8a7f_0 conda-forge
markupsafe 2.1.2 py39ha30fb19_0 conda-forge
matplotlib-inline 0.1.6 pyhd8ed1ab_0 conda-forge
mistune 2.0.5 pyhd8ed1ab_0 conda-forge
msgpack-python 1.0.5 py39h92daf61_0 conda-forge
multipledispatch 0.6.0 py_0 conda-forge
nbclassic 0.5.5 pyh8b2e9e2_0 conda-forge
nbclient 0.7.4 pyhd8ed1ab_0 conda-forge
nbconvert-core 7.3.1 pyhd8ed1ab_0 conda-forge
nbformat 5.8.0 pyhd8ed1ab_0 conda-forge
ncurses 6.3 h96cf925_1 conda-forge
nest-asyncio 1.5.6 pyhd8ed1ab_0 conda-forge
notebook 6.5.4 pyha770c72_0 conda-forge
notebook-shim 0.2.3 pyhd8ed1ab_0 conda-forge
numba 0.56.4 py39h6e2ba77_1 conda-forge
numpy 1.23.5 py39hdfa1d0c_0 conda-forge
openjpeg 2.5.0 h13ac156_2 conda-forge
openssl 3.1.0 h8a1eda9_2 conda-forge
orc 1.8.3 ha9d861c_0 conda-forge
packaging 23.1 pyhd8ed1ab_0 conda-forge
pandas 2.0.1 py39h11b3245_0 conda-forge
pandocfilters 1.5.0 pyhd8ed1ab_0 conda-forge
param 1.13.0 pyh1a96a4e_0 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.8.3 pyhd8ed1ab_0 conda-forge
partd 1.4.0 pyhd8ed1ab_0 conda-forge
pexpect 4.8.0 pyh1a96a4e_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 9.5.0 py39h77c96bc_0 conda-forge
pip 23.1.1 pyhd8ed1ab_0 conda-forge
pkgutil-resolve-name 1.3.10 pyhd8ed1ab_0 conda-forge
platformdirs 3.3.0 pyhd8ed1ab_0 conda-forge
pooch 1.7.0 pyha770c72_3 conda-forge
prometheus_client 0.16.0 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.38 pyha770c72_0 conda-forge
prompt_toolkit 3.0.38 hd8ed1ab_0 conda-forge
psutil 5.9.5 py39ha30fb19_0 conda-forge
pthread-stubs 0.4 hc929b4f_1001 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge
pyarrow 11.0.0 py39h105b94d_15_cpu conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pyct 0.4.6 py_0 conda-forge
pyct-core 0.4.6 py_0 conda-forge
pygments 2.15.1 pyhd8ed1ab_0 conda-forge
pyopenssl 23.1.1 pyhd8ed1ab_0 conda-forge
pyrsistent 0.19.3 py39ha30fb19_0 conda-forge
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.9.16 h709bd14_0_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-fastjsonschema 2.16.3 pyhd8ed1ab_0 conda-forge
python-json-logger 2.0.7 pyhd8ed1ab_0 conda-forge
python-snappy 0.6.1 py39hf74c2c1_0 conda-forge
python-tzdata 2023.3 pyhd8ed1ab_0 conda-forge
python_abi 3.9 3_cp39 conda-forge
pytz 2023.3 pyhd8ed1ab_0 conda-forge
pyyaml 6.0 py39ha30fb19_5 conda-forge
pyzmq 25.0.2 py39hed8f129_0 conda-forge
re2 2023.02.02 hf0c8a7f_0 conda-forge
readline 8.2 h9e318b2_1 conda-forge
requests 2.28.2 pyhd8ed1ab_1 conda-forge
rfc3339-validator 0.1.4 pyhd8ed1ab_0 conda-forge
rfc3986-validator 0.1.1 pyh9f0ad1d_0 conda-forge
scipy 1.10.1 py39h4c5e66d_0 conda-forge
send2trash 1.8.0 pyhd8ed1ab_0 conda-forge
setuptools 67.7.2 pyhd8ed1ab_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
snappy 1.1.10 h225ccf5_0 conda-forge
sniffio 1.3.0 pyhd8ed1ab_0 conda-forge
sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge
soupsieve 2.3.2.post1 pyhd8ed1ab_0 conda-forge
stack_data 0.6.2 pyhd8ed1ab_0 conda-forge
tblib 1.7.0 pyhd8ed1ab_0 conda-forge
terminado 0.17.1 pyhd1c38e8_0 conda-forge
tinycss2 1.2.1 pyhd8ed1ab_0 conda-forge
tk 8.6.12 h5dbffcc_0 conda-forge
toolz 0.12.0 pyhd8ed1ab_0 conda-forge
tornado 6.3 py39ha30fb19_0 conda-forge
traitlets 5.9.0 pyhd8ed1ab_0 conda-forge
typing-extensions 4.5.0 hd8ed1ab_0 conda-forge
typing_extensions 4.5.0 pyha770c72_0 conda-forge
tzdata 2023c h71feb2d_0 conda-forge
urllib3 1.26.15 pyhd8ed1ab_0 conda-forge
wcwidth 0.2.6 pyhd8ed1ab_0 conda-forge
webencodings 0.5.1 py_1 conda-forge
websocket-client 1.5.1 pyhd8ed1ab_0 conda-forge
wheel 0.40.0 pyhd8ed1ab_0 conda-forge
xarray 2023.4.2 pyhd8ed1ab_0 conda-forge
xorg-libxau 1.0.9 h35c211d_0 conda-forge
xorg-libxdmcp 1.1.3 h35c211d_0 conda-forge
xyzservices 2023.2.0 pyhd8ed1ab_0 conda-forge
xz 5.2.6 h775f41a_0 conda-forge
yaml 0.2.5 h0d85af4_2 conda-forge
zeromq 4.3.4 he49afe7_1 conda-forge
zict 3.0.0 pyhd8ed1ab_0 conda-forge
zipp 3.15.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.13 hfd90126_4 conda-forge
zstd 1.5.2 hbc0c0cd_6 conda-forge
Upon investigation, there is an assumption in datashader's handling of categorical columns that each partition has its categories sorted in the same order. A categorical aggregation is 3D, of shape (ny, nx, ncat) where ncat is the number of categories, and internally we don't use a category directly but its index into the sequence of categories. Each partition is internally consistent, but when combining the results from multiple partitions across the categories, the difference in indexes combines them incorrectly, resulting in different colors.
For the 2010 US census data loaded using pyarrow, the category order varies across the partitions (but is repeatable). Using fastparquet, the category orders are the same across all partitions, but this order is different from the order of categories used for the colormapping (which happens at the dask dataframe level, not the individual partition level).
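To make the index mismatch concrete, here is a minimal sketch (not datashader code) of how positionally combining per-partition count vectors goes wrong when two partitions order the same categories differently:

```python
import numpy as np
import pandas as pd

# Two partitions holding the same three categories in different orders,
# as pyarrow apparently produces (the values here are invented).
p0 = pd.Series(['a', 'a', 'b', 'w'], dtype=pd.CategoricalDtype(['a', 'b', 'w']))
p1 = pd.Series(['w', 'w', 'w', 'b', 'a'], dtype=pd.CategoricalDtype(['w', 'b', 'a']))

# Per-partition counts indexed by integer category code, analogous to
# the ncat axis of the (ny, nx, ncat) aggregate.
c0 = np.bincount(p0.cat.codes, minlength=3)  # order a, b, w -> [2, 1, 1]
c1 = np.bincount(p1.cat.codes, minlength=3)  # order w, b, a -> [3, 1, 1]

# A positional combine mixes 'a' counts with 'w' counts: wrong totals.
print(c0 + c1)  # [5, 2, 2], but the true totals are a=3, b=2, w=4
```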
There is a related issue on dask: https://github.com/dask/dask/issues/9467
We need to solve this within datashader, but there is fortunately a workaround. After the dask.dataframe.read_parquet call, add
df = df.categorize('race')
and the output is correct using either fastparquet 2023.2.0 or pyarrow 10.0.1.
I don't think that fastparquet should be recoding the column on load - it must be showing the real encoding in the files, the same across all of them. So what is pyarrow doing? No idea.
df = df.categorize('race')
Is there no cost associated with this?
There is significant cost in doing this.
Damn. I guess we need to move forward with the fix in Datashader, then.
The example works fine for dask <= 2022.7.0 and fails for dask >= 2022.7.1. The explanation is in the dask documentation at the bottom of this page: https://docs.dask.org/en/stable/dataframe-categoricals.html. The important quote is "If you write and read to parquet, Dask will forget known categories. This happens because, due to performance concerns, all the categories are saved in every partition rather than in the parquet metadata", and this is followed by an explanation of how to deal with it, which is something along the lines of
if not ddf.col.cat.known:
    ddf.col = ddf.col.cat.set_categories(ddf.col.head(1).cat.categories)
where col is a categorical column that we want to use.
We can replicate the error using the US census data, but this is too large for a repeatable test. We can also do a cycle of save to parquet followed by read from parquet to replicate it. But here is a simpler reproducer that we'll be able to add to the datashader test suite:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(data=dict(col = ['a', 'b', 'c', 'a', 'b', 'c', 'b', 'b', 'b', 'b', 'b', 'b']))
ddf = dd.from_pandas(df, npartitions=2)
ddf.col = ddf.col.astype('category')
for i in range(ddf.npartitions):
    partition = ddf.get_partition(i)
    print("Partition counts", i, dict(partition.col.value_counts().compute()))
which produces
Partition counts 0 {'a': 2, 'b': 2, 'c': 2}
Partition counts 1 {'b': 6}
If you use this in datashader, all the partition 1 'b' counts are assigned to categorical index 0 so they are combined with the partition 0 'a' counts, which is incorrect. Adding the recommended code from the dask docs:
if not ddf.col.cat.known:
    ddf.col = ddf.col.cat.set_categories(ddf.col.head(1).cat.categories)
gives
Partition counts 0 {'a': 2, 'b': 2, 'c': 2}
Partition counts 1 {'b': 6, 'a': 0, 'c': 0}
which works as expected. (The order of entries is different above, but the underlying Indexes have identical order and give the correct datashader output.)
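The effect of set_categories on the underlying integer codes can be seen in plain pandas (toy series, not datashader internals):

```python
import pandas as pd

# A partition that only ever saw 'b' starts with categories ['b'], so
# its code for 'b' is 0 and would collide with another partition's
# category 0 when combined positionally.
s0 = pd.Series(['a', 'b', 'c'], dtype='category')  # categories: a, b, c
s1 = pd.Series(['b', 'b'], dtype='category')       # categories: b
print(list(s1.cat.codes))                          # [0, 0]

# Re-coding against a shared category Index aligns the codes.
s1 = s1.cat.set_categories(s0.cat.categories)
print(list(s1.cat.codes))                          # [1, 1]: 'b' is index 1
```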
So the difference between fastparquet and pyarrow is that fastparquet saves the pandas categories as-is, using the existing coding, whereas arrow presumably must re-code on save.
If I create an outdated environment with
conda create -n censusold python=3.7 notebook 'dask<2022.6.2' datashader 'fastparquet<2023.2.0' python-snappy 'pandas<2'
categorical colormapping works fine for http://s3.amazonaws.com/datashader-data/census2010.parq.zip unpacked and used with this code. However, for the latest environment from conda-forge (
conda create -n censusnew -c conda-forge python=3.9 notebook dask datashader fastparquet python-snappy pandas
), I instead get colors completely mangled in a way that suggests getting different categories per dask partition. Note that the latest version on defaults (
conda create -n censusnew python=3.9 notebook dask datashader fastparquet python-snappy pandas
) just dies with an error, perhaps due to the very old fastparquet there.