holoviz-topics / examples

Visualization-focused examples of using HoloViz for specific topics
https://examples.holoviz.org
Creative Commons Attribution 4.0 International

Update pins on some examples #400

Open · Azaya89 opened 3 months ago

Azaya89 commented 3 months ago

This PR updates some of the dependencies in the nyc_taxi and glaciers examples.

maximlt commented 3 months ago

New issue: it appears the .parq file in the nyc_taxi example is no longer being read correctly by fastparquet.

This is a problem I also ran into in https://github.com/holoviz-topics/examples/pull/369. The last comment there was:

Ok, so I ended up keeping pyarrow as the engine but adding this before the imports:

```python
import dask

dask.config.set({"dataframe.convert-string": False})
dask.config.set({"dataframe.query-planning": False})
```

HoloViews does that too in its test suite, meaning that there isn't yet "official" support for these two features (the query planner and pyarrow-backed strings): https://github.com/holoviz/holoviews/blob/6b0121d5a3685989fca58a1687961523a5fd575c/holoviews/tests/conftest.py#L61-L62

However, since then, HoloViews no longer sets `dask.config.set({"dataframe.query-planning": False})` (it still sets `dask.config.set({"dataframe.convert-string": False})`):

https://github.com/holoviz/holoviews/blob/e5f7aede7a58902677eb995b8fd67c54ae9ae3ab/holoviews/tests/conftest.py#L55-L60

My suggestions:

  • Try with engine='pyarrow' and see whether the notebook runs fine. Don't set any of the dask.config options yet; maybe it works without them.
  • If it doesn't work, start with dask.config.set({"dataframe.convert-string": False}).
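For reference, a minimal sketch of trying these steps (the data/nyc_taxi_wide.parq path is an assumption based on the nyc_taxi example; adjust it to the notebook's actual file):

```python
import dask

# Second suggestion, only if the plain pyarrow read below fails; like the
# earlier workaround, this would go before importing dask.dataframe:
# dask.config.set({"dataframe.convert-string": False})

import dask.dataframe as dd

# First suggestion: read with the pyarrow engine and no config changes.
df = dd.read_parquet("data/nyc_taxi_wide.parq", engine="pyarrow")
df = df.persist()
```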

jbednar commented 3 months ago

Note that in the past, pyarrow and fastparquet had very different performance from each other in certain workloads, so ideally you'd at least qualitatively compare the old pinned version with the new version, and make sure that performance has not significantly degraded.

Azaya89 commented 3 months ago
  • Try with engine='pyarrow' and see whether the notebook runs fine. Don't set any of the dask.config options yet; maybe it works without them.
  • If it doesn't work, start with dask.config.set({"dataframe.convert-string": False}).

I have tried each of the suggestions, individually and all together, but it still fails with the same error. Here's the full traceback:


```python
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
File :2

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/base.py:348, in DaskMethodsMixin.persist(self, **kwargs)
    309 def persist(self, **kwargs):
    310     """Persist this dask collection into memory
    311
    312     This turns a lazy Dask collection into a Dask collection with the same
   (...)
    346     dask.persist
    347     """
--> 348 (result,) = persist(self, traverse=False, **kwargs)
    349 return result

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/base.py:998, in persist(traverse, optimize_graph, scheduler, *args, **kwargs)
    995     postpersists.append((rebuild, a_keys, state))
    997 with shorten_traceback():
--> 998     results = schedule(dsk, keys, **kwargs)
   1000 d = dict(zip(keys, results))
   1001 results2 = [r({k: d[k] for k in ks}, *s) for r, ks, s in postpersists]

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:97, in ParquetFunctionWrapper.__call__(self, part)
     94 if not isinstance(part, list):
     95     part = [part]
---> 97 return read_parquet_part(
     98     self.fs,
     99     self.engine,
    100     self.meta,
    101     [
    102         # Temporary workaround for HLG serialization bug
    103         # (see: https://github.com/dask/dask/issues/8581)
    104         (p.data["piece"], p.data.get("kwargs", {}))
    105         if hasattr(p, "data")
    106         else (p["piece"], p.get("kwargs", {}))
    107         for p in part
    108     ],
    109     self.columns,
    110     self.index,
    111     self.common_kwargs,
    112 )

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:645, in read_parquet_part(fs, engine, meta, part, columns, index, kwargs)
    642 if len(part) == 1 or part[0][1] or not check_multi_support(engine):
    643     # Part kwargs expected
    644     func = engine.read_partition
--> 645     dfs = [
    646         func(
    647             fs,
    648             rg,
    649             columns.copy(),
    650             index,
    651             **toolz.merge(kwargs, kw),
    652         )
    653         for (rg, kw) in part
    654     ]
    655     df = concat(dfs, axis=0) if len(dfs) > 1 else dfs[0]
    656 else:
    657     # No part specific kwargs, let engine read
    658     # list of parts at once

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:646, in <listcomp>(.0)
    642 if len(part) == 1 or part[0][1] or not check_multi_support(engine):
    643     # Part kwargs expected
    644     func = engine.read_partition
    645 dfs = [
--> 646     func(
    647         fs,
    648         rg,
    649         columns.copy(),
    650         index,
    651         **toolz.merge(kwargs, kw),
    652     )
    653     for (rg, kw) in part
    654 ]
    655 df = concat(dfs, axis=0) if len(dfs) > 1 else dfs[0]
    656 else:
    657     # No part specific kwargs, let engine read
    658     # list of parts at once

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:641, in ArrowDatasetEngine.read_partition(cls, fs, pieces, columns, index, dtype_backend, categories, partitions, filters, schema, **kwargs)
    638     row_group = [row_group]
    640 # Read in arrow table and convert to pandas
--> 641 arrow_table = cls._read_table(
    642     path_or_frag,
    643     fs,
    644     row_group,
    645     columns,
    646     schema,
    647     filters,
    648     partitions,
    649     partition_keys,
    650     **kwargs,
    651 )
    652 if multi_read:
    653     tables.append(arrow_table)

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:1774, in ArrowDatasetEngine._read_table(cls, path_or_frag, fs, row_groups, columns, schema, filters, partitions, partition_keys, **kwargs)
   1767     arrow_table = frag.to_table(
   1768         use_threads=False,
   1769         schema=schema,
   1770         columns=cols,
   1771         filter=_filters_to_expression(filters) if filters else None,
   1772     )
   1773 else:
-> 1774     arrow_table = _read_table_from_path(
   1775         path_or_frag,
   1776         fs,
   1777         row_groups,
   1778         columns,
   1779         schema,
   1780         filters,
   1781         **kwargs,
   1782     )
   1784 # For pyarrow.dataset api, if we did not read directly from
   1785 # fragments, we need to add the partitioned columns here.
   1786 if partitions and isinstance(partitions, list):

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:271, in _read_table_from_path(path, fs, row_groups, columns, schema, filters, **kwargs)
    264     return pq.ParquetFile(fil, **pre_buffer).read(
    265         columns=columns,
    266         use_threads=False,
    267         use_pandas_metadata=True,
    268         **read_kwargs,
    269     )
    270 else:
--> 271     return pq.ParquetFile(fil, **pre_buffer).read_row_groups(
    272         row_groups,
    273         columns=columns,
    274         use_threads=False,
    275         use_pandas_metadata=True,
    276         **read_kwargs,
    277     )

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/parquet/core.py:537, in ParquetFile.read_row_groups(self, row_groups, columns, use_threads, use_pandas_metadata)
    495 """
    496 Read a multiple row groups from a Parquet file.
    497 (...)
    533 animal: [["Flamingo","Parrot","Dog",...,"Brittle stars","Centipede"]]
    534 """
    535 column_indices = self._get_column_indices(
    536     columns, use_pandas_metadata=use_pandas_metadata)
--> 537 return self.reader.read_row_groups(row_groups,
    538     column_indices=column_indices,
    539     use_threads=use_threads)

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/_parquet.pyx:1418, in pyarrow._parquet.ParquetReader.read_row_groups()

File ~/Documents/development/holoviz-topics-examples/nyc_taxi/envs/default/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

OSError: RLE encoding only supports BOOLEAN
```

maximlt commented 3 months ago

OK, thanks for the report. It looks like the file cannot be read with pyarrow. We'll have to read it with fastparquet (for that, dask-expr will have to be disabled) and save it again using pyarrow.

Azaya89 commented 3 months ago

OK, thanks for the report. It looks like the file cannot be read with pyarrow. We'll have to read it with fastparquet (for that, dask-expr will have to be disabled) and save it again using pyarrow.

Can you guide me on how I can do this?
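For reference, a minimal sketch of the suggested round-trip (assuming the example's file is data/nyc_taxi_wide.parq; the actual path may differ):

```python
import dask

# dask-expr (the query planner) only supports the pyarrow engine, so it
# must be disabled before importing dask.dataframe to use fastparquet.
dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd

# Read the old file with fastparquet, which can still decode it...
df = dd.read_parquet("data/nyc_taxi_wide.parq", engine="fastparquet")

# ...then rewrite it with pyarrow so future pyarrow reads succeed.
df.to_parquet("data/nyc_taxi_wide_pyarrow.parq", engine="pyarrow")
```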

Suggestion: clarify that this is needed to avoid a warning emitted when datashader internally imports dask.dataframe.

OK. I'll make it clearer.
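For instance, the clarified comment might read like this in the notebook (assuming the suggestion refers to the dask.config setting discussed above):

```python
import dask

# Needed to avoid a warning emitted when datashader internally
# imports dask.dataframe.
dask.config.set({"dataframe.convert-string": False})
```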