Modernize Gerrymandering [new]

jtao1 commented 3 months ago

I created a new branch that includes the changes the starter code from Maxime.

github-actions[bot] commented 3 months ago

Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR: https://holoviz-dev.github.io/examples/. You can also download an archive of the site from the workflow summary page which comes in handy when your dev site built was overriden by another PR (we have a single dev site!).

ahuang11 commented 3 months ago

Okay a few things:

First:

the data coordinates are natively in Mercator coordinates (meters, x=easting & y=northing, ranges usually in the 10000s)
tiles are also natively in Mercator coordinates
this is different than PlateCarree coordinates (degrees y=latitude & x=longitude, usually ranges from -90 to 90N, & -180 to 180E respectively)

So, based on that context, we should not be using geo=True here because it's a simply a convenience method to set crs="PlateCarree" and then internally the projection defaults to "PlateCarree" too. Instead, we should be explicitly setting crs=ccrs.GOOGLE_MERCATOR and projection=ccrs.GOOGLE_MERCATOR to reduce ambiguity.

Second: The districts geometry coordinates are in degrees

That means, if you set geo=True WITHOUT setting a projection that matches the original data, the map will be showing two different coordinates (districts in degrees, tiles & data in meters), and will not display one or the other properly. Set them consistently and you should see both overlaid properly.

But wait! It still doesn't show up?

That's because of z-ordering, i.e. tiles is hiding the districts! Change the ordering of how it gets multiplied and tada!

@jtao1 feel free to message me if any of this doesn't make sense, and I can elaborate more.

Other than that, for us maintainers, geoviews + datashader needs fixing; lots of bounciness with aspect when zooming in: https://github.com/holoviz/geoviews/issues/673

maximlt commented 3 months ago

Thanks a lot @ahuang11 for diving into that, really appreciate it! Despite being a small example, gerrymandering definitely has some complexity.

Instead, we should be explicitly setting crs=ccrs.GOOGLE_MERCATOR and projection=ccrs.GOOGLE_MERCATOR to reduce ambiguity.

Do you happen to know if this is a no-op or if it leads to some more computation?

On districts, the original code was as follows. I somehow missed that the layer was projected to Web Mercator with gv.project!

shape_path = './data/congressional_districts/cb_2015_us_cd114_5m.shp'
districts = gv.Shape.from_shapefile(shape_path, crs=crs.PlateCarree())
districts = gv.project(districts)

I assume the equivalent code with geopandas + hvPlot is this, using project=True (hoping I don't need to set crs since it's an attribute of the GeoDataFrame).

shape_path = './data/congressional_districts/cb_2015_us_cd114_5m.shp'
gdf_districts = gpd.read_file(shape_path)
districts = gdf_districts.hvplot.polygons(project=True, color=None)

By the way, I'm the one who introduced GeoPandas in this example (see https://github.com/holoviz-topics/examples/pull/365 that was meant to help Jason getting started 🙈 ). That was quite an arbitrary decision. But once it's there, perhaps the most natural thing to do is just to re-project with GeoPandas directly? This means that we entirely avoid using GeoViews.

shape_path = './data/congressional_districts/cb_2015_us_cd114_5m.shp'
gdf_districts = gpd.read_file(shape_path)
gdf_districts = gdf_districts.to_crs(epsg=3857)
districts = gdf_districts.hvplot.polygons(color=None)

I think the example should talk a bit more about projects. For example, it doesn't say anything about the projection of the census dataset.

@ahuang11 feel free to ignore the suggestions I made to modernize this example, as I'm not quite sure what's the Holoviz-y way to implement it these days. The less users have to think about projections, the better, but in this example, they have to.

ahuang11 commented 3 months ago

Seems like No op?

I tried running it in different order to ensure it wasn't cached on the second run, and it's actually faster (I think without it, internally has to make some guesses on which crs/projection)

I personally prefer explicit over implicit (zen of python)

The less users have to think about projections, the better, but in this example, they have to.

I think if they are unaware, it leads to confusion, e.g. Jason's comment about geo=True breaks it: https://github.com/holoviz-topics/examples/pull/369#discussion_r1531517198

@jtao1, since Maxime and I have differing opinions, feel free to experiment and see which one is more natural to you as a newcomer and go that route since these examples are meant for other newcomers as well.

maximlt commented 3 months ago

@jtao1, since Maxime and I have differing opinions, feel free to experiment and see which one is more natural to you as a newcomer and go that route since these examples are meant for other newcomers as well.

Oh I don't think we have different opinions since I don't have a strong opinion there, I'm just unsure about which approach to follow. I find it curious that before updating the project to hvPlot the code was never explicitly setting GOOGLE_MERCATOR (see https://examples.holoviz.org/gallery/gerrymandering/gerrymandering.html) but now we do 🤔 @jtao1 I guess that's a good topic to discuss during the meeting this week.

ahuang11 commented 3 months ago

From the old example:

the code was never explicitly setting GOOGLE_MERCATOR

hv.Points(df, kdims=['easting', 'northing'], vdims=['race']) the data is already in easting/northing

tiles = gv.WMTS('https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg') already is easting/northing

The only thing unclear to me was: districts = gv.project(districts), which apparently this defaults to GOOGLE_MERCATOR

maximlt commented 3 months ago

We talked about Gerrymandering during our weekly meeting today:

df.hvplot.points(..., crs=ccrs.GOOGLE_MERCATOR, projection=ccrs.GOOGLE_MERCATOR) is a pattern that we don't really want to expose, i.e.df.hvplot.points(...) is preferred.
Use project=True (e.g. gdf.hvplot.polygons(..., project=True)) to project once the US districts to Web Mercator. Doing so, GeoViews doesn't need to be imported.
I would say, try to adapt the text to make sense of the projections and help the reader understand why we set project=True:
- Web tiles in Bokeh are displayed in easting/northing i.e. Web Mercator. Therefore, it's the "natural" projection for displaying geo data in Bokeh.
- The census dataset is already in this projection.
- We project the districts, which were in another projection (you can check the .crs property of the GeoDataFrame) upfront with project=True to pay the cost of projecting only once.

I have opened https://github.com/holoviz/hvplot/pull/1299 to add a tiles_opts parameter to hvPlot to replace opts.WMTS(alpha=0.5) in the old example with tiles_opts={'alpha': 0.5}. I also plan to implement https://github.com/holoviz/hvplot/issues/1289 to add support to more tiles to hvPlot. I think it will be useful for this example, the current background map has a color palette that is somewhat mixed the data color palette (greens for example), I think we should find a more appropriate background map. These two changes will be released in hvPlot 0.9.3, hopefully quite soon! @jtao1 I'll let you know when it's released so you can update the project (code, and project and lock file).

Please also evaluate the performance of interacting (zooming in and out, panning) with the last plot (combined tiles, points, and districts). If it's not great, you can try to rasterize the districts with <>.hvplot.polygons(..., rasterize=True) and see if it improves things. Please report back, a GIF would definitely help us gauge the performance of the plot.

jtao1 commented 2 months ago

Final plot performance video: https://drive.google.com/file/d/1rNqRKkDsaCoC6NM0lBk8R6bIa6ReE7f8/view?usp=sharing

droumis commented 2 months ago

Hey @jtao1 , if you drag a video into a comment box, it should upload and render when posted. Much better approach than having to host externally.

https://github.com/holoviz-topics/examples/assets/6613202/cbe5dd45-28e8-4368-ac64-d87858b084df

maximlt commented 2 months ago

@jtao1 please make these changes so I can review this PR:

Clear the notebook from its outputs
Revert the changes you made in https://github.com/holoviz-topics/examples/pull/369/commits/39a4f5c2c3536fd8846f4670de2f43ed4b4b8ebb as they are a about portfolio_optimizer

I also spotted a couple of typos so please check the notebook again.

Once you are done, ping me so I can start reviewing.

github-actions[bot] commented 2 months ago

Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR: https://holoviz-dev.github.io/examples/. You can also download an archive of the site from the workflow summary page which comes in handy when your dev site built was overriden by another PR (we have a single dev site!).

github-actions[bot] commented 1 month ago

Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR: https://holoviz-dev.github.io/examples/. You can also download an archive of the site from the workflow summary page which comes in handy when your dev site built was overriden by another PR (we have a single dev site!).

github-actions[bot] commented 1 month ago

Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR: https://holoviz-dev.github.io/examples/. You can also download an archive of the site from the workflow summary page which comes in handy when your dev site built was overriden by another PR (we have a single dev site!).

maximlt commented 1 month ago

Notebook failing to load the parquet file on the CI with ValueError: ArrowStringArray requires a PyArrow (chunked) array of large_string type, not on my machine though. I need to investigate.

maximlt commented 1 month ago

Going to dump here some notes about the dask/parquet issue.

I made this project "compliant" with the new infrastructure end of 2023. Before that, it used to rely on fastparquet 0.5.0. However, in a newer environment with fastparquet=='2023.8.0' I could no longer read the file, instead, I got DecompressionError: snappy: corrupt input (header mismatch; expected 76668764 decompressed bytes but got 76668681). At the time I tried to replace the engine with PyArrow and got another error: OSError: Unexpected end of stream: Page was smaller (26787590) than expected (26787635).

The old file was a 1.36Gb zip archive, with its content dating from May 2017:

I asked Martin for some help and he kindly "fixed" the file to work with the newer version of fastparquet. This is the file currently available on S3.

I re-built an environment with this older version of fastparquet and loaded the dataset (found the old version in my hard disk):

Note the dtype of race is category.

Some more info:

The new file is a 1.41Gb zip archive with the same content structure:

With the new file and newer version of fastparquet (so before any change made in this PR):

And some more info:

Note the dtype of race is no longer category but object. Somehow, the categorical dtype was lost in the file and/or library update.

When the project was recently relocked, it pulled the latest version of Dask available on the defaults channel (dask=2024.5.0, before this PR dask=2023.6.0). A couple of versions before, Dask migrated the implementation of dask.dataframe to enable query planning, based on a rewrite of DataFrame using dask-expr. More details in https://docs.dask.org/en/stable/changelog.html#v2024-3-0.

Also, in version 2024.1.0, Dask deprecated the fastparquet engine. More details in https://docs.dask.org/en/stable/changelog.html#v2024-1-0.

So, with this new version of Dask reading the Parquet file broke with:

File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/io/parquet.py:1491, in _set_parquet_engine(engine, meta)
   1488 def _set_parquet_engine(engine=None, meta=None):
   1489     # Use `engine` or `meta` input to set the parquet engine
   1490     if engine == "fastparquet":
-> 1491         raise NotImplementedError("Fastparquet engine is not supported")
   1493     if engine is None:
   1494         if (
   1495             meta is not None and typename(meta).split(".")[0] == "cudf"
   1496         ) or dask.config.get("dataframe.backend", "pandas") == "cudf":

NotImplementedError: Fastparquet engine is not supported

I then just tried to read the parquet file using the pyarrow engine. It worked locally on my Mac:

However, it failed on the CI and I don't know why:

ValueError: ArrowStringArray requires a PyArrow (chunked) array of large_string type

Full traceback

```python ValueError Traceback (most recent call last) Cell In[1], line 1 ----> 1 df = dd.read_parquet('./data/census.snappy.parq', engine='pyarrow') 2 df = df.persist() File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/_collection.py:5433, in read_parquet(path, columns, filters, categories, index, storage_options, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, engine, arrow_to_pandas, **kwargs) 5410 raise NotImplementedError( 5411 "engine is not supported when using the pyarrow filesystem." 5412 ) 5414 return new_collection( 5415 ReadParquetPyarrowFS( 5416 path, (...) 5429 ) 5430 ) 5432 return new_collection( -> 5433 ReadParquetFSSpec( 5434 path, 5435 columns=_convert_to_list(columns), 5436 filters=filters, 5437 categories=categories, 5438 index=index, 5439 storage_options=storage_options, 5440 calculate_divisions=calculate_divisions, 5441 ignore_metadata_file=ignore_metadata_file, 5442 metadata_task_size=metadata_task_size, 5443 split_row_groups=split_row_groups, 5444 blocksize=blocksize, 5445 aggregate_files=aggregate_files, 5446 parquet_file_extension=parquet_file_extension, 5447 filesystem=filesystem, 5448 engine=_set_parquet_engine(engine), 5449 kwargs=kwargs, 5450 _series=isinstance(columns, str), 5451 ) 5452 ) File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/_core.py:57, in Expr.__new__(cls, *args, **kwargs) 55 inst = object.__new__(cls) 56 inst.operands = [_unpack_collections(o) for o in operands] ---> 57 _name = inst._name 58 if _name in Expr._instances: 59 return Expr._instances[_name] File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/functools.py:1001, in cached_property.__get__(self, instance, owner) 999 val = cache.get(self.attrname, _NOT_FOUND) 1000 if val is _NOT_FOUND: -> 1001 val = self.func(instance) 1002 try: 1003 cache[self.attrname] = val File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/io/parquet.py:776, in ReadParquet._name(self) 770 @cached_property 771 def _name(self): 772 return ( 773 self._funcname 774 + "-" 775 + _tokenize_deterministic( --> 776 funcname(type(self)), self.checksum, *self.operands[:-1] 777 ) 778 ) File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/io/parquet.py:782, in ReadParquet.checksum(self) 780 @property 781 def checksum(self): --> 782 return self._dataset_info["checksum"] File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/io/parquet.py:1375, in ReadParquetFSSpec._dataset_info(self) 1372 dataset_info["checksum"] = tokenize(checksum) 1374 # Infer meta, accounting for index and columns arguments. -> 1375 meta = self.engine._create_dd_meta(dataset_info) 1376 index = dataset_info["index"] 1377 index = [index] if isinstance(index, str) else index File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:1215, in ArrowDatasetEngine._create_dd_meta(cls, dataset_info) 1213 convert_string = dataset_info["kwargs"]["convert_string"] 1214 dtype_backend = dataset_info["kwargs"]["dtype_backend"] -> 1215 meta = cls._arrow_table_to_pandas( 1216 schema.empty_table(), 1217 categories, 1218 arrow_to_pandas=arrow_to_pandas, 1219 dtype_backend=dtype_backend, 1220 convert_string=convert_string, 1221 ) 1222 index_names = list(meta.index.names) 1223 column_names = list(meta.columns) File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:1878, in ArrowDatasetEngine._arrow_table_to_pandas(cls, arrow_table, categories, dtype_backend, convert_string, **kwargs) 1875 if types_mapper is not None: 1876 _kwargs["types_mapper"] = types_mapper -> 1878 res = arrow_table.to_pandas(categories=categories, **_kwargs) 1879 # TODO: remove this when fixed in pyarrow: https://github.com/apache/arrow/issues/34283 1880 if ( 1881 convert_string 1882 and isinstance(res.index, pd.Index) (...) 1886 not in (pd.StringDtype("pyarrow"), pd.ArrowDtype(pa.string())) 1887 ): File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/array.pxi:884, in pyarrow.lib._PandasConvertible.to_pandas() File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/table.pxi:4192, in pyarrow.lib.Table._to_pandas() File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/pandas_compat.py:776, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper) 774 _check_data_column_metadata_consistency(all_columns) 775 columns = _deserialize_column_index(table, all_columns, column_indexes) --> 776 blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes) 778 axes = [columns, index] 779 mgr = BlockManager(blocks, axes) File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/pandas_compat.py:1131, in _table_to_blocks(options, block_table, categories, extension_columns) 1128 columns = block_table.column_names 1129 result = pa.lib.table_to_blocks(options, block_table, categories, 1130 list(extension_columns.keys())) -> 1131 return [_reconstruct_block(item, columns, extension_columns) 1132 for item in result] File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/pandas_compat.py:1131, in (.0) 1128 columns = block_table.column_names 1129 result = pa.lib.table_to_blocks(options, block_table, categories, 1130 list(extension_columns.keys())) -> 1131 return [_reconstruct_block(item, columns, extension_columns) 1132 for item in result] File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/pandas_compat.py:736, in _reconstruct_block(item, columns, extension_columns) 733 if not hasattr(pandas_dtype, '__from_arrow__'): 734 raise ValueError("This column does not support to be converted " 735 "to a pandas ExtensionArray") --> 736 pd_ext_arr = pandas_dtype.__from_arrow__(arr) 737 block = _int.make_block(pd_ext_arr, placement=placement) 738 else: File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pandas/core/arrays/string_.py:217, in StringDtype.__from_arrow__(self, array) 214 if self.storage == "pyarrow": 215 from pandas.core.arrays.string_arrow import ArrowStringArray --> 217 return ArrowStringArray(array) 218 elif self.storage == "pyarrow_numpy": 219 from pandas.core.arrays.string_arrow import ArrowStringArrayNumpySemantics File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pandas/core/arrays/string_arrow.py:143, in ArrowStringArray.__init__(self, values) 137 self._dtype = StringDtype(storage=self._storage) 139 if not pa.types.is_large_string(self._pa_array.type) and not ( 140 pa.types.is_dictionary(self._pa_array.type) 141 and pa.types.is_large_string(self._pa_array.type.value_type) 142 ): --> 143 raise ValueError( 144 "ArrowStringArray requires a PyArrow (chunked) array of " 145 "large_string type" 146 ) ValueError: ArrowStringArray requires a PyArrow (chunked) array of large_string type ```

I was able to reproduce the traceback observed on the CI by explicitly setting categories with:

df = dd.read_parquet('./data/census.snappy.parq', engine='pyarrow', categories=['race'])

Also note that the dtype of race is now string[pyarrow] and no longer object. I'm not sure to which extent hvPlot/HoloViews is fine with handling this pyarrow dtype.

maximlt commented 1 month ago

https://github.com/holoviz/datashader/issues/1202 and https://github.com/holoviz/datashader/pull/1239 have some interesting information about categorical handling in dask / datashader.

github-actions[bot] commented 1 month ago

Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR: https://holoviz-dev.github.io/examples/. You can also download an archive of the site from the workflow summary page which comes in handy when your dev site built was overriden by another PR (we have a single dev site!).

maximlt commented 1 month ago

Ok so I ended up with keeping pyarrow as the engine but adding this before the imports:

import dask

dask.config.set({"dataframe.convert-string": False})
dask.config.set({"dataframe.query-planning": False})

Since HoloViews does that too in its test suite, meaning that there's isn't yet "official" support for these two features (query planner and pyarrow string): https://github.com/holoviz/holoviews/blob/6b0121d5a3685989fca58a1687961523a5fd575c/holoviews/tests/conftest.py#L61-L62

github-actions[bot] commented 1 month ago

Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR: https://holoviz-dev.github.io/examples/. You can also download an archive of the site from the workflow summary page which comes in handy when your dev site built was overriden by another PR (we have a single dev site!).

holoviz-topics / examples

Modernize Gerrymandering [new] #369