Closed: jtao1 closed this pull request 6 months ago.
Your changes were successfully integrated in the dev site, make sure to review the pages of the projects you touched before merging this PR: https://holoviz-dev.github.io/examples/. You can also download an archive of the site from the workflow summary page, which comes in handy when your dev site build was overridden by another PR (we have a single dev site!).
Okay a few things:
First: the census data and tiles are in Web Mercator coordinates (i.e. x=easting & y=northing, ranges usually in the 10000s), not geographic coordinates (i.e. y=latitude & x=longitude, which usually range from -90 to 90N & -180 to 180E respectively). So, based on that context, we should not be using `geo=True` here, because it's simply a convenience method to set `crs="PlateCarree"`, and then internally the projection defaults to `"PlateCarree"` too. Instead, we should be explicitly setting `crs=ccrs.GOOGLE_MERCATOR` and `projection=ccrs.GOOGLE_MERCATOR` to reduce ambiguity.
Second: the districts' geometry coordinates are in degrees. That means that if you set `geo=True` WITHOUT setting a projection that matches the original data, the map will be showing two different coordinate systems (districts in degrees, tiles & data in meters) and will not display one or the other properly. Set them consistently and you should see both overlaid properly.
But wait! It still doesn't show up? That's because of z-ordering, i.e. the tiles are hiding the districts! Change the order in which they get multiplied and tada!
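To make those two points concrete, here is a rough sketch (not the notebook's exact code; `df`, `gdf_districts`, and the tile URL are placeholder names borrowed from snippets elsewhere in this thread):

```python
import cartopy.crs as ccrs
import geoviews as gv
import hvplot.pandas  # noqa: registers the .hvplot accessor

# The census points are already in Web Mercator, so declare that explicitly
# on both the input (crs) and output (projection) sides.
points = df.hvplot.points(
    'easting', 'northing',
    crs=ccrs.GOOGLE_MERCATOR,
    projection=ccrs.GOOGLE_MERCATOR,
    rasterize=True,
)

# The districts are in degrees; project them to Web Mercator as well.
districts = gdf_districts.hvplot.polygons(project=True, color=None)

tiles = gv.WMTS('https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg')

# Overlay order matters: later elements are drawn on top, so keep the
# districts after the tiles so they aren't hidden.
overlay = tiles * points * districts
```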
@jtao1 feel free to message me if any of this doesn't make sense, and I can elaborate more.
Other than that, for us maintainers, geoviews + datashader needs fixing; lots of bounciness with aspect when zooming in: https://github.com/holoviz/geoviews/issues/673
Thanks a lot @ahuang11 for diving into that, really appreciate it! Despite being a small example, gerrymandering definitely has some complexity.
> Instead, we should be explicitly setting `crs=ccrs.GOOGLE_MERCATOR` and `projection=ccrs.GOOGLE_MERCATOR` to reduce ambiguity.
Do you happen to know if this is a no-op or if it leads to some more computation?
On districts, the original code was as follows. I somehow missed that the layer was projected to Web Mercator with `gv.project`!
```python
shape_path = './data/congressional_districts/cb_2015_us_cd114_5m.shp'
districts = gv.Shape.from_shapefile(shape_path, crs=crs.PlateCarree())
districts = gv.project(districts)
```
I assume the equivalent code with GeoPandas + hvPlot is this, using `project=True` (hoping I don't need to set `crs` since it's an attribute of the GeoDataFrame):
```python
shape_path = './data/congressional_districts/cb_2015_us_cd114_5m.shp'
gdf_districts = gpd.read_file(shape_path)
districts = gdf_districts.hvplot.polygons(project=True, color=None)
```
By the way, I'm the one who introduced GeoPandas in this example (see https://github.com/holoviz-topics/examples/pull/365, which was meant to help Jason get started 🙈). That was quite an arbitrary decision. But now that it's there, perhaps the most natural thing to do is just to re-project with GeoPandas directly? This means that we entirely avoid using GeoViews:
```python
shape_path = './data/congressional_districts/cb_2015_us_cd114_5m.shp'
gdf_districts = gpd.read_file(shape_path)
gdf_districts = gdf_districts.to_crs(epsg=3857)
districts = gdf_districts.hvplot.polygons(color=None)
```
I think the example should talk a bit more about projections. For example, it doesn't say anything about the projection of the census dataset.
@ahuang11 feel free to ignore the suggestions I made to modernize this example, as I'm not quite sure what the HoloViz-y way to implement it is these days. The less users have to think about projections, the better, but in this example, they have to.
Seems like a no-op? I tried running it in a different order to ensure it wasn't cached on the second run, and it's actually faster (I think without it, it internally has to make some guesses about which crs/projection to use). I personally prefer explicit over implicit (Zen of Python).
> The less users have to think about projections, the better, but in this example, they have to.
I think if they are unaware, it leads to confusion, e.g. Jason's comment about `geo=True` breaking it: https://github.com/holoviz-topics/examples/pull/369#discussion_r1531517198
@jtao1, since Maxime and I have differing opinions, feel free to experiment and see which one is more natural to you as a newcomer and go that route since these examples are meant for other newcomers as well.
> @jtao1, since Maxime and I have differing opinions, feel free to experiment and see which one is more natural to you as a newcomer and go that route since these examples are meant for other newcomers as well.
Oh I don't think we have different opinions, since I don't have a strong opinion there; I'm just unsure about which approach to follow. I find it curious that before updating the project to hvPlot, the code never explicitly set `GOOGLE_MERCATOR` (see https://examples.holoviz.org/gallery/gerrymandering/gerrymandering.html), but now we do 🤔 @jtao1 I guess that's a good topic to discuss during the meeting this week.
> the code was never explicitly setting GOOGLE_MERCATOR

From the old example:

```python
hv.Points(df, kdims=['easting', 'northing'], vdims=['race'])
```

The data is already in easting/northing.

```python
tiles = gv.WMTS('https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg')
```

The tiles are already in easting/northing.

The only thing unclear to me was `districts = gv.project(districts)`, which apparently defaults to GOOGLE_MERCATOR.
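For reference, `gv.project` takes an explicit `projection` argument; reusing `districts` from the old snippet above, the call below should be equivalent to the old one-liner since the target projection defaults to Web Mercator (this is just to make the default visible, not a suggested change):

```python
import cartopy.crs as ccrs
import geoviews as gv

# Same as gv.project(districts): the default target projection is
# GOOGLE_MERCATOR, which is why the old code never had to spell it out.
districts = gv.project(districts, projection=ccrs.GOOGLE_MERCATOR)
```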
We talked about Gerrymandering during our weekly meeting today:

- `df.hvplot.points(..., crs=ccrs.GOOGLE_MERCATOR, projection=ccrs.GOOGLE_MERCATOR)` is a pattern that we don't really want to expose, i.e. `df.hvplot.points(...)` is preferred.
- We'll use `project=True` (e.g. `gdf.hvplot.polygons(..., project=True)`) to project the US districts to Web Mercator once. Doing so, GeoViews doesn't need to be imported.
- The example should say a word about `project=True`: the CRS is taken from the data (the `.crs` property of the GeoDataFrame), and the districts are projected upfront with `project=True` to pay the cost of projecting only once.

I have opened https://github.com/holoviz/hvplot/pull/1299 to add a `tiles_opts` parameter to hvPlot, to replace `opts.WMTS(alpha=0.5)` in the old example with `tiles_opts={'alpha': 0.5}` (a rough sketch of that API is below). I also plan to implement https://github.com/holoviz/hvplot/issues/1289 to add support for more tile sources in hvPlot. I think it will be useful for this example: the current background map has a color palette that somewhat mixes with the data color palette (greens for example), and I think we should find a more appropriate background map. These two changes will be released in hvPlot 0.9.3, hopefully quite soon! @jtao1 I'll let you know when it's released so you can update the project (code, project file, and lock file).
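A hedged sketch of what that could look like once `tiles_opts` lands in hvPlot 0.9.3 (the column names and the `rasterize` option are illustrative, and the exact signature may still change in the open PR):

```python
import hvplot.pandas  # noqa: registers the .hvplot accessor

# Style the tile layer inline instead of using opts.WMTS(alpha=0.5).
plot = df.hvplot.points(
    'easting', 'northing',
    tiles=True,
    tiles_opts={'alpha': 0.5},
    rasterize=True,
)
```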
Please also evaluate the performance of interacting (zooming in and out, panning) with the last plot (combined tiles, points, and districts). If it's not great, you can try to rasterize the districts with `<>.hvplot.polygons(..., rasterize=True)` and see if it improves things. Please report back; a GIF would definitely help us gauge the performance of the plot.
Final plot performance video: https://drive.google.com/file/d/1rNqRKkDsaCoC6NM0lBk8R6bIa6ReE7f8/view?usp=sharing
Hey @jtao1, if you drag a video into a comment box, it should upload and render when posted. That's a much better approach than having to host it externally.
https://github.com/holoviz-topics/examples/assets/6613202/cbe5dd45-28e8-4368-ac64-d87858b084df
@jtao1 please make these changes so I can review this PR:
I also spotted a couple of typos so please check the notebook again.
Once you are done, ping me so I can start reviewing.
The notebook is failing to load the Parquet file on the CI with `ValueError: ArrowStringArray requires a PyArrow (chunked) array of large_string type`, though not on my machine. I need to investigate.
Going to dump here some notes about the dask/parquet issue.
I made this project "compliant" with the new infrastructure at the end of 2023. Before that, it used to rely on `fastparquet 0.5.0`. However, in a newer environment with `fastparquet==2023.8.0` I could no longer read the file; instead, I got `DecompressionError: snappy: corrupt input (header mismatch; expected 76668764 decompressed bytes but got 76668681)`. At the time I tried to replace the engine with PyArrow and got another error: `OSError: Unexpected end of stream: Page was smaller (26787590) than expected (26787635)`.
The old file was a 1.36Gb zip archive, with its content dating from May 2017:
I asked Martin for some help and he kindly "fixed" the file to work with the newer version of fastparquet. This is the file currently available on S3.
I re-built an environment with this older version of fastparquet and loaded the dataset (I found the old version on my hard disk):
Note the dtype of `race` is `category`.
Some more info:
The new file is a 1.41Gb zip archive with the same content structure:
With the new file and newer version of fastparquet (so before any change made in this PR):
And some more info:
Note the dtype of `race` is no longer `category` but `object`. Somehow, the categorical dtype was lost in the file and/or library update.
When the project was recently relocked, it pulled the latest version of Dask available on the defaults channel (dask=2024.5.0; before this PR, dask=2023.6.0). A couple of versions earlier, Dask migrated the implementation of `dask.dataframe` to enable query planning, based on a rewrite of `DataFrame` using `dask-expr`. More details in https://docs.dask.org/en/stable/changelog.html#v2024-3-0.
Also, in version 2024.1.0, Dask deprecated the fastparquet engine. More details in https://docs.dask.org/en/stable/changelog.html#v2024-1-0.
So, with this new version of Dask, reading the Parquet file broke with:
```python
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/io/parquet.py:1491, in _set_parquet_engine(engine, meta)
1488 def _set_parquet_engine(engine=None, meta=None):
1489 # Use `engine` or `meta` input to set the parquet engine
1490 if engine == "fastparquet":
-> 1491 raise NotImplementedError("Fastparquet engine is not supported")
1493 if engine is None:
1494 if (
1495 meta is not None and typename(meta).split(".")[0] == "cudf"
1496 ) or dask.config.get("dataframe.backend", "pandas") == "cudf":

NotImplementedError: Fastparquet engine is not supported
```
I then just tried to read the parquet file using the pyarrow engine. It worked locally on my Mac:
However, it failed on the CI and I don't know why:
ValueError: ArrowStringArray requires a PyArrow (chunked) array of large_string type
```python
ValueError Traceback (most recent call last)
Cell In[1], line 1
----> 1 df = dd.read_parquet('./data/census.snappy.parq', engine='pyarrow')
2 df = df.persist()
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/_collection.py:5433, in read_parquet(path, columns, filters, categories, index, storage_options, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, engine, arrow_to_pandas, **kwargs)
5410 raise NotImplementedError(
5411 "engine is not supported when using the pyarrow filesystem."
5412 )
5414 return new_collection(
5415 ReadParquetPyarrowFS(
5416 path,
(...)
5429 )
5430 )
5432 return new_collection(
-> 5433 ReadParquetFSSpec(
5434 path,
5435 columns=_convert_to_list(columns),
5436 filters=filters,
5437 categories=categories,
5438 index=index,
5439 storage_options=storage_options,
5440 calculate_divisions=calculate_divisions,
5441 ignore_metadata_file=ignore_metadata_file,
5442 metadata_task_size=metadata_task_size,
5443 split_row_groups=split_row_groups,
5444 blocksize=blocksize,
5445 aggregate_files=aggregate_files,
5446 parquet_file_extension=parquet_file_extension,
5447 filesystem=filesystem,
5448 engine=_set_parquet_engine(engine),
5449 kwargs=kwargs,
5450 _series=isinstance(columns, str),
5451 )
5452 )
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/_core.py:57, in Expr.__new__(cls, *args, **kwargs)
55 inst = object.__new__(cls)
56 inst.operands = [_unpack_collections(o) for o in operands]
---> 57 _name = inst._name
58 if _name in Expr._instances:
59 return Expr._instances[_name]
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/functools.py:1001, in cached_property.__get__(self, instance, owner)
999 val = cache.get(self.attrname, _NOT_FOUND)
1000 if val is _NOT_FOUND:
-> 1001 val = self.func(instance)
1002 try:
1003 cache[self.attrname] = val
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/io/parquet.py:776, in ReadParquet._name(self)
770 @cached_property
771 def _name(self):
772 return (
773 self._funcname
774 + "-"
775 + _tokenize_deterministic(
--> 776 funcname(type(self)), self.checksum, *self.operands[:-1]
777 )
778 )
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/io/parquet.py:782, in ReadParquet.checksum(self)
780 @property
781 def checksum(self):
--> 782 return self._dataset_info["checksum"]
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask_expr/io/parquet.py:1375, in ReadParquetFSSpec._dataset_info(self)
1372 dataset_info["checksum"] = tokenize(checksum)
1374 # Infer meta, accounting for index and columns arguments.
-> 1375 meta = self.engine._create_dd_meta(dataset_info)
1376 index = dataset_info["index"]
1377 index = [index] if isinstance(index, str) else index
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:1215, in ArrowDatasetEngine._create_dd_meta(cls, dataset_info)
1213 convert_string = dataset_info["kwargs"]["convert_string"]
1214 dtype_backend = dataset_info["kwargs"]["dtype_backend"]
-> 1215 meta = cls._arrow_table_to_pandas(
1216 schema.empty_table(),
1217 categories,
1218 arrow_to_pandas=arrow_to_pandas,
1219 dtype_backend=dtype_backend,
1220 convert_string=convert_string,
1221 )
1222 index_names = list(meta.index.names)
1223 column_names = list(meta.columns)
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:1878, in ArrowDatasetEngine._arrow_table_to_pandas(cls, arrow_table, categories, dtype_backend, convert_string, **kwargs)
1875 if types_mapper is not None:
1876 _kwargs["types_mapper"] = types_mapper
-> 1878 res = arrow_table.to_pandas(categories=categories, **_kwargs)
1879 # TODO: remove this when fixed in pyarrow: https://github.com/apache/arrow/issues/34283
1880 if (
1881 convert_string
1882 and isinstance(res.index, pd.Index)
(...)
1886 not in (pd.StringDtype("pyarrow"), pd.ArrowDtype(pa.string()))
1887 ):
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/array.pxi:884, in pyarrow.lib._PandasConvertible.to_pandas()
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/table.pxi:4192, in pyarrow.lib.Table._to_pandas()
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/pandas_compat.py:776, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
774 _check_data_column_metadata_consistency(all_columns)
775 columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 776 blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
778 axes = [columns, index]
779 mgr = BlockManager(blocks, axes)
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/pandas_compat.py:1131, in _table_to_blocks(options, block_table, categories, extension_columns)
1128 columns = block_table.column_names
1129 result = pa.lib.table_to_blocks(options, block_table, categories,
1130 list(extension_columns.keys()))
-> 1131 return [_reconstruct_block(item, columns, extension_columns)
1132 for item in result]
File ~/work/examples/examples/gerrymandering/envs/default/lib/python3.11/site-packages/pyarrow/pandas_compat.py:1131, in
```
I was able to reproduce the traceback observed on the CI by explicitly setting `categories` with:
```python
df = dd.read_parquet('./data/census.snappy.parq', engine='pyarrow', categories=['race'])
```
Also note that the dtype of `race` is now `string[pyarrow]` and no longer `object`. I'm not sure to what extent hvPlot/HoloViews is fine with handling this pyarrow dtype.
https://github.com/holoviz/datashader/issues/1202 and https://github.com/holoviz/datashader/pull/1239 have some interesting information about categorical handling in dask / datashader.
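If the pyarrow-backed string dtype does trip up the plotting stack, one possible workaround (a sketch, not what the notebook currently does) is to cast the column back to a categorical after loading, since Datashader's categorical aggregations expect a `category` dtype with known categories on Dask DataFrames:

```python
import dask.dataframe as dd

df = dd.read_parquet('./data/census.snappy.parq', engine='pyarrow')
# astype('category') yields unknown categories on a dask Series;
# cat.as_known() computes them so Datashader can aggregate by category.
df['race'] = df['race'].astype('category').cat.as_known()
```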
Ok so I ended up keeping `pyarrow` as the engine but adding this before the imports:
```python
import dask

dask.config.set({"dataframe.convert-string": False})
dask.config.set({"dataframe.query-planning": False})
```
HoloViews does that too in its test suite, meaning that there isn't yet "official" support for these two features (query planning and pyarrow strings): https://github.com/holoviz/holoviews/blob/6b0121d5a3685989fca58a1687961523a5fd575c/holoviews/tests/conftest.py#L61-L62
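For completeness, a sketch of how the load ends up looking with those settings; the config lines have to run before `dask.dataframe` is imported (which is why they go before the imports), and the `persist` call mirrors the old example:

```python
import dask

# Must be set before dask.dataframe is imported, otherwise the legacy
# (non query-planning) implementation won't be picked up.
dask.config.set({"dataframe.convert-string": False})
dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd

df = dd.read_parquet('./data/census.snappy.parq', engine='pyarrow')
df = df.persist()
```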
I created a new branch that includes the changes and the starter code from Maxime.