Open Azaya89 opened 6 months ago
@Azaya89 thanks for this very nice bug report!
Could you also report the timings for the equivalent of this line, `gdf.hvplot.polygons(tiles='CartoLight', rasterize=True)`, but with SpatialPandas? I would like to know how much slower things got.
Here it is.
Oh I'm pretty sure it definitely takes more for the plot to render than 542 us. Can you get an estimate of the real time it takes for the plot to render? Also, I notice another difference is that the hvPlot call uses `rasterize` while in the last snippet it uses `datashade`.
> Oh I'm pretty sure it definitely takes more for the plot to render than 542 us. Can you get an estimate of the real time it takes for the plot to render?

How do you mean? Is it different from using `timeit` on the plot objects?
> How do you mean? Is it different from using `timeit` on the plot objects?

Yes, the render time is different. Feel free to just give a rough estimate of when the plot is done rendering on screen.
> Yes, the render time is different. Feel free to just give a rough estimate of when the plot is done rendering on screen.

OK, I timed it myself and it took ~5 secs to run this cell and render the plots: `tiles * shaded * legend * hover`
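
The gap between the `timeit` figure and the on-screen time can be illustrated with a small sketch. It assumes nothing about hvPlot internals: `LazyPlot` and its `render` method are hypothetical stand-ins for a plot object whose construction only records what to draw, while the expensive work happens at display time.

```python
import time


class LazyPlot:
    """Hypothetical stand-in for a declarative plot object."""

    def __init__(self, n_rows):
        # Construction only stores parameters; no drawing happens here.
        self.n_rows = n_rows

    def render(self):
        # Simulate the expensive aggregation/drawing step.
        time.sleep(0.05)
        return f"rendered {self.n_rows} rows"


def timed(func):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Timing the constructor measures almost nothing...
plot, build_time = timed(lambda: LazyPlot(1_000_000))
# ...while timing the explicit render step captures the real cost.
output, render_time = timed(plot.render)

print(f"construct: {build_time:.6f} s, render: {render_time:.6f} s")
```

With HoloViews objects, one option along the same lines is to time an explicit `hv.render(plot)` call rather than the expression that builds the object; a stopwatch estimate of the on-screen time, as done above, is still the most direct measure of what the user experiences.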
Can you try to run this with HoloViews 1.19.0a2?
> Can you try to run this with HoloViews 1.19.0a2?

I did and it ran significantly faster. However, adding the other parameters still caused the same error as before.
Coming back to the issue reported with:
gdf.head(1000).hvplot.polygons(tiles='CartoLight', rasterize=True, c='type', cmap=color_key)
@Azaya89 you haven't shared how `color_key` is derived in your example. Focusing on this bit of code from the NYC Buildings example:
```python
cats = list(ddf.type.value_counts().compute().iloc[:10].index.values) + ['unknown']
ddf['type'] = ddf.type.replace({None: 'unknown'})
ddf = ddf[ddf.type.isin(cats)]
ddf['type'] = ddf['type'].astype('category').cat.as_known()
```
There, `ddf` is a `spatialpandas.geodataframe.GeoDataFrame` object. In your updated code, `gdf` is a `geopandas.GeoDataFrame` object. The latter has a `type` property that returns the geometry type of each geometry in the GeoSeries. Therefore, to access the column data you need to use `__getitem__` / the `[]` syntax with `gdf['type']`.
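
To see why attribute access and bracket access can return different things, here is a minimal sketch of the lookup rule in plain Python. `Frame` is a hypothetical toy class, not the geopandas implementation: it only mimics the pattern where attribute access falls back to columns when no real attribute of that name exists.

```python
class Frame:
    """Hypothetical toy table: a dict of columns with attribute fallback."""

    def __init__(self, columns):
        self._columns = dict(columns)

    @property
    def type(self):
        # A real property on the class, analogous to a built-in accessor.
        return "geometry types"

    def __getitem__(self, name):
        # Bracket access always looks up the column.
        return self._columns[name]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None


frame = Frame({"type": ["house", "garage"], "height": [3, 2]})

print(frame.height)    # no attribute named height, so the column is returned
print(frame.type)      # the property wins over the column
print(frame["type"])   # bracket access reaches the column
```

Because the property is found by normal attribute lookup, the column fallback never runs for `type`, which is why bracket access is the reliable way to reach a column whose name collides with an existing attribute.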
Then, to get the plot displayed I had to add `aggregator='count_cat'` to `c='type'`. Somehow, I would have expected setting these two parameters to be identical to setting `by='type'`, but it didn't work. Something to discuss I guess.
So here's the full code:
```python
import colorcet as cc
import datashader as ds
import geopandas as gpd
import hvplot.pandas

gdf = gpd.read_parquet('new_nyc_buildings.parq')

cats = list(gdf['type'].value_counts().iloc[:10].index.values) + ['unknown']
gdf['type'] = gdf['type'].replace({None: 'unknown'})
gdf = gdf[gdf['type'].isin(cats)]

colors = cc.glasbey_bw_minc_20_maxl_70
color_key = {cat: tuple(int(e*255.) for e in colors[i]) for i, cat in enumerate(cats)}

gdf.hvplot.polygons(
    tiles='CartoLight', data_aspect=1,
    datashade=True, aggregator=ds.by('type'), cmap=color_key
)
```
Alternatively, this would also work:
```python
gdf.hvplot.polygons(
    tiles='CartoLight', data_aspect=1,
    datashade=True, aggregator='count_cat', c='type', cmap=color_key
)
```
In the NYC Buildings example another aggregator is used with `ds.by('type', ds.any())`. If I try to use that instead of `ds.by('type')` I get `ValueError: input must be categorical`. Turning the `type` column into a categorical one fixes that, and the plot looks much closer to the plot currently displayed in the example:
```python
gdf['type'] = gdf['type'].astype('category')

gdf.hvplot.polygons(
    tiles='CartoLight', data_aspect=1,
    datashade=True, aggregator=ds.by('type', ds.any()), cmap=color_key
)
```
Finally, setting `rasterize=True` instead of `datashade=True` generates a plot that is far from what we'd expect:
And indeed, taking another simpler example and using HoloViews only, I can see that the output of a `rasterize` operation is not the expected one.
```python
import geopandas as gpd
import geodatasets
import datashader as ds
import holoviews as hv
import spatialpandas as spd
from holoviews.operation.datashader import datashade, rasterize

hv.extension('bokeh')

path = geodatasets.get_path("geoda.nyc_neighborhoods")
nyc = gpd.read_file(path)
nyc['boroname'] = nyc['boroname'].astype('category')  # required
spd_nyc = spd.GeoDataFrame(nyc)

polys = hv.Polygons(spd_nyc, vdims='boroname')
shaded = datashade(polys, aggregator=ds.by('boroname', ds.any()))
rasterized = rasterize(polys, aggregator=ds.by('boroname', ds.any()))
polys + shaded + rasterized
```
> @Azaya89 you haven't shared how `color_key` is derived in your example.

Here's how it was constructed:
```python
colors = cc.glasbey_bw_minc_20_maxl_70
color_key = {cat: tuple(int(e*255.) for e in colors[i]) for i, cat in enumerate(cats)}
```
The same as the one you shared.
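
As a sketch of what that comprehension does: it pairs each category with one palette entry and rescales the color from floats in [0, 1] to integers in [0, 255]. The palette and category names below are made-up placeholders, assuming the colorcet palette is a list of RGB float triples, as the `e*255.` scaling implies.

```python
# Placeholder palette: RGB triples with components in [0, 1].
colors = [
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 1.0],
    [0.25, 0.25, 0.25],
]
# Placeholder category names.
cats = ["house", "garage", "unknown"]

# Map each category to an (R, G, B) tuple of integers in [0, 255].
# int() truncates, so 0.5 * 255 = 127.5 becomes 127.
color_key = {
    cat: tuple(int(e * 255.0) for e in colors[i])
    for i, cat in enumerate(cats)
}

print(color_key)
```

The mapping relies on `cats` being no longer than the palette, since each category takes the palette entry at its own index.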
Thank you for this. I have incorporated it into my PR now, although I'm now curious why `rasterize` and `datashade` are giving very different outputs...
@Azaya89 what was important is how you computed `cats`.
> although I'm now curious why `rasterize` and `datashade` are giving very different outputs...

This is a bug in HoloViews, I think; I need to open a bug report.
> @Azaya89 what was important is how you computed `cats`.

OK. Here:
cats = list(gdf['type'].value_counts().iloc[:10].index) + ['unknown']
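
The steps behind `cats` (count the values, keep the most frequent ones, add an `'unknown'` bucket for missing entries, then filter) can be sketched without pandas. This uses `collections.Counter` as a stand-in for `value_counts()`; the sample values and the `top_n` name are illustrative assumptions, and `top_n` is 2 here only to keep the data small.

```python
from collections import Counter

# Illustrative building types; None marks a missing value.
types = ["house", "garage", "house", None, "shed", "house", "garage", None]
top_n = 2  # the thread uses 10

# Count non-missing values, mirroring value_counts() skipping missing data.
counts = Counter(t for t in types if t is not None)
cats = [name for name, _ in counts.most_common(top_n)] + ["unknown"]

# Replace missing values, then keep only rows whose type is in cats.
filled = ["unknown" if t is None else t for t in types]
kept = [t for t in filled if t in cats]

print(cats)
print(kept)
```

Rows with a rarer type (here `"shed"`) are dropped, while rows with a missing type survive under the `'unknown'` label.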
ALL software version info
Description of expected behavior and the observed behavior
As part of the NumFOCUS SDG, I am modernizing the nyc_building example on the examples website to use the latest APIs. This example also involves using `geopandas` instead of `spatialpandas` to read the data stored as a parquet (`.parq`) file.

Switching from `spatialpandas` to `geopandas` for reading the file was not straightforward. The `geometry` column in the data file was not recognized by `geopandas`. To address this, I read the file using `spatialpandas`, converted it to a pandas DataFrame using `.compute()`, and then transformed it into a `geopandas` DataFrame. The `geometry` column was converted to a `shapely` object using a custom function. Finally, I saved this new DataFrame as a `.parq` file, which was then read directly using `geopandas`.

Code:

Despite these adjustments, plotting the new data file proved challenging. Using `hvPlot.polygons` to plot the entire dataset takes over 5 minutes to run with minimal code:

Testing with a small sample of the data (`gdf.head(1000)`) produced results in a reasonable time, suggesting that `hvPlot` may have difficulties handling the large dataset (over 1 million rows).

Additionally, plotting with extra parameters, such as adding color mapping to the `type` categories in the data, results in a Traceback error.

Complete, Minimal, Self-Contained Example Code that Reproduces the Issue
Stack traceback and/or browser JavaScript console output
Screenshots or screencasts of the bug in action