holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

dynspread/spread not working for datashaded plots with aggregator=ds.by(column, ds.any()) #1023

Open Noskario opened 2 years ago

Noskario commented 2 years ago

software version info

numpy 1.20.3, pandas 1.3.3, bokeh 2.3.3, holoviews 1.14.6, datashader 0.13.0

Description of expected behavior and the observed behavior

I want to make a scatterplot (colored by category) in which sparse points stay visible even when other regions have much higher density. I think ds.any() should be the way to go in this case. Unfortunately, when I use dynspread on this plot, the points disappear and the whole plot that datashader produces gets a strange background color. (Interestingly, this color is not always the same...)

Have a look at the following example:

Complete, minimal, self-contained example code that reproduces the issue

import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
import datashader as ds
from datashader.colors import Sets1to3
from holoviews.operation.datashader import datashade,dynspread,spread

# np.nan is used instead of the np.NaN alias, which was removed in NumPy 2.0
raw_data = [('Alice', 60, 'London', 5),
            ('Bob', 14, 'Delhi', 7),
            ('Charlie', 66, np.nan, 11),
            ('Dave', np.nan, 'Delhi', 15),
            ('Eveline', 33, 'Delhi', 4),
            ('Fred', 32, 'New York', np.nan),
            ('George', 95, 'Paris', 11)
            ]
# Create a DataFrame object
df = pd.DataFrame(raw_data, columns=['Name', 'Age', 'City', 'Experience'])
df['City']=pd.Categorical(df['City'])

x='Age'
y='Experience'
color='City'
cats=df[color].cat.categories

# Make dummy-points (currently the only way to make a legend: https://holoviews.org/user_guide/Large_Data.html)
color_key=[(name,color) for name, color in zip(cats,Sets1to3)]
color_points = hv.NdOverlay({n: hv.Points([0,0], label=str(n)).opts(color=c,size=0) for n,c in color_key})

# Create the plot with datashader
points=hv.Points(df, [x, y],label="%s vs %s" % (x, y),)#.redim.range(Age=(0,90), Experience=(0,14))
datashaded1=datashade(points,aggregator=ds.by(color)).opts(width=550, height=480)
datashaded2=datashade(points,aggregator=ds.by(color,ds.any())).opts(width=550, height=480)

dynspread(datashaded1)*color_points+dynspread(datashaded2)*color_points
# spread(datashade(points,aggregator=ds.by(color,ds.any())).opts(width=550, height=480))*color_points

We get the following result:

[Screenshot: example_for_comparing]

jbednar commented 2 years ago

Thanks for filing the issue! I can verify that I get the same result, but don't yet know what the cause is.

jbednar commented 2 years ago

I'm still not quite sure why this behavior happens, but the upcoming Datashader release, currently installable as a dev version using conda install -c pyviz/label/dev, should have two new features relevant to it:

Also, instead of any version of any (no pun intended), you can consider using min_alpha=128 (or similar), which should ensure that even isolated points are fully visible, at the expense of a lower dynamic range to indicate point density.