Open maximlt opened 4 months ago
Great write-up!
'hvsampledata' would make it a bit clearer that this is not data about any HoloViz thing, but a sample dataset to work with alongside HoloViz things. Although it's more of a mouthful, 'hvsampledata' would show up after hvplot in your editor :)
I like hvsampledata too.
I'm wondering about the API options:
df = hvplot.datasets.penguins()
df = hvplot.datasets.get_penguins()
df = hvplot.datasets.penguins # property
df = hvplot.datasets.penguins.load()
df = hvplot.datasets.load("penguins")
df = hvplot.datasets.use("penguins").load() # consider lazy load for some example use cases
I'd also like to encourage discoverability within the IDE with autocomplete, e.g. hvplot.datasets.penguins() is easier to discover than hvplot.datasets.load("penguins").
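For illustration, a minimal sketch of what the callable-style API could look like. All names here (the registry helper, the inline CSV) are assumptions for this example; the real package would ship or download actual data files:

```python
import io

import pandas as pd

# Tiny inline stand-in for a real packaged CSV file (illustrative only).
_PENGUINS_CSV = """species,island,bill_length_mm
Adelie,Torgersen,39.1
Gentoo,Biscoe,46.1
"""

_REGISTRY = {}

def _register(fn):
    """Collect dataset loaders so both tab-completion and load("name") work."""
    _REGISTRY[fn.__name__] = fn
    return fn

@_register
def penguins():
    """Palmer penguins sample (the docstring can link to the dataset page)."""
    return pd.read_csv(io.StringIO(_PENGUINS_CSV))

def load(name):
    """String-based access for programmatic use, e.g. load("penguins")."""
    return _REGISTRY[name]()
```

The callable form keeps autocompletion and gives each dataset a docstring, while the registry still allows string-based lookup for the `load("penguins")` style.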
I am 100% in favor of streamlining and simplifying how we access datasets. However, I'm not sure how the proposed datasets package would work, given the size of the datasets we use, e.g. for Datashader examples. Are you proposing that all the datasets would actually be stored within a single conda package? Seems like that would be an impractically large conda package. Maybe you could list the number and size of the datasets you'd expect to be included there?
Also, adding a new package is problematic because it would then need to go onto Anaconda Defaults for a defaults user to be able to get the examples, presumably? HoloViz tools in general are on defaults, which means anything they require also needs to be there. I don't think defaults is going to want a huge package, nor to add another package without a good reason.
Maybe we could split the difference and have a package on conda-forge but then have HoloViews have a function like the one in Bokeh, first checking if the datasets package is installed and if not, downloading that data directly from the internet?
We could use .csv.gz instead of .csv, or similar. To be honest, I actually prefer a website describing a set of datasets and their URLs instead of a package. When I was a newbie I found those packages a bit like hocus pocus, probably because in the old days there was no description of the dataset and no type annotations, i.e. you did not really know what you got.
@jbednar, the small datasets will be in the package, but larger datasets will need to be downloaded and cached.
Also, a +1 on hvsampledata
My opinion is that the package should have no dependencies at all, and let it be up to the other packages to specify what they need along with hvsampledata. Currently, pandas is a dependency of most of our libraries, but this could change in the future. An option could also be to have a backend keyword, which has a default depending on what library you have installed, with the option to specify it if needed: hvplot.datasets.penguins(backend="polars").
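A backend keyword could be dispatched lazily, so polars is only imported when explicitly requested and stays optional. A rough sketch (the function name and inline data are illustrative assumptions):

```python
import io

import pandas as pd

# Tiny inline stand-in for a real packaged dataset file.
_CSV = "x,y\n1,2\n3,4\n"

def penguins(backend=None):
    """Return the dataset with the requested DataFrame library.

    Hypothetical sketch: defaults to pandas; polars is imported
    lazily so it remains an optional dependency.
    """
    if backend in (None, "pandas"):
        return pd.read_csv(io.StringIO(_CSV))
    if backend == "polars":
        import polars as pl  # only needed when explicitly requested
        return pl.read_csv(io.StringIO(_CSV))
    raise ValueError(f"unknown backend: {backend!r}")
```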
I like hvsampledata too.
However, I'm not sure how the proposed datasets package would work, given the size of the datasets we use, e.g. for Datashader examples. Are you proposing that all the datasets would actually be stored within a single conda package? Seems like that would be an impractically large conda package.
I'm thinking about shipping the typically small datasets that are used on the website of hvPlot or Panel, like iris, penguins, gapminders, air_temperature, etc. But the package can expose larger datasets that will be downloaded on the fly, and cached.
Maybe you could list the number and size of the datasets you'd expect to be included there?
Yes this list is needed. Hm maybe each project maintainer could come up with their wishlist?
Also, adding a new package is problematic because it would then need to go onto Anaconda Defaults for a defaults user to be able to get the examples, presumably? HoloViz tools in general are on defaults, which means anything they require also needs to be there. I don't think defaults is going to want a huge package, nor to add another package without a good reason.
For what it's worth, vega_datasets is on defaults (https://anaconda.org/anaconda/vega_datasets). I hope it's not going to be too hard to convince them to add a new package that doesn't require much maintenance. Most HoloViz users are anyway getting their packages from PyPI or conda-forge; we can serve those without a problem.
For the larger datasets, hvsampledata would first try to load the large dataset from a dedicated large-data package, and fall back to downloading it from the web if that package is not installed. It would be an optional dependency that would help air-gapped users.

hvsampledata could also expose functions that return these dummy datasets, for us to avoid having to copy/paste this code in multiple places:

import numpy as np
import pandas as pd

num = 10000
np.random.seed(1)
# Five clustered 2D normal distributions with different spreads.
dists = {
    cat: pd.DataFrame({
        "x": np.random.normal(x, s, num),
        "y": np.random.normal(y, s, num),
        "val": val,
        "cat": cat,
    })
    for x, y, s, val, cat in [
        ( 2,  2, 0.03, 10, "d1"),
        ( 2, -2, 0.10, 20, "d2"),
        (-2, -2, 0.50, 30, "d3"),
        (-2,  2, 1.00, 40, "d4"),
        ( 0,  0, 3.00, 50, "d5"),
    ]
}
df = pd.concat(dists, ignore_index=True)
df["cat"] = df["cat"].astype("category")
To be honest I actually prefer a web site describing a set of datasets and their urls instead of a package.
Bokeh does a pretty good job at showing what the datasets contain: https://docs.bokeh.org/en/latest/docs/reference/sampledata.html. By making the datasets available through a callable, we can add in their docstring a link to the site so you can quickly have access to more information.
Currently, pandas is a dependency of most of our libraries, but this could change in the future. An option could also be to have a backend keyword, which has a default depending on what library you have installed, and add the option to specify it if needed: hvplot.datasets.penguins(backend="polars").
Yep, that's why I like to make the datasets available through a callable, it gives us some more flexibility.
Thanks for the excellent writeup @maximlt!
My opinion is that the packages should have no dependencies at all. And let it be up to the other packages to specify the packages needed
I agree with @Hoxbro on both counts: I think this package should be free to serve any format and any package using it should decide which ones they care about (which will probably be listed as dependencies already anyway).
I'm thinking about shipping the typically small datasets that are used on the website of hvPlot or Panel, like iris, penguins, gapminders, air_temperature, etc. But the package can expose larger datasets that will be downloaded on the fly, and cached.
Makes sense to me!
if we really want, we could have another package that ships larger datasets. hvsampledata would first try to load the large dataset from this package, and fall back to downloading it from the web if the large data package is not installed. It would be an optional dependency that would help air-gapped users.
Yes, good idea. This is an optional extension we should plan for even if we have trouble publishing such a large package. Note that maybe we don't necessarily need to publish a large data package: we could get the package to air-gapped users some other way if they need it.
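The fallback could be as simple as attempting the optional import first. A sketch, where hvsampledata_large is a hypothetical companion package name and the download fallback is passed in:

```python
import importlib

def load_large(name, fallback):
    """Prefer the optional large-data package if installed;
    otherwise call the download fallback for air-gapped-friendly behavior.

    Hypothetical sketch: hvsampledata_large and its load() are assumptions.
    """
    try:
        pkg = importlib.import_module("hvsampledata_large")
    except ImportError:
        return fallback(name)
    return pkg.load(name)
```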
As for the naming, I hate all the names so I won't get involved. :-)
I suppose a long and ugly (but descriptive) name like hvsampledata is probably going to be the best we can do.
+1 on hvsampledata
+5 on hvsampledata
I initially had a concern that this doesn't say "holoviz", so that people looking at it in their environments or in the packages included in Anaconda Distribution will have no idea what it might be (whereas vega_datasets is clearly related to vega).

But I think someone pointed out above that it would sort next to hvplot, which will nearly always be in the packages list since (a) conda defaults includes hvplot, and (b) we steer nearly everyone to install hvplot rather than holoviews. So my guess is if they see hvplot and hvsampledata they might well guess that this is to do with hvplot.
The alternatives I'd imagine are hvdatasets (maybe no better than hvsampledata) and a name that doesn't have hv in it at all, documenting it as a handy collection of sample data for any project to use, not just HoloViz.
a name that doesn't have hv in it at all, documenting it as a handy collection of sample data for any project to use, not just HoloViz.
I wouldn't be so much in favor of that, to avoid increasing our maintenance burden. Besides that, it sounds like you're not against hvsampledata, even if ideally you'd prefer if holoviz was in the name (holoviz_sampledata?). We can decide on the name in one of the next HoloViz meetings.
Next, I'll submit a list of the most used datasets currently across the HoloViz websites, to figure out which lightweight packages we should expose.
So after some not-perfect regex + pandas, here's the breakdown per package of the types/name of the datasets used:
geoviews:
type target
bokeh.sampledata airport_routes 4
pandas '../../assets/cities.csv', 4
xarray.tutorial rasm 4
pandas '../../assets/referendum.csv')\n" 2
'../assets/referendum.csv')\n" 2
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
holoviews:
type target
bokeh.sampledata iris 10
autompg 7
pandas 'http://assets.holoviews.org/macro.csv', 7
bokeh.sampledata airport_routes 6
stocks 6
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
hvplot:
type target
xarray.tutorial air_temperature 24
bokeh.sampledata penguins 12
iris 8
autompg 8
hvplot us_crime 5
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
lumen:
type target
pandas cache_path, 1
path), 1
url).sample(5).reset_index(drop=True).to_csv( 1
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
panel:
type target
bokeh.sampledata autompg 11
population 7
pandas "https://assets.holoviz.org/panel/tutorials/turbines.csv.gz" 7
bokeh.sampledata airport_routes 3
datasets.holoviz 'https://datasets.holoviz.org/penguins/v1/penguins.csv' 3
Name: count, dtype: int64
----------------------------------------------------------------------------------------------------
And the 15 most used datasets:
type target
xarray.tutorial air_temperature 26
bokeh.sampledata autompg 26
iris 21
airport_routes 14
stocks 13
penguins 12
population 7
pandas 'http://assets.holoviews.org/macro.csv', 7
"https://assets.holoviz.org/panel/tutorials/turbines.csv.gz" 7
xarray.tutorial rasm 7
bokeh.sampledata movies_data 5
unemployment 5
us_counties 5
hvplot us_crime 5
bokeh.sampledata periodic_table 4
Pretty sure the analysis isn't perfect and is somewhat missing/not grouping the files uploaded on our S3. I am sure the penguins and windturbines datasets on S3 are used enough times that they should appear in the results above. Anyway, I think it's enough to get started and discuss which datasets should be added to hvsampledata.
Supersedes https://github.com/holoviz/hvplot/issues/1274
Context
The HoloViz tools are very much data-oriented; pretty much every example across the board starts by either loading a dataset or creating one on the fly. Loading datasets is done in various ways:

- bokeh.sampledata: Bokeh ships some data and downloads some others. I don't know when we rely on the latter, but for sure we run bokeh sampledata quite often on the CI (to download these non-shipped datasets).
- xarray.tutorial.open_dataset
- hvPlot's sample_data module, which exposes an Intake catalog and its datasets. It needs the optional deps intake, intake_parquet, intake_xarray and s3fs. The catalog is shipped with hvPlot but the datasets are fetched from the internet.
- https://datasets.holoviz.org (e.g. https://datasets.holoviz.org/penguins/v1/penguins.csv)
- Files in the repositories (e.g. occupancy.csv in panel/examples/assets)

This isn't great.
There has to be a better way :) We started to discuss how to improve the situation by creating a new HoloViz package dedicated to giving a unified and easier approach to loading datasets.
How others do it
Bokeh
Bokeh has a sampledata sub-package that gives access to its datasets. They ship 24 files (CSVs mostly). More datasets are available after being downloaded via either the CLI bokeh sampledata or the Python API bokeh.sampledata.download(). Downloaded files are stored in $HOME/.bokeh/data (it can be configured). Datasets are in their sub-modules, sometimes grouped.

Plotly
Plotly has a data sub-package that gives access to 11 datasets. They are all csv.gz files and are all shipped.

Altair
The vega-datasets package contains 17 shipped datasets and gives access to many more datasets that are downloaded from vega's CDN. Downloaded datasets are not cached. vega-datasets is not a dependency of altair; it has Pandas as its unique direct dependency. Its code and API are a little more elaborate.

Matplotlib
Matplotlib ships with about 15 datasets. The API (matplotlib.cbook.get_sample_data) is quite rudimentary.
Seaborn
Seaborn doesn't ship any datasets. Instead it offers the high-level load_dataset function that fetches files (CSVs mostly) from the GitHub repository https://github.com/mwaskom/seaborn-data. Datasets are cached by default once downloaded. The cache location can be accessed via sns.get_data_home() and controlled by passing an argument to load_dataset or setting the env var SEABORN_DATA.

Xarray
Xarray has implemented an API similar to seaborn's, available in the xarray.tutorial module. The main difference with seaborn is that running open/load_dataset requires pooch, which can be installed with pip install xarray[io] along with other deps like netCDF4, zarr or fsspec. Pooch itself depends on platformdirs, packaging and requests. xarray.tutorial also has the scatter_example_dataset function that just generates a Dataset using Numpy functions.

Proposal
Name
Simon already reserved hvdata on PyPI, though I'm not sure we reached an agreement on the name of this package? I think Andrew called it hvdataset in his original issue (https://github.com/holoviz/hvplot/issues/1274). I find hvdata a little too close to hvplot in spirit and not super explicit. I prefer hvdatasets, even if I'm a bit annoyed it'll show up before hvplot in my editor.

API
Considering all the approaches above, I think my favorite API is when the datasets are available via a callable (plotly, vega-datasets) that returns a data object, as this enables autocompletion and allows us to augment their signature with optional parameters.
I also like what Plotly does by providing a unique interface to their datasets for Plotly and Plotly express.
Datasets
The package will ship a list (to be determined) of small datasets. They should be as small as possible. We need to watch the overall package size, setting a limit upfront would be a good idea.
Other datasets can be downloaded on the fly. They're cached.
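The download-and-cache logic can stay dependency-free with just the standard library. A sketch, where the cache location and env var name are assumptions (the real package might use platformdirs instead):

```python
import os
import urllib.request
from pathlib import Path

# Hypothetical cache location, overridable via an env var (names are assumptions).
CACHE_DIR = Path(os.environ.get("HVSAMPLEDATA_CACHE",
                                Path.home() / ".cache" / "hvsampledata"))

def fetch(name, url):
    """Download `url` into the cache once; later calls reuse the cached file."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / name
    if not path.exists():
        # Download to a temp name first so interrupted downloads
        # never leave a partial file under the final name.
        tmp = path.with_name(path.name + ".part")
        urllib.request.urlretrieve(url, tmp)
        tmp.rename(path)
    return path
```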
Dependencies
I'm not sure. None, or maybe just pandas? platformdirs is good for finding the right cache paths on various platforms, but I think we can do without it. The package will have optional dependencies that will be required for its test suite to pass (e.g. xarray).

If we want to increase the likelihood for users to be able to run code snippets without getting any errors, then the package should be added as a direct dependency to the HoloViz packages that use it? After all, Plotly, Bokeh and Matplotlib all ship some datasets with their package (and others like plotnine, and possibly many more, do too), so why not us?

Unfortunately, I don't think there's a way yet in the pip world to define default extras. But if there was, I'd go with that option, making a hypothetical default datasets extra, with an option for users not to install it (e.g. to reduce their environment size). If we don't make it a direct dependency of some packages, then their documentation will have to explain that it needs to be installed (either directly or via a new extra or the existing recommended one).
Documentation
The package will have a simple website built with the usual Sphinx stack. It'll list all the datasets it contains, with a description and their license.